Editor’s Picks
Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.
[1] Focus Then Listen: Exploring Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models
Han Yin, Yang Xiao, Younghoo Kwon, Ting Dang, Jung-Woo Choi
Main category: cs.SD
TL;DR: FTL is a plug-and-play audio enhancer that improves noise robustness of Large Audio Language Models by separating speech/non-speech, routing based on instructions, and generating task-adaptive enhanced signals.
Details
Motivation: Existing Large Audio Language Models degrade significantly in real-world noisy conditions where speech and non-speech sounds interfere. While noise-aware fine-tuning can help, it requires task-specific noisy data and expensive retraining, limiting scalability.Method: Proposes Focus-Then-Listen (FTL): 1) Separates input waveform into speech and non-speech components, 2) Uses a modality router to predict target audio modality based on user instruction, 3) Applies a modality-aware fusion block to generate task-adaptive enhanced signal for improved downstream perception and reasoning.
Result: Experiments across multiple LALMs and tasks show that FTL improves performance across different noise levels without requiring fine-tuning on the LALMs themselves.
Conclusion: FTL provides an effective plug-and-play solution for improving noise robustness in audio language models without the need for expensive retraining or task-specific noisy data.
Abstract: Large audio language models (LALMs) are a class of foundation models for audio understanding. Existing LALMs tend to degrade significantly in real-world noisy acoustic conditions where speech and non-speech sounds interfere. While noise-aware fine-tuning can improve robustness, it requires task-specific noisy data and expensive retraining, limiting scalability. To address this issue, we propose Focus-Then-Listen (FTL), a plug-and-play audio enhancer that improves LALMs’ noise robustness. Specifically, FTL first separates the input waveform into speech and non-speech, and a modality router is applied to predict the target audio modality (e.g., speech) based on the user’s instruction. Finally, a modality-aware fusion block generates a task-adaptive enhanced signal for improved downstream perception and reasoning. Experiments across multiple LALMs and tasks show that FTL improves performance across different noise levels without fine-tuning on LALMs.
Relevance: 9/10
[2] PolyBench: A Benchmark for Compositional Reasoning in Polyphonic Audio
Yuanjian Chen, Yang Xiao, Han Yin, Xubo Liu, Jinjie Huang, Ting Dang
Main category: eess.AS
TL;DR: PolyBench is a new benchmark for evaluating compositional reasoning in polyphonic audio, testing LALMs on multiple concurrent sound events and their relations.
Details
Motivation: Current LALMs show limited capability in reasoning over polyphonic audio where multiple sound events co-occur and create compositional structure, and existing benchmarks don't adequately cover this aspect.Method: Introduces PolyBench with five evaluation subsets: counting, classification, detection, concurrency, and duration estimation, requiring reasoning over multiple concurrent events and their relations.
Result: Evaluation of state-of-the-art LALMs reveals consistent performance degradation in polyphonic audio, indicating a fundamental bottleneck in current models.
Conclusion: PolyBench identifies a critical weakness in current LALMs for compositional reasoning in polyphonic audio, providing a benchmark to drive future improvements in audio understanding.
Abstract: Large Audio Language Models (LALMs) are increasingly capable of reasoning over audio. However, existing benchmarks provide limited coverage of reasoning in polyphonic audio, where multiple sound events co-occur and induce compositional structure. In this work, we introduce PolyBench, a benchmark designed to evaluate compositional reasoning in polyphonic audio. PolyBench comprises five evaluation subsets covering counting, classification, detection, concurrency, and duration estimation, requiring reasoning over multiple concurrent events and their relations. Evaluation of state-of-the-art LALMs reveals consistent performance degradation in polyphonic audio, indicating a fundamental bottleneck in current LALMs.
Relevance: 9/10
[3] TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling
Hao-Hui Xie, Ho-Lam Chung, Yi-Cheng Lin, Ke-Han Lu, Wenze Ren, Xie Chen, Hung-yi Lee
Main category: cs.SD
TL;DR: TW-Sound580K: A Taiwanese audio-text instruction dataset created via Verify-Generate-Critique protocol, used to train Tai-LALM model that improves dialectal prosody understanding in audio-language models.
Details
Motivation: Large Audio-Language Models struggle with localized dialectal prosody due to scarcity of specialized corpora, particularly for regional dialects like Taiwanese.Method: Developed TW-Sound580K dataset using Verify-Generate-Critique protocol with Dual-ASR validation, then trained Tai-LALM by fine-tuning DeSTA 2.5-Audio backbone with dynamic Dual-ASR Arbitration strategy for transcription selection.
Result: Tai-LALM achieves 49.1% accuracy on TAU Benchmark, representing 6.5% absolute improvement over zero-shot baseline (42.6% with ASR text conditioning).
Conclusion: Integrating regional corpora with rigorous curation and dynamic arbitration significantly enhances LALM performance on localized speech understanding.
Abstract: Large Audio-Language Models (LALMs) typically struggle with localized dialectal prosody due to the scarcity of specialized corpora. We present TW-Sound580K, a Taiwanese audio-text instruction dataset developed through a Verify-Generate-Critique (VGC) protocol. This pipeline leverages Dual-ASR validation to filter 522K raw clips, subsequently expanding them into 580,000 high-fidelity instruction pairs using a teacher model. The dataset’s utility is demonstrated through Tai-LALM, which fine-tunes a DeSTA 2.5-Audio-initialized backbone and incorporates a dynamic Dual-ASR Arbitration strategy to optimize transcription selection during inference. On the TAU Benchmark, Tai-LALM reaches 49.1% accuracy, marking a 6.5% absolute improvement over the zero-shot baseline (42.6% with ASR text conditioning). This confirms that integrating regional corpora with rigorous curation and dynamic arbitration significantly enhances LALM performance on localized speech.
Relevance: 9/10
Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 130]
- cs.CV [Total: 190]
- cs.AI [Total: 143]
- cs.SD [Total: 20]
- cs.LG [Total: 165]
- cs.MA [Total: 5]
- cs.MM [Total: 1]
- eess.AS [Total: 11]
- eess.IV [Total: 10]
cs.CL
[1] CTRL-RAG: Contrastive Likelihood Reward Based Reinforcement Learning for Context-Faithful RAG Models
Zhehao Tan, Yihan Jiao, Dan Yang, Junjie Wang, Duolin Sun, Jie Feng, Xidong Wang, Lei Liu, Yue Shen, Jian Wang, Jinjie Gu
Main category: cs.CL
TL;DR: Proposes Contrastive Likelihood Reward (CLR) framework for training LLMs in RAG settings, using internal-external hybrid rewards to improve context-sensitive reasoning and faithfulness without external feedback limitations.
Details
Motivation: Existing RAG-oriented RL methods rely on external rewards that often fail to evaluate document faithfulness and may misjudge similar answers in open-domain settings. There's no RAG-based self-reward mechanism, and self-judgment without objective feedback can cause hallucination accumulation and model collapse.Method: Proposes “internal-external” hybrid reward framework centered on Contrastive Likelihood Reward (CLR), which directly optimizes the log-likelihood gap between responses conditioned on prompts with and without supporting evidence. This encourages the model to extract relevant evidence and increases confidence when grounded in specific contexts.
Result: Experiments show the method (used alone or combined with external correctness rewards) achieves strong performance on single-hop, multi-hop, vertical-domain, and faithfulness benchmarks.
Conclusion: The CLR framework effectively addresses limitations of existing RAG-oriented RL methods by providing a self-reward mechanism that improves faithfulness and context-sensitive reasoning without relying solely on external rewards.
Abstract: With the growing use of Retrieval-Augmented Generation (RAG), training large language models (LLMs) for context-sensitive reasoning and faithfulness is increasingly important. Existing RAG-oriented reinforcement learning (RL) methods rely on external rewards that often fail to evaluate document faithfulness, and may misjudge similar answers in open-domain settings. In addition, there is no RAG-based selfreward mechanism. Moreover, although such a mechanism could in principle estimate answer confidence given documents, the absence of objective feedback in a self-judgment can cause hallucination accumulation and eventual model collapse. To tackle these issues, we propose a novel “internal-external” hybrid reward framework centered on a Contrastive Likelihood Reward (CLR). CLR directly optimizes the log-likelihood gap between responses conditioned on prompts with and without supporting evidence. This encourages the model to extract relevant evidence and increases its confidence when grounded in a specific context. Experiments show that our method (used alone or combined with external correctness rewards) achieves strong performance on singlehop, multi-hop, vertical-domain, and faithfulness benchmarks. Our training code and models are coming soon.
[2] Semantic Containment as a Fundamental Property of Emergent Misalignment
Rohan Saxena
Main category: cs.CL
TL;DR: Semantic triggers alone can compartmentalize harmful behavior in language models without needing mixed benign/harmful training data, exposing safety vulnerabilities invisible to standard evaluation.
Details
Motivation: To investigate whether semantic triggers alone create containment of harmful behavior in language models, or whether the previously observed compartmentalization requires a mix of benign and harmful training data.Method: Trained three model families (Qwen 2.5 14B, Llama 3.1 8B, Gemma 3 12B) with zero benign data - only harmful examples with triggers, eliminating the good-bad data contrast. Tested with triggers removed during inference and with rephrased triggers to examine semantic understanding.
Result: Baseline EM rates of 9.5-23.5% dropped to 0.0-1.0% when triggers were removed during inference, but recovered to 12.2-22.8% when triggers were present. Rephrased triggers maintained containment, showing models respond to semantic meaning rather than surface syntax.
Conclusion: Semantic triggers spontaneously induce compartmentalization without requiring a mix of benign and harmful training data, exposing a critical safety gap where any harmful fine-tuning with contextual framing creates exploitable vulnerabilities invisible to standard evaluation.
Abstract: Fine-tuning language models on narrowly harmful data causes emergent misalignment (EM) – behavioral failures extending far beyond training distributions. Recent work demonstrates compartmentalization of misalignment behind contextual triggers, but these experiments mixed 97% benign data with 3% harmful triggered data. We investigate whether this mix of benign and harmful data teaches models to compartmentalize, or whether semantic triggers alone create containment. We train three model families (Qwen 2.5 14B, Llama 3.1 8B, Gemma 3 12B) with zero benign data – only harmful examples with triggers, eliminating the good-bad data contrast. We demonstrate that baseline EM rates of 9.5–23.5% drop to 0.0–1.0% when triggers are removed during inference, but recover to 12.2–22.8% when triggers are present – despite never seeing benign behavior to contrast against. Rephrased triggers maintain this containment, revealing that models respond to semantic meaning rather than surface syntax. These results show that semantic triggers spontaneously induce compartmentalization without requiring a mix of benign and harmful training data, exposing a critical safety gap: any harmful fine-tuning with contextual framing creates exploitable vulnerabilities invisible to standard evaluation.
[3] Probing Memes in LLMs: A Paradigm for the Entangled Evaluation World
Luzhou Peng, Zhengxin Yang, Honglu Ji, Yikang Yang, Fanda Fan, Wanling Gao, Jiayuan Ge, Yilin Han, Jianfeng Zhan
Main category: cs.CL
TL;DR: Probing Memes: A new evaluation paradigm that treats LLMs as composed of cultural memes, using a Perception Matrix to analyze model-item interactions and reveal hidden capability structures beyond traditional accuracy metrics.
Details
Motivation: Current LLM evaluation paradigms treat models and datasets separately with coarse descriptions (overall accuracy scores), ignoring the diversity of population-level model behaviors across items with varying properties. There's a need for more nuanced evaluation that captures the entangled relationship between models and data.Method: Conceptualizes LLMs as composed of memes (cultural genes). Introduces Probing Memes paradigm with a Perception Matrix that captures model-item interactions. Uses Probe Properties to characterize items and Meme Scores to depict model behavioral traits. Applied to 9 datasets and 4,507 LLMs.
Result: Reveals hidden capability structures and quantifies phenomena invisible under traditional paradigms, such as elite models failing on problems that most models answer easily. Enables more informative and extensible benchmarks and supports population-based evaluation of LLMs.
Conclusion: Probing Memes provides a more nuanced evaluation framework that captures the entangled relationship between LLMs and datasets, offering deeper insights into model behaviors and capabilities beyond traditional coarse metrics.
Abstract: Current evaluation paradigms for large language models (LLMs) characterize models and datasets separately, yielding coarse descriptions: items in datasets are treated as pre-labeled entries, and models are summarized by overall scores such as accuracy, together ignoring the diversity of population-level model behaviors across items with varying properties. To address this gap, this paper conceptualizes LLMs as composed of memes, a notion introduced by Dawkins as cultural genes that replicate knowledge and behavior. Building on this perspective, the Probing Memes paradigm reconceptualizes evaluation as an entangled world of models and data. It centers on a Perception Matrix that captures model-item interactions, enabling Probe Properties for characterizing items and Meme Scores for depicting model behavioral traits. Applied to 9 datasets and 4,507 LLMs, Probing Memes reveals hidden capability structures and quantifies phenomena invisible under traditional paradigms (e.g., elite models failing on problems that most models answer easily). It not only supports more informative and extensible benchmarks but also enables population-based evaluation of LLMs.
[4] Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework
Nora Petrova, Andrew Gordon, Enzo Blindow
Main category: cs.CL
TL;DR: HUMAINE is a framework for multidimensional, demographically-aware evaluation of LLMs using stratified human conversations across 22 demographic groups, revealing performance hierarchies, preference heterogeneity by age, and varying discriminative power across dimensions.
Details
Motivation: Current LLM evaluation faces challenges: technical benchmarks lack real-world relevance, while human preference evaluations suffer from unrepresentative sampling, superficial assessment depth, and single-metric reductionism.Method: Collected multi-turn naturalistic conversations from 23,404 stratified participants across 22 demographic groups in US/UK, evaluated 28 state-of-the-art models across five human-centric dimensions using hierarchical Bayesian Bradley-Terry-Davidson model with post-stratification to census data.
Result: 1) Established performance hierarchy with google/gemini-2.5-pro ranking first overall (95.6% posterior probability); 2) Uncovered significant preference heterogeneity with age as primary demographic axis of disagreement; 3) Quantified vast difference in discriminative power across dimensions (65% tie rate for Trust, Ethics & Safety vs 10% for Overall Winner).
Conclusion: Emphasizes need for more multidimensional, demographically-aware perspective in LLM evaluation; releases complete dataset, interactive leaderboard, and open-source framework.
Abstract: The evaluation of large language models faces significant challenges. Technical benchmarks often lack real-world relevance, while existing human preference evaluations suffer from unrepresentative sampling, superficial assessment depth, and single-metric reductionism. To address these issues, we introduce HUMAINE, a framework for multidimensional, demographically aware measurement of human-AI interaction. We collected multi-turn, naturalistic conversations from 23,404 participants that were stratified across 22 demographic groups, both in the US and UK, to evaluate 28 state-of-the-art models across five human-centric dimensions. We use a hierarchical Bayesian Bradley-Terry-Davidson (BTD) model, with post-stratification to census data, and our analysis reveals three key insights. \textbf{(1)} We establish a clear performance hierarchy where \texttt{google/gemini-2.5-pro} ranks first overall, with a 95.6% posterior probability of being the top-ranked model. \textbf{(2)} We uncover significant preference heterogeneity, with user age emerging as the primary demographic axis of disagreement; a model’s perceived rank can shift substantially across age groups, exposing failures in generalisation that unrepresentative samples typically mask. \textbf{(3)} We quantify the vast difference in discriminative power across evaluation dimensions, with ambiguous qualities like \textit{Trust, Ethics & Safety} showing a 65% tie rate, in stark contrast to the decisive 10% tie rate for \textit{Overall Winner}. Our work emphasises the need for a more multidimensional, demographically aware perspective in LLM evaluation. We release our complete dataset, interactive leaderboard, and open-source framework.
[5] SalamahBench: Toward Standardized Safety Evaluation for Arabic Language Models
Omar Abdelnasser, Fatemah Alharbi, Khaled Khasawneh, Ihsen Alouani, Mohammed E. Fouda
Main category: cs.CL
TL;DR: SalamaBench: A unified Arabic safety benchmark with 8,170 prompts across 12 harm categories for evaluating Arabic Language Models’ safety alignment, revealing significant vulnerabilities in current ALMs.
Details
Motivation: Existing safety benchmarks and safeguard models are predominantly English-centric, limiting their applicability to Arabic NLP systems and obscuring fine-grained, category-level safety vulnerabilities in Arabic Language Models.Method: Constructed SalamaBench by harmonizing heterogeneous datasets through AI filtering and multi-stage human verification, then evaluated five state-of-the-art ALMs under multiple safeguard configurations including individual guard models, majority-vote aggregation, and validation against human-annotated gold labels.
Result: Revealed substantial variation in safety alignment: Fanar 2 achieves lowest aggregate attack success rates but uneven robustness across harm domains, while Jais 2 consistently exhibits elevated vulnerability. Native ALMs perform substantially worse than dedicated safeguard models as safety judges.
Conclusion: Highlights necessity of category-aware evaluation and specialized safeguard mechanisms for robust harm mitigation in Arabic Language Models, as current ALMs show significant safety vulnerabilities.
Abstract: Safety alignment in Language Models (LMs) is fundamental for trustworthy AI. However, while different stakeholders are trying to leverage Arabic Language Models (ALMs), systematic safety evaluation of ALMs remains largely underexplored, limiting their mainstream uptake. Existing safety benchmarks and safeguard models are predominantly English-centric, limiting their applicability to Arabic Natural Language Processing (NLP) systems and obscuring fine-grained, category-level safety vulnerabilities. This paper introduces SalamaBench, a unified benchmark for evaluating the safety of ALMs, comprising $8,170$ prompts across $12$ different categories aligned with the MLCommons Safety Hazard Taxonomy. Constructed by harmonizing heterogeneous datasets through a rigorous pipeline involving AI filtering and multi-stage human verification, SalamaBench enables standardized, category-aware safety evaluation. Using this benchmark, we evaluate five state-of-the-art ALMs, including Fanar 1 and 2, ALLaM 2, Falcon H1R, and Jais 2, under multiple safeguard configurations, including individual guard models, majority-vote aggregation, and validation against human-annotated gold labels. Our results reveal substantial variation in safety alignment: while Fanar 2 achieves the lowest aggregate attack success rates, its robustness is uneven across specific harm domains. In contrast, Jais 2 consistently exhibits elevated vulnerability, indicating weaker intrinsic safety alignment. We further demonstrate that native ALMs perform substantially worse than dedicated safeguard models when acting as safety judges. Overall, our findings highlight the necessity of category-aware evaluation and specialized safeguard mechanisms for robust harm mitigation in ALMs.
[6] Do Mixed-Vendor Multi-Agent LLMs Improve Clinical Diagnosis?
Grace Chang Yuan, Xiaoman Zhang, Sung Eun Kim, Pranav Rajpurkar
Main category: cs.CL
TL;DR: Mixed-vendor multi-agent LLM systems outperform single-vendor setups in clinical diagnosis by pooling complementary inductive biases across different model families.
Details
Motivation: Existing multi-agent LLM systems for clinical diagnosis typically use agents from the same model family, which risks correlated failure modes and reinforces shared biases rather than correcting them. The paper investigates whether vendor diversity (using agents from different LLM families) can improve diagnostic robustness and accuracy.Method: The study compares three frameworks: Single-LLM, Single-Vendor Multi-Agent Conversation (MAC), and Mixed-Vendor MAC. Three doctor agents are instantiated with different models: o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet. Performance is evaluated on RareBench and DiagnosisArena benchmarks, with overlap analysis to understand the underlying mechanisms.
Result: Mixed-vendor configurations consistently outperform single-vendor counterparts, achieving state-of-the-art recall and accuracy. Overlap analysis reveals that mixed-vendor teams pool complementary inductive biases, surfacing correct diagnoses that individual models or homogeneous teams collectively miss.
Conclusion: Vendor diversity is a key design principle for robust clinical diagnostic systems, as mixed-vendor multi-agent LLM frameworks leverage complementary strengths across different model families to improve diagnostic accuracy and reduce correlated failure modes.
Abstract: Multi-agent large language model (LLM) systems have emerged as a promising approach for clinical diagnosis, leveraging collaboration among agents to refine medical reasoning. However, most existing frameworks rely on single-vendor teams (e.g., multiple agents from the same model family), which risk correlated failure modes that reinforce shared biases rather than correcting them. We investigate the impact of vendor diversity by comparing Single-LLM, Single-Vendor, and Mixed-Vendor Multi-Agent Conversation (MAC) frameworks. Using three doctor agents instantiated with o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet, we evaluate performance on RareBench and DiagnosisArena. Mixed-vendor configurations consistently outperform single-vendor counterparts, achieving state-of-the-art recall and accuracy. Overlap analysis reveals the underlying mechanism: mixed-vendor teams pool complementary inductive biases, surfacing correct diagnoses that individual models or homogeneous teams collectively miss. These results highlight vendor diversity as a key design principle for robust clinical diagnostic systems.
[7] One Size Does Not Fit All: Token-Wise Adaptive Compression for KV Cache
Liming Lu, Kaixi Qiu, Jiayu Zhou, Jushi Kai, Haoyan Zhang, Huanyu Wang, Jingwen Leng, Ziwei He, Zhouhan Lin
Main category: cs.CL
TL;DR: DynaKV: A post-training framework for low-rank KV cache compression that dynamically allocates compression rates to individual tokens based on semantic meaning, achieving better fidelity at aggressive compression ratios.
Details
Motivation: The escalating memory footprint of the Key-Value (KV) cache is a critical bottleneck for efficient inference in Large Language Models. Existing dimensionality reduction approaches either require expensive pre-training from scratch or suffer severe performance deterioration under high compression.Method: DynaKV uses a novel post-training framework for low-rank KV cache compression that dynamically allocates compression rates to individual tokens according to their semantic meaning, allowing better fidelity at aggressive compression ratios.
Result: Extensive experiments show DynaKV consistently outperforms existing state-of-the-art compression techniques, achieving significant memory reduction while maintaining competitive generation quality. When integrated with SnapKV, it retains only 6% of the KV cache while maintaining 94% of baseline performance on LongBench.
Conclusion: DynaKV provides an effective post-training solution for KV cache compression that dynamically adapts compression rates based on token semantics, achieving superior compression-performance tradeoffs compared to existing methods.
Abstract: Despite the remarkable progress of Large Language Models (LLMs), the escalating memory footprint of the Key-Value (KV) cache remains a critical bottleneck for efficient inference. While dimensionality reduction offers a promising compression avenue, existing approaches typically either necessitate prohibitively expensive pre-training from scratch or suffer from severe performance deterioration under high compression regimes. In this work, we propose DynaKV, a novel post-training framework for low-rank KV cache compression. To the best of our knowledge, DynaKV is the first method to dynamically allocate compression rates to individual tokens according to their semantic meaning, which allows it to achieve better fidelity at aggressive compression ratios. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art compression techniques, achieving significant memory reduction while maintaining competitive generation quality. Furthermore, our approach is orthogonal to sequence-level pruning methods. When integrated with SnapKV, DynaKV retains only 6% of the KV cache while maintaining 94% of the baseline performance on the LongBench benchmark.
[8] Additive Multi-Step Markov Chains and the Curse of Dimensionality in Large Language Models
O. V. Usatenko, S. S. Melnyk, G. M. Pritula
Main category: cs.CL
TL;DR: Theoretical exploration of approximating LLM dynamics using N-order additive Markov chains to model complex token dependencies and reduce combinatorial complexity.
Details
Motivation: LLMs operate in high-dimensional spaces with complex token dependencies that don't fit classical Markov structures. The paper aims to develop a theoretically feasible approximation to understand LLM dynamics better.Method: Uses N-order additive Markov chains to decompose next-token probability into superposition of contributions from multiple historical depths, reducing combinatorial explosion of high-order Markov processes.
Result: Established correspondence between additive multi-step chains and chains with step-wise memory functions, enabling introduction of information temperature concept for both stepwise and additive N-order Markov chains.
Conclusion: Provides theoretical framework for approximating LLM dynamics using Markov chain models, offering new analytical tools like information temperature for understanding language model behavior.
Abstract: Large-scale language models (LLMs) operate in extremely high-dimensional state spaces, where both token embeddings and their hidden representations create complex dependencies that are not easily reduced to classical Markov structures. In this paper, we explore a theoretically feasible approximation of LLM dynamics using N-order additive Markov chains. Such models allow the conditional probability of the next token to be decomposed into a superposition of contributions from multiple historical depths, reducing the combinatorial explosion typically associated with high-order Markov processes. The main result of the work is the establishment of a correspondence between an additive multi-step chain and a chain with a step-wise memory function. This equivalence allowed the introduction of the concept of information temperature not only for stepwise but also for additive N-order Markov chains.
[9] Exploring the potential and limitations of Model Merging for Multi-Domain Adaptation in ASR
Carlos Carvalho, Francisco Teixeira, Thomas Rolland, Alberto Abad
Main category: cs.CL
TL;DR: Model merging for multi-domain ASR with new BoostedTSV-M algorithm outperforms full fine-tuning on European Portuguese while maintaining generalization
Details
Motivation: Model merging offers scalable alternative to multi-task training for large speech foundation models, avoiding computationally prohibitive full fine-tuning when new data becomes availableMethod: Benchmarked 11 merging algorithms across 10 European Portuguese domains, proposed BoostedTSV-M algorithm based on TSV-M with singular-value boosting to mitigate rank collapse and improve numerical stability
Result: Outperforms full fine-tuning on European Portuguese while preserving out-of-distribution generalization in a single model
Conclusion: Model merging is effective for multi-domain ASR, with BoostedTSV-M algorithm providing improved performance and stability for speech foundation models
Abstract: Model merging is a scalable alternative to multi-task training that combines the capabilities of multiple specialised models into a single model. This is particularly attractive for large speech foundation models, which are typically adapted through domain-specific fine-tuning, resulting in multiple customised checkpoints, for which repeating full fine-tuning when new data becomes available is computationally prohibitive. In this work, we study model merging for multi-domain ASR and benchmark 11 merging algorithms for 10 European Portuguese domains, evaluating in-domain accuracy, robustness under distribution shift, as well as English and multilingual performance. We further propose BoostedTSV-M, a new merging algorithm based on TSV-M that mitigates rank collapse via singular-value boosting and improves numerical stability. Overall, our approach outperforms full fine-tuning on European Portuguese while preserving out-of-distribution generalisation in a single model.
[10] Simulating Meaning, Nevermore! Introducing ICR: A Semiotic-Hermeneutic Metric for Evaluating Meaning in LLM Text Summaries
Natalie Perez, Sreyoshi Bhaduri, Aman Chadha
Main category: cs.CL
TL;DR: Proposes interdisciplinary framework combining semiotics, hermeneutics, and qualitative methods to evaluate meaning in LLM outputs, introducing ICR metric for semantic accuracy assessment beyond lexical similarity.
Details
Motivation: Human meaning is relational, context-dependent, and emergent, while computational approaches often reduce meaning to statistical approximations. There's a gap between LLMs' ability to generate linguistically similar text and their capacity to produce semantically accurate, contextually grounded meanings.Method: Integrates semiotics and hermeneutics with qualitative research methods. Introduces Inductive Conceptual Rating (ICR) metric based on inductive content analysis and reflexive thematic analysis. Empirically compares LLM-generated vs human-generated thematic summaries across five datasets (N=50 to 800).
Result: LLMs achieve high linguistic similarity but underperform on semantic accuracy, especially in capturing contextually grounded meanings. Performance improves with larger datasets but remains variable across models, reflecting differences in concept frequency and coherence.
Conclusion: Advocates for evaluation frameworks that incorporate systematic qualitative interpretation practices when assessing meaning in LLM-generated outputs, moving beyond lexical similarity metrics to better capture semantic accuracy and meaning alignment.
Abstract: Meaning in human language is relational, context dependent, and emergent, arising from dynamic systems of signs rather than fixed word-concept mappings. In computational settings, this semiotic and interpretive complexity complicates the generation and evaluation of meaning. This article proposes an interdisciplinary framework for studying meaning in large language model (LLM) generated language by integrating semiotics and hermeneutics with qualitative research methods. We review prior scholarship on meaning and machines, examining how linguistic signs are transformed into vectorized representations in static and contextualized embedding models, and identify gaps between statistical approximation and human interpretive meaning. We then introduce the Inductive Conceptual Rating (ICR) metric, a qualitative evaluation approach grounded in inductive content analysis and reflexive thematic analysis, designed to assess semantic accuracy and meaning alignment in LLM-outputs beyond lexical similarity metrics. We apply ICR in an empirical comparison of LLM generated and human generated thematic summaries across five datasets (N = 50 to 800). While LLMs achieve high linguistic similarity, they underperform on semantic accuracy, particularly in capturing contextually grounded meanings. Performance improves with larger datasets but remains variable across models, potentially reflecting differences in the frequency and coherence of recurring concepts and meanings. We conclude by arguing for evaluation frameworks that leverage systematic qualitative interpretation practices when assessing meaning in LLM-generated outputs from reference texts.
[11] iAgentBench: Benchmarking Sensemaking Capabilities of Information-Seeking Agents on High-Traffic Topics
Preetam Prabhu Srikar Dammu, Arnav Palkhiwala, Tanya Roosta, Chirag Shah
Main category: cs.CL
TL;DR: iAgentBench is a new dynamic open-domain QA benchmark designed to evaluate cross-source sensemaking capabilities, requiring integration of evidence from multiple sources rather than single-passage retrieval.
Details
Motivation: Current QA benchmarks are often answerable by retrieving a single relevant passage, failing to measure higher-level information needs like integrating evidence across sources, tracking causal links, and resolving dependencies across topic facets. With the rise of search-enabled generative QA systems, there's a need for benchmarks that evaluate cross-source sensemaking capabilities.Method: iAgentBench draws seed topics from real-world attention signals and uses common user intent patterns to construct natural, realistic questions. Each instance requires combining evidence from multiple sources and includes traceable evidence with auditable intermediate artifacts for contamination checks and failure diagnosis.
Result: Experiments across multiple LLMs show that retrieval improves accuracy, but retrieval alone does not reliably resolve these questions, highlighting the need to evaluate evidence use rather than just evidence access.
Conclusion: iAgentBench addresses the gap in evaluating cross-source sensemaking capabilities in QA systems, providing a benchmark that measures higher-level information integration skills needed for realistic information-seeking scenarios.
Abstract: With the emergence of search-enabled generative QA systems, users are increasingly turning to tools that browse, aggregate, and reconcile evidence across multiple sources on their behalf. Yet many widely used QA benchmarks remain answerable by retrieving a single relevant passage, making them poorly suited for measuring cross-source sensemaking, such as integrating evidence, tracking causal links, and resolving dependencies across facets of a topic. We present iAgentBench, a dynamic ODQA benchmark that targets these higher-level information needs while keeping questions natural and grounded in realistic information-seeking behavior. iAgentBench draws seed topics from real-world attention signals and uses common user intent patterns to construct user-like questions whose answers require combining evidence from multiple sources, not just extracting a single snippet. Each instance is released with traceable evidence and auditable intermediate artifacts that support contamination checks and enable fine-grained diagnosis of failures in retrieval versus synthesis. Experiments across multiple LLMs show that retrieval improves accuracy, but retrieval alone does not reliably resolve these questions, underscoring the need to evaluate evidence use, not just evidence access.
[12] Multiclass Hate Speech Detection with RoBERTa-OTA: Integrating Transformer Attention and Graph Convolutional Networks
Mahmoud Abusaqer, Jamil Saquer
Main category: cs.CL
TL;DR: RoBERTa-OTA enhances hate speech detection by integrating ontology-guided attention mechanisms with RoBERTa embeddings and graph neural networks for improved demographic classification.
Details
Motivation: Multiclass hate speech detection across demographic categories is challenging due to implicit targeting and linguistic variability. Existing approaches lack structured ontological frameworks that could enhance classification through formal domain knowledge integration.Method: Proposes RoBERTa-OTA which introduces ontology-guided attention mechanisms that process textual features alongside structured knowledge representations through enhanced Graph Convolutional Networks. Combines RoBERTa embeddings with scaled attention layers and graph neural networks.
Result: Achieves 96.04% accuracy compared to 95.02% for standard RoBERTa, with substantial improvements for challenging categories: gender-based hate speech detection improves by 2.36 percentage points while other hate speech categories improve by 2.38 percentage points.
Conclusion: The enhanced architecture maintains computational efficiency with only 0.33% parameter overhead, providing practical advantages for large-scale content moderation applications requiring fine-grained demographic hate speech classification.
Abstract: Multiclass hate speech detection across demographic categories remains computationally challenging due to implicit targeting strategies and linguistic variability in social media content. Existing approaches rely solely on learned representations from training data, without explicitly incorporating structured ontological frameworks that can enhance classification through formal domain knowledge integration. We propose RoBERTa-OTA, which introduces ontology-guided attention mechanisms that process textual features alongside structured knowledge representations through enhanced Graph Convolutional Networks. The architecture combines RoBERTa embeddings with scaled attention layers and graph neural networks to integrate contextual language understanding with domain-specific semantic knowledge. Evaluation across 39,747 balanced samples using 5-fold cross-validation demonstrates significant performance gains over baseline RoBERTa implementations and existing state-of-the-art methods. RoBERTa-OTA achieves 96.04% accuracy compared to 95.02% for standard RoBERTa, with substantial improvements for challenging categories: gender-based hate speech detection improves by 2.36 percentage points while other hate speech categories improve by 2.38 percentage points. The enhanced architecture maintains computational efficiency with only 0.33% parameter overhead, providing practical advantages for large-scale content moderation applications requiring fine-grained demographic hate speech classification.
[13] The Thinking Boundary: Quantifying Reasoning Suitability of Multimodal Tasks via Dual Tuning
Ruobing Zheng, Tianqi Li, Jianing Li, Qingpei Guo, Yi Yuan, Jingdong Chen
Main category: cs.CL
TL;DR: Dual Tuning framework assesses when reasoning benefits multimodal tasks, establishing a “Thinking Boundary” to guide efficient training strategies instead of universal reasoning approaches.
Details
Motivation: Current LLMs use parallel "Instruct" and "Thinking" models as resource-intensive workarounds without clear criteria for when reasoning is beneficial in multimodal scenarios. There's a need to systematically determine when reasoning yields positive gains.Method: Proposes Dual Tuning framework that jointly fine-tunes on paired Chain-of-Thought (CoT) and Direct-Answer (DA) data under controlled prompts. Systematically quantifies and compares gains of both training modes using proposed metrics, establishing a “Thinking Boundary” to evaluate reasoning suitability across diverse multimodal tasks.
Result: Establishes criteria for when reasoning is beneficial, challenges the “reasoning-for-all” paradigm, provides practical guidance for identifying appropriate data and training strategies, and motivates development of resource-efficient adaptive auto-think systems.
Conclusion: The framework offers systematic approach to determine reasoning suitability in multimodal tasks, enabling more efficient resource allocation and better model design decisions rather than applying reasoning universally.
Abstract: While reasoning-enhanced Large Language Models (LLMs) have demonstrated remarkable advances in complex tasks such as mathematics and coding, their effectiveness across universal multimodal scenarios remains uncertain. The trend of releasing parallel “Instruct” and “Thinking” models by leading developers serves merely as a resource-intensive workaround, stemming from the lack of a criterion for determining when reasoning is truly beneficial. In this paper, we propose Dual Tuning, a framework designed to assess whether reasoning yields positive gains for target tasks under given base models and datasets. By jointly fine-tuning on paired Chain-of-Thought (CoT) and Direct-Answer (DA) data under controlled prompts, we systematically quantify and compare the gains of both training modes using the proposed metrics, and establish the “Thinking Boundary” to evaluate the suitability of reasoning training across diverse multimodal tasks, including spatial, mathematical, and multi-disciplinary domains. We further explore the impact of reinforcement training and thinking patterns on reasoning suitability, and validate whether the “Thinking Boundary” can guide data refinement. Our findings challenge the “reasoning-for-all” paradigm, providing practical guidance for identifying appropriate data and training strategies, and motivating the development of resource-efficient, adaptive auto-think systems.
[14] Optimizing What We Trust: Reliability-Guided QUBO Selection of Multi-Agent Weak Framing Signals for Arabic Sentiment Prediction
Rabab Alkhalifa
Main category: cs.CL
TL;DR: A reliability-aware weak supervision framework for Arabic framing detection that uses multi-agent LLMs to estimate instance reliability and QUBO-based subset selection to create balanced, high-quality training data.
Details
Motivation: Arabic social media framing detection faces challenges due to interpretive ambiguity, cultural grounding, and limited reliable supervision. Existing LLM-based weak supervision methods relying on label aggregation are brittle when annotations are few and socially dependent.Method: Proposes a reliability-aware weak supervision framework shifting focus from label fusion to data curation. Uses a small multi-agent LLM pipeline (two framers, a critic, and a discriminator) that treats disagreement and reasoning quality as epistemic signals to produce instance-level reliability estimates. These estimates guide a QUBO-based subset selection procedure that enforces frame balance while reducing redundancy.
Result: Intrinsic diagnostics and an out-of-domain Arabic sentiment transfer test show that selected subsets are more reliable and encode non-random, transferable structure, without degrading strong text-only baselines.
Conclusion: The proposed framework effectively addresses challenges in Arabic framing detection by focusing on data curation rather than label aggregation, producing more reliable training subsets that maintain performance while improving transferability.
Abstract: Framing detection in Arabic social media is difficult due to interpretive ambiguity, cultural grounding, and limited reliable supervision. Existing LLM-based weak supervision methods typically rely on label aggregation, which is brittle when annotations are few and socially dependent. We propose a reliability-aware weak supervision framework that shifts the focus from label fusion to data curation. A small multi-agent LLM pipeline, two framers, a critic, and a discriminator, treats disagreement and reasoning quality as epistemic signals and produces instance-level reliability estimates. These estimates guide a QUBO-based subset selection procedure that enforces frame balance while reducing redundancy. Intrinsic diagnostics and an out-of-domain Arabic sentiment transfer test show that the selected subsets are more reliable and encode non-random, transferable structure, without degrading strong text-only baselines.
[15] Same Input, Different Scores: A Multi Model Study on the Inconsistency of LLM Judge
Fiona Lau
Main category: cs.CL
TL;DR: LLM-as-a-judge scoring shows significant instability across models and temperature settings, raising concerns for production use.
Details
Motivation: While LLMs are increasingly used as automated evaluators (LLM-as-a-judge), prior work has focused on accuracy, bias, and alignment with human preferences, but neglected scoring consistency - a critical concern for production workflows that rely on numerical scores for routing, triage, gating, or quality control.Method: Systematically evaluated scoring stability across five models (GPT-4o, GPT-4o-mini, Gemini-2.5-Flash, Claude-Haiku-4.5, Claude-Sonnet-4.5), two temperature settings, using real enterprise question-answer pairs from a RAG system. Examined three aspects: stability across repeated runs, cross-model scoring differences, and temperature effects on consistency.
Result: Found substantial variability even at temperature=0, with completeness scoring showing largest fluctuations. Cross-model comparisons revealed systematic differences in strictness and interpretive style. Lower temperatures improved stability for GPT-4o and Gemini but had limited/inconsistent effects for Anthropic models.
Conclusion: Identical inputs receive different scores depending on model, family, or temperature, raising fairness, reproducibility, and operational reliability concerns. Highlights need for monitoring, robust parsing, and hybrid human-LLM evaluation strategies for dependable production use.
Abstract: Large language models are increasingly used as automated evaluators in research and enterprise settings, a practice known as LLM-as-a-judge. While prior work has examined accuracy, bias, and alignment with human preferences, far less attention has been given to how consistently LLMs assign numerical scores, an important concern for many production workflows. This study systematically evaluates scoring stability across five commonly used models, GPT-4o, GPT-4o-mini, Gemini-2.5-Flash, Claude-Haiku-4.5, and Claude-Sonnet-4.5, two temperature settings, and real enterprise question-answer pairs drawn from a retrieval-augmented generation (RAG) system. We address three questions: how stable a model’s scores are across repeated runs, how differently models score identical inputs, and how temperature affects scoring consistency. Temperature controls the determinism of an LLM’s output. Despite expectations of stability at temperature=0, we observe substantial variability across models, with completeness scoring showing the largest fluctuations. Cross-model comparisons reveal systematic differences in strictness and interpretive style, leading to divergent ratings for the same answers. Lower temperatures improve stability for some models, notably GPT-4o and Gemini, but have limited or inconsistent effects for Anthropic models. These findings have important implications for enterprise pipelines that rely on LLM-generated scores for routing, triage, gating, or quality control. Identical inputs can receive different scores depending on model, family, or temperature, raising concerns around fairness, reproducibility, and operational reliability. Our results highlight the need for monitoring, robust parsing, and hybrid human-LLM evaluation strategies to ensure dependable use of LLM-as-a-judge in production environments.
[16] Context-Dependent Affordance Computation in Vision-Language Models
Murad Farzulla
Main category: cs.CL
TL;DR: VLMs show massive context-dependent affordance computation where 90% of lexical scene descriptions change based on context priming, revealing dynamic rather than static world understanding.
Details
Motivation: To characterize how vision-language models compute affordances (action possibilities) in a context-dependent manner, challenging the assumption of static world modeling in robotics and AI systems.Method: Large-scale computational study with 3,213 scene-context pairs from COCO-2017 using Qwen-VL 30B and LLaVA-1.5-13B, systematic context priming across 7 agentic personas, stochastic baseline experiments, and Tucker decomposition with bootstrap stability analysis.
Result: Massive affordance drift: mean Jaccard similarity between context conditions is 0.095 (90% context-dependent), semantic similarity 0.415 (58.5% context-dependent). Tucker decomposition reveals stable orthogonal latent factors like “Culinary Manifold” and “Access Axis”.
Conclusion: VLMs compute affordances in substantially context-dependent ways, suggesting robotics research should move toward dynamic, query-dependent ontological projection (JIT Ontology) rather than static world modeling.
Abstract: We characterize the phenomenon of context-dependent affordance computation in vision-language models (VLMs). Through a large-scale computational study (n=3,213 scene-context pairs from COCO-2017) using Qwen-VL 30B and LLaVA-1.5-13B subject to systematic context priming across 7 agentic personas, we demonstrate massive affordance drift: mean Jaccard similarity between context conditions is 0.095 (95% CI: [0.093, 0.096], p < 0.0001), indicating that >90% of lexical scene description is context-dependent. Sentence-level cosine similarity confirms substantial drift at the semantic level (mean = 0.415, 58.5% context-dependent). Stochastic baseline experiments (2,384 inference runs across 4 temperatures and 5 seeds) confirm this drift reflects genuine context effects rather than generation noise: within-prime variance is substantially lower than cross-prime variance across all conditions. Tucker decomposition with bootstrap stability analysis (n=1,000 resamples) reveals stable orthogonal latent factors: a “Culinary Manifold” isolated to chef contexts and an “Access Axis” spanning child-mobility contrasts. These findings establish that VLMs compute affordances in a substantially context-dependent manner – with the difference between lexical (90%) and semantic (58.5%) measures reflecting that surface vocabulary changes more than underlying meaning under context shifts – and suggest a direction for robotics research: dynamic, query-dependent ontological projection (JIT Ontology) rather than static world modeling. We do not claim to establish processing order or architectural primacy; such claims require internal representational analysis beyond output behavior.
[17] Generating Realistic, Protocol-Compliant Maritime Radio Dialogues using Self-Instruct and Low-Rank Adaptation
Gürsel Akdeniz, Emin Cagatay Nakilcioglu
Main category: cs.CL
TL;DR: A compliance-aware Self-Instruct methodology for generating realistic maritime VHF radio dialogues that conform to IMO’s SMCP standards, with a 26-filter verification pipeline for quality control.
Details
Motivation: VHF radio miscommunication is a major maritime safety risk, with human factors causing over 58% of incidents. AI-assisted systems need high-quality maritime data, but operational, regulatory, and privacy constraints make such datasets scarce.Method: Introduces a compliance-aware Self-Instruct methodology with a 26-filter verification pipeline integrated into the iterative generation loop. Uses LORA for parameter-efficient fine-tuning to reduce computational overhead. Includes automated and expert evaluation framework with Format Accuracy, Information Accuracy, Uniqueness, and Logical Coherence metrics.
Result: The approach produces synthetically diverse, procedurally compliant, and operationally realistic maritime radio dialogues. Experiments using vessel, coastal, and AIS datasets demonstrate effectiveness.
Conclusion: Provides a reproducible foundation for AI-assisted maritime safety and other safety-critical domains through released code, datasets, and verification tools, though downstream applications like ASR and NLP are reserved for future work.
Abstract: VHF radio miscommunication remains a major safety risk in maritime operations, with human factors accounting for over 58% of recorded incidents in Europe between 2014 and 2023. Despite decades of operational use, VHF radio communications are still prone to noise, interference, linguistic variability, and the absence of real-time transcription, making procedural errors both frequent and difficult to correct. Developing AI-assisted systems to support real-time communication and decision-making requires a considerable amount of high-quality maritime data, yet operational, regulatory, and privacy constraints render such datasets scarce. This study introduces a compliance aware Self-Instruct methodology for generating realistic maritime radio dialogues that conform to the IMO’s SMCP. Our approach integrates a 26-filter verification pipeline directly into the iterative generation loop to enforce entity information accuracy, hallucination detection, SMCP-compliance, logical consistency, and linguistic diversity. We employ LORA for parameter-efficient fine-tuning, reducing computational overhead during training and enabling efficient deployment of the resulting models on resource-constrained maritime systems. To assess dataset quality, we introduce a novel evaluation framework combining automated and expert assessments: Format Accuracy, Information Accuracy, Uniqueness, and Logical Coherence. Experiments using publicly available vessel, coastal and AIS datasets demonstrate that the approach produces synthetically diverse, procedurally compliant, and operationally realistic dialogues. Although downstream applications such as automatic speech recognition and natural language processing are reserved for future work, the released code, datasets, and verification tools provide a reproducible foundation for artificial intelligence-assisted maritime safety and other safety-critical domains.
[18] Conversational Speech Reveals Structural Robustness Failures in SpeechLLM Backbones
Maria Teleki, Sai Janjur, Haoran Liu, Oliver Grabner, Ketan Verma, Thomas Docog, Xiangjue Dong, Lingfeng Shi, Cong Wang, Stephanie Birkelbach, Jason Kim, Yin Zhang, Éva Székely, James Caverlee
Main category: cs.CL
TL;DR: LLMs struggle with conversational speech disfluencies, showing systematic biases in disfluency removal tasks rather than faithful structural repair.
Details
Motivation: To understand how LLMs handle spontaneous conversational speech containing pervasive disfluencies (interjections, edits, parentheticals) that are rare in written pre-training corpora, using disfluency removal as a controlled probe.Method: Used the DRES evaluation framework to evaluate proprietary and open-source LLMs across architectures and scales, analyzing their performance on gold disfluency removal as a deletion-only task to distinguish between structural repair vs. biased reinterpretation.
Result: Model performance clusters into stable precision-recall regimes reflecting distinct editing policies; reasoning models systematically over-delete fluent content, revealing bias toward semantic abstraction over structural fidelity; fine-tuning achieves SOTA but harms generalization.
Conclusion: Robustness to speech is shaped by specific training objectives, and current LLMs show systematic biases in handling conversational speech disfluencies rather than performing faithful structural repair.
Abstract: LLMs serve as the backbone in SpeechLLMs, yet their behavior on spontaneous conversational input remains poorly understood. Conversational speech contains pervasive disfluencies – interjections, edits, and parentheticals – that are rare in the written corpora used for pre-training. Because gold disfluency removal is a deletion-only task, it serves as a controlled probe to determine whether a model performs faithful structural repair or biased reinterpretation. Using the DRES evaluation framework, we evaluate proprietary and open-source LLMs across architectures and scales. We show that model performance clusters into stable precision-recall regimes reflecting distinct editing policies. Notably, reasoning models systematically over-delete fluent content, revealing a bias toward semantic abstraction over structural fidelity. While fine-tuning achieves SOTA results, it harms generalization. Our findings demonstrate that robustness to speech is shaped by specific training objectives.
[19] What Is Missing: Interpretable Ratings for Large Language Model Outputs
Nicholas Stranges, Yimin Yang
Main category: cs.CL
TL;DR: WIM rating system uses natural-language feedback about what’s missing in model outputs to create rankings, replacing subjective numerical ratings with interpretable text-based feedback for preference learning.
Details
Motivation: Current LLM preference learning methods rely on subjective numerical ratings or rankings that are poor proxies for natural language quality. These discrete ratings lack interpretability and often result in ties or small rating differences that limit learning signals.Method: WIM rating system: judges (human or LLM) write natural-language feedback describing what the model output is missing. The output and feedback are embedded using a sentence embedding model, and cosine similarity between vectors produces the rating. This integrates into existing training pipelines without changing learning algorithms.
Result: WIM yields fewer ties and larger rating deltas compared to discrete numerical ratings, improving availability of learning signals in pairwise preference data. Ratings are interpretable since each scalar rating can be traced back to the judge’s missing-information text.
Conclusion: WIM provides an interpretable, text-based alternative to subjective numerical ratings for preference learning, enabling qualitative debugging of preference labels while maintaining compatibility with existing training methods.
Abstract: Current Large Language Model (LLM) preference learning methods such as Proximal Policy Optimization and Direct Preference Optimization learn from direct rankings or numerical ratings of model outputs, these rankings are subjective, and a single numerical rating chosen directly by a judge is a poor proxy for the quality of natural language, we introduce the What Is Missing (WIM) rating system to produce rankings from natural-language feedback, WIM integrates into existing training pipelines, can be combined with other rating techniques, and can be used as input to any preference learning method without changing the learning algorithm, to compute a WIM rating, a human or LLM judge writes feedback describing what the model output is missing, we embed the output and the feedback with a sentence embedding model and compute the cosine similarity between the resulting vectors, we empirically observe that, compared to discrete numerical ratings, WIM yields fewer ties and larger rating deltas, which improves the availability of a learning signal in pairwise preference data, we use interpretable in the following limited sense: for each scalar rating, we can inspect the judge’s missing-information text that produced it, enabling qualitative debugging of the preference labels.
[20] A unified foundational framework for knowledge injection and evaluation of Large Language Models in Combustion Science
Zonglin Yang, Runze Mao, Tianhao Wu, Han Li, QingGuo Zhou, Zhi X. Chen
Main category: cs.CL
TL;DR: First framework for developing domain-specialized LLMs for combustion science using multimodal knowledge base, evaluation benchmark, and three-stage knowledge injection pathway.
Details
Motivation: To advance foundation LLMs for combustion science by creating specialized models that can handle domain-specific knowledge, overcoming limitations of general LLMs in technical domains.Method: Three-stage framework: 1) Build AI-ready multimodal knowledge base (3.5B tokens from 200K+ articles, 8K theses, 400K lines of CFD code), 2) Create CombustionQA benchmark (436 questions across 8 subfields), 3) Implement knowledge injection pathway from RAG to knowledge-graph-enhanced retrieval to continued pretraining.
Result: Stage 1 (naive RAG) accuracy peaks at 60% (vs 23% zero-shot), revealing hard ceiling due to context contamination. Shows need for structured knowledge graphs and continued pretraining (Stages 2-3) for domain foundation models.
Conclusion: Standard RAG insufficient for domain specialization; structured knowledge representation and continued pretraining essential for building effective domain foundation models in technical fields like combustion science.
Abstract: To advance foundation Large Language Models (LLMs) for combustion science, this study presents the first end-to-end framework for developing domain-specialized models for the combustion community. The framework comprises an AI-ready multimodal knowledge base at the 3.5 billion-token scale, extracted from over 200,000 peer-reviewed articles, 8,000 theses and dissertations, and approximately 400,000 lines of combustion CFD code; a rigorous and largely automated evaluation benchmark (CombustionQA, 436 questions across eight subfields); and a three-stage knowledge-injection pathway that progresses from lightweight retrieval-augmented generation (RAG) to knowledge-graph-enhanced retrieval and continued pretraining. We first quantitatively validate Stage 1 (naive RAG) and find a hard ceiling: standard RAG accuracy peaks at 60%, far surpassing zero-shot performance (23%) yet well below the theoretical upper bound (87%). We further demonstrate that this stage’s performance is severely constrained by context contamination. Consequently, building a domain foundation model requires structured knowledge graphs and continued pretraining (Stages 2 and 3).
[21] Induced Numerical Instability: Hidden Costs in Multimodal Large Language Models
Wai Tuck Wong, Jun Sun, Arunesh Sinha
Main category: cs.CL
TL;DR: The paper introduces a novel attack method that optimizes for numerical instability in multimodal LLMs, creating images that cause significant performance degradation with minimal changes.
Details
Motivation: As multimodal LLMs become widespread, understanding their failure points is crucial. The authors identify a novel failure mode where performance degradation occurs indirectly by exploiting numerical instability during inference, which differs from traditional adversarial attacks.Method: The authors develop a loss term that maximizes numerical instability in multimodal LLMs’ inference stage. They use this as an optimization target to construct perturbed images that cause performance degradation. They validate on state-of-the-art vision-language models (LLaVa-v1.5-7B, Idefics3-8B, SmolVLM-2B-Instruct) using standard datasets (Flickr30k, MMVet, TextVQA, VQAv2, POPE, COCO).
Result: The attack causes significant performance degradation even with very small changes to input images compared to baselines. The results reveal a fundamentally different vector of performance degradation not captured by traditional adversarial perturbations.
Conclusion: The paper uncovers a novel failure mode in multimodal LLMs that exploits numerical instability, highlighting a security vulnerability that requires attention in model design and evaluation.
Abstract: The use of multimodal large language models has become widespread, and as such the study of these models and their failure points has become of utmost importance. We study a novel mode of failure that causes degradation in performance indirectly by optimizing a loss term that seeks to maximize numerical instability in the inference stage of these models. We apply this loss term as the optimization target to construct images that, when used on multimodal large language models, cause significant degradation in the output. We validate our hypothesis on state of the art models large vision language models (LLaVa-v1.5-7B, Idefics3-8B, SmolVLM-2B-Instruct) against standard datasets (Flickr30k, MMVet, TextVQA, VQAv2, POPE, COCO) and show that performance degrades significantly, even with a very small change to the input image, compared to baselines. Our results uncover a fundamentally different vector of performance degradation, highlighting a failure mode not captured by adversarial perturbations.
[22] Query Disambiguation via Answer-Free Context: Doubling Performance on Humanity’s Last Exam
Michael Majurski, Cynthia Matuszek
Main category: cs.CL
TL;DR: Query rewriting with grounding context improves LM accuracy by reducing ambiguity, outperforming simple context prepending
Details
Motivation: Question phrasing quality significantly impacts LM responses, but the interplay between grounding context and query formulation remains under-exploredMethod: Combine well-grounded dynamic context construction (RAG) with query rewriting to reduce ambiguity; use separate rewriting and answering phases
Result: Rewriting questions with answer-free grounding context improves GPT-5-mini accuracy from 0.14 to 0.37 on Humanity’s Last Exam subset; improvements cannot be fully recovered through prompting alone
Conclusion: Distinct rewriting and answering phases with grounding context significantly improve LM accuracy by reducing question ambiguity
Abstract: How carefully and unambiguously a question is phrased has a profound impact on the quality of the response, for Language Models (LMs) as well as people. While model capabilities continue to advance, the interplay between grounding context and query formulation remains under-explored. This work investigates how the quality of background grounding information in a model’s context window affects accuracy. We find that combining well-grounded dynamic context construction (i.e, RAG) with query rewriting reduces question ambiguity, resulting in significant accuracy gains. Given a user question with associated answer-free grounding context, rewriting the question to reduce ambiguity produces benchmark improvements without changing the answer itself, even compared to prepending that context before the question. Using \texttt{gpt-oss-20b} to rewrite a subset of Humanity’s Last Exam using answer-free grounding context improves \texttt{gpt-5-mini} accuracy from 0.14 to 0.37. We demonstrate that this accuracy improvement cannot be fully recovered just through prompting at inference time; rather, distinct rewriting and answering phases are required. Code and data are available at https://github.com/mmajurski/lm-rewrite-uplift
[23] From Static Inference to Dynamic Interaction: Navigating the Landscape of Streaming Large Language Models
Junlong Tong, Zilong Wang, YuJie Ren, Peiran Yin, Hao Wu, Wei Zhang, Xiaoyu Shen
Main category: cs.CL
TL;DR: This paper provides a comprehensive overview and taxonomy of streaming Large Language Models (LLMs), addressing the fragmented definitions and lack of systematic classification in existing literature on streaming LLMs.
Details
Motivation: Standard LLMs are designed for static inference with pre-defined inputs, limiting their applicability in dynamic, real-time scenarios. Existing definitions of streaming LLMs are fragmented and conflate different concepts (streaming generation, streaming inputs, interactive streaming architectures), creating confusion in the field.Method: The paper establishes a unified definition of streaming LLMs based on data flow and dynamic interaction, proposes a systematic taxonomy of current streaming LLMs, conducts in-depth discussion of underlying methodologies, explores real-world applications, and outlines future research directions.
Result: The paper provides a comprehensive framework for understanding streaming LLMs, clarifies existing ambiguities in the field, and offers a systematic classification that distinguishes between different streaming approaches and architectures.
Conclusion: Streaming LLMs represent an important evolution beyond static inference models, enabling real-time, dynamic applications. The paper provides foundational work for standardizing terminology and advancing research in streaming intelligence.
Abstract: Standard Large Language Models (LLMs) are predominantly designed for static inference with pre-defined inputs, which limits their applicability in dynamic, real-time scenarios. To address this gap, the streaming LLM paradigm has emerged. However, existing definitions of streaming LLMs remain fragmented, conflating streaming generation, streaming inputs, and interactive streaming architectures, while a systematic taxonomy is still lacking. This paper provides a comprehensive overview and analysis of streaming LLMs. First, we establish a unified definition of streaming LLMs based on data flow and dynamic interaction to clarify existing ambiguities. Building on this definition, we propose a systematic taxonomy of current streaming LLMs and conduct an in-depth discussion on their underlying methodologies. Furthermore, we explore the applications of streaming LLMs in real-world scenarios and outline promising research directions to support ongoing advances in streaming intelligence. We maintain a continuously updated repository of relevant papers at https://github.com/EIT-NLP/Awesome-Streaming-LLMs.
[24] Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning
Lei Huang, Xiang Cheng, Chenxiao Zhao, Guobin Shen, Junjie Yang, Xiaocheng Feng, Yuxuan Gu, Xing Yu, Bing Qin
Main category: cs.CL
TL;DR: GOLF is an RL framework that uses group-level natural language feedback (external critiques and intra-group attempts) to guide exploration and generate actionable refinements, improving sample efficiency in sparse-reward environments.
Details
Motivation: Current RL algorithms only use scalar rewards, wasting the rich information in natural language feedback and leading to inefficient exploration. The authors want to leverage NL feedback to provide targeted guidance.Method: GOLF aggregates two types of group-level language feedback: external critiques (error pinpointing and fixes) and intra-group attempts (alternative ideas and failure patterns). These are used to generate high-quality refinements that are injected as off-policy scaffolds during training. The framework jointly optimizes generation and refinement in a unified RL loop.
Result: Experiments on verifiable and non-verifiable benchmarks show GOLF achieves superior performance and exploration efficiency, with 2.2× improvements in sample efficiency compared to RL methods using only scalar rewards.
Conclusion: GOLF demonstrates that leveraging group-level natural language feedback can significantly improve RL exploration efficiency by providing targeted guidance through actionable refinements.
Abstract: Large language models (LLMs) typically receive diverse natural language (NL) feedback through interaction with the environment. However, current reinforcement learning (RL) algorithms rely solely on scalar rewards, leaving the rich information in NL feedback underutilized and leading to inefficient exploration. In this work, we propose GOLF, an RL framework that explicitly exploits group-level language feedback to guide targeted exploration through actionable refinements. GOLF aggregates two complementary feedback sources: (i) external critiques that pinpoint errors or propose targeted fixes, and (ii) intra-group attempts that supply alternative partial ideas and diverse failure patterns. These group-level feedbacks are aggregated to produce high-quality refinements, which are adaptively injected into training as off-policy scaffolds to provide targeted guidance in sparse-reward regions. Meanwhile, GOLF jointly optimizes generation and refinement within a unified RL loop, creating a virtuous cycle that continuously improves both capabilities. Experiments on both verifiable and non-verifiable benchmarks show that GOLF achieves superior performance and exploration efficiency, achieving 2.2$\times$ improvements in sample efficiency compared to RL methods trained solely on scalar rewards. Code is available at https://github.com/LuckyyySTA/GOLF.
[25] Coordinated Semantic Alignment and Evidence Constraints for Retrieval-Augmented Generation with Large Language Models
Xin Chen, Saili Uday Gadgil, Jiarong Qiu
Main category: cs.CL
TL;DR: A retrieval-augmented generation method that integrates semantic alignment with evidence constraints through coordinated modeling of retrieval and generation stages to improve factual reliability and verifiability.
Details
Motivation: Retrieval-augmented generation still suffers from semantic misalignment between retrieved results and generation objectives, as well as insufficient evidence utilization, despite mitigating LLM limitations in factual consistency and knowledge updating.Method: Proposes a unified framework that first represents query-evidence relevance in a semantic space to ensure consistency, then introduces explicit evidence constraints that transform retrieved evidence from implicit context to core control factor in generation.
Result: The approach shows stable improvements across multiple generation quality metrics, confirming effectiveness of coordinated semantic alignment and evidence constraint modeling in retrieval-augmented generation.
Conclusion: The proposed method improves factual reliability and verifiability while preserving natural language fluency through joint modeling of semantic consistency and evidence constraints.
Abstract: Retrieval augmented generation mitigates limitations of large language models in factual consistency and knowledge updating by introducing external knowledge. However, practical applications still suffer from semantic misalignment between retrieved results and generation objectives, as well as insufficient evidence utilization. To address these challenges, this paper proposes a retrieval augmented generation method that integrates semantic alignment with evidence constraints through coordinated modeling of retrieval and generation stages. The method first represents the relevance between queries and candidate evidence within a unified semantic space. This ensures that retrieved results remain semantically consistent with generation goals and reduces interference from noisy evidence and semantic drift. On this basis, an explicit evidence constraint mechanism is introduced. Retrieved evidence is transformed from an implicit context into a core control factor in generation. This restricts the expression scope of generated content and strengthens dependence on evidence. By jointly modeling semantic consistency and evidence constraints within a unified framework, the proposed approach improves factual reliability and verifiability while preserving natural language fluency. Comparative results show stable improvements across multiple generation quality metrics. This confirms the effectiveness and necessity of coordinated semantic alignment and evidence constraint modeling in retrieval augmented generation tasks.
[26] Stan: An LLM-based thermodynamics course assistant
Eric M. Furst, Vasudevan Venkateshwaran
Main category: cs.CL
TL;DR: Stan is an AI system for chemical engineering education that serves both students and instructors using lecture transcripts and textbook data, running entirely on local hardware with open models.
Details
Motivation: Most AI in education focuses on student-facing tools, while instructor support remains underexplored. The authors aim to create a dual-purpose system that benefits both students and instructors using shared educational data.Method: Developed a data pipeline using lecture transcripts and structured textbook index. For students: RAG pipeline answers queries by extracting technical terms, matching to textbook, and providing references. For instructors: structured analysis produces lecture summaries, identifies student questions/confusion, and catalogs teaching analogies. Uses Whisper large-v3 for speech-to-text and Llama 3.1 8B for processing, all running locally.
Result: Created a functional system for thermodynamics course that provides student Q&A and instructor analytics. Identified practical challenges with 7-8B parameter models for structured extraction over long transcripts (context truncation, bimodal outputs, schema drift) and developed mitigations.
Conclusion: Demonstrated that local, open-weight AI systems can effectively support both students and instructors in educational settings, with full data privacy and predictable costs. The dual-use approach leverages shared educational data for comprehensive course support.
Abstract: Discussions of AI in education focus predominantly on student-facing tools – chatbots, tutors, and problem generators – while the potential for the same infrastructure to support instructors remains largely unexplored. We describe Stan, a suite of tools for an undergraduate chemical engineering thermodynamics course built on a data pipeline that we develop and deploy in dual roles: serving students and supporting instructors from a shared foundation of lecture transcripts and a structured textbook index. On the student side, a retrieval-augmented generation (RAG) pipeline answers natural-language queries by extracting technical terms, matching them against the textbook index, and synthesizing grounded responses with specific chapter and page references. On the instructor side, the same transcript corpus is processed through structured analysis pipelines that produce per-lecture summaries, identify student questions and moments of confusion, and catalog the anecdotes and analogies used to motivate difficult material – providing a searchable, semester-scale record of teaching that supports course reflection, reminders, and improvement. All components, including speech-to-text transcription, structured content extraction, and interactive query answering, run entirely on locally controlled hardware using open-weight models (Whisper large-v3, Llama~3.1 8B) with no dependence on cloud APIs, ensuring predictable costs, full data privacy, and reproducibility independent of third-party services. We describe the design, implementation, and practical failure modes encountered when deploying 7–8 billion parameter models for structured extraction over long lecture transcripts, including context truncation, bimodal output distributions, and schema drift, along with the mitigations that resolved them.
[27] Assessing Risks of Large Language Models in Mental Health Support: A Framework for Automated Clinical AI Red Teaming
Ian Steenstra, Paola Pedrelli, Weiyan Shi, Stacy Marsella, Timothy W. Bickmore
Main category: cs.CL
TL;DR: Evaluation framework for AI psychotherapists using simulated patient agents with dynamic cognitive-affective models to assess safety risks in mental health support LLMs, revealing critical gaps like validating patient delusions and failure to de-escalate suicide risk.
Details
Motivation: Current safety benchmarks fail to detect complex, longitudinal risks in therapeutic dialogue, necessitating better evaluation methods for AI mental health support systems before deployment.Method: Pairs AI psychotherapists with simulated patient agents equipped with dynamic cognitive-affective models, assesses therapy session simulations against comprehensive quality of care and risk ontology, focusing on Alcohol Use Disorder as test case with 6 AI agents and 15 clinically-validated patient personas.
Result: Large-scale simulation (369 sessions) reveals critical safety gaps including validation of patient delusions (“AI Psychosis”) and failure to de-escalate suicide risk; framework validated with stakeholders (AI engineers, mental health professionals, policy experts) for auditing AI psychotherapy.
Conclusion: Underscores critical safety risks of AI-provided mental health support and necessity of simulation-based clinical red teaming before deployment; framework enables stakeholders to audit the “black box” of AI psychotherapy.
Abstract: Large Language Models (LLMs) are increasingly utilized for mental health support; however, current safety benchmarks often fail to detect the complex, longitudinal risks inherent in therapeutic dialogue. We introduce an evaluation framework that pairs AI psychotherapists with simulated patient agents equipped with dynamic cognitive-affective models and assesses therapy session simulations against a comprehensive quality of care and risk ontology. We apply this framework to a high-impact test case, Alcohol Use Disorder, evaluating six AI agents (including ChatGPT, Gemini, and Character AI) against a clinically-validated cohort of 15 patient personas representing diverse clinical phenotypes. Our large-scale simulation (N=369 sessions) reveals critical safety gaps in the use of AI for mental health support. We identify specific iatrogenic risks, including the validation of patient delusions (“AI Psychosis”) and failure to de-escalate suicide risk. Finally, we validate an interactive data visualization dashboard with diverse stakeholders, including AI engineers and red teamers, mental health professionals, and policy experts (N=9), demonstrating that this framework effectively enables stakeholders to audit the “black box” of AI psychotherapy. These findings underscore the critical safety risks of AI-provided mental health support and the necessity of simulation-based clinical red teaming before deployment.
[28] Optimizing Language Models for Crosslingual Knowledge Consistency
Tianyu Liu, Jirui Qi, Mrinmaya Sachan, Ryan Cotterell, Raquel Fernández, Arianna Bisazza
Main category: cs.CL
TL;DR: Direct Consistency Optimization (DCO) is a DPO-inspired reinforcement learning method that improves crosslingual consistency in multilingual LLMs without requiring explicit reward models.
Details
Motivation: Multilingual LLMs often exhibit inconsistent knowledge when asked similar questions in different languages, undermining their reliability. This work aims to mitigate this crosslingual inconsistency problem.Method: DCO uses reinforcement learning with a structured reward function derived directly from the LLM itself, requiring no explicit reward model. It’s inspired by DPO and optimizes for consistent crosslingual responses.
Result: DCO significantly improves crosslingual consistency across diverse LLMs, outperforms existing methods when training with multilingual samples, complements DPO when gold labels are available, and shows strong out-of-domain generalization.
Conclusion: DCO establishes itself as a robust and efficient solution for improving knowledge consistency across languages in multilingual LLMs, with controllable alignment via direction hyperparameters.
Abstract: Large language models are known to often exhibit inconsistent knowledge. This is particularly problematic in multilingual scenarios, where models are likely to be asked similar questions in different languages, and inconsistent responses can undermine their reliability. In this work, we show that this issue can be mitigated using reinforcement learning with a structured reward function, which leads to an optimal policy with consistent crosslingual responses. We introduce Direct Consistency Optimization (DCO), a DPO-inspired method that requires no explicit reward model and is derived directly from the LLM itself. Comprehensive experiments show that DCO significantly improves crosslingual consistency across diverse LLMs and outperforms existing methods when training with samples of multiple languages, while complementing DPO when gold labels are available. Extra experiments demonstrate the effectiveness of DCO in bilingual settings, significant out-of-domain generalizability, and controllable alignment via direction hyperparameters. Taken together, these results establish DCO as a robust and efficient solution for improving knowledge consistency across languages in multilingual LLMs. All code, training scripts, and evaluation benchmarks are released at https://github.com/Betswish/ConsistencyRL.
[29] Non-Zipfian Distribution of Stopwords and Subset Selection Models
Wentian Li, Oscar Fontanelli
Main category: cs.CL
TL;DR: The paper analyzes statistical properties of stopwords vs. non-stopwords, finding stopwords follow Beta Rank Function distribution while non-stopwords deviate from Zipf’s law and are better fitted by quadratic functions. It proposes a stopword selection model based on Hill’s function.
Details
Motivation: To understand the statistical differences between stopwords and non-stopwords in language texts, and to develop a mathematical model for stopword selection based on rank-frequency distributions.Method: Analyzes rank-frequency plots of stopwords vs. non-stopwords, proposes a selection probability model using Hill’s function where selection probability decreases with word rank, and validates the model using independent text collections.
Result: Stopwords follow Beta Rank Function distribution, non-stopwords deviate from Zipf’s law and are better fitted by quadratic functions. The proposed Hill’s function model accurately describes stopword selection and explains the observed distributions.
Conclusion: The paper provides mathematical foundations for understanding stopword distributions and offers a principled model for stopword selection based on statistical properties of language.
Abstract: Stopwords are words that are not very informative to the content or the meaning of a language text. Most stopwords are function words but can also be common verbs, adjectives and adverbs. In contrast to the well known Zipf’s law for rank-frequency plot for all words, the rank-frequency plot for stopwords are best fitted by the Beta Rank Function (BRF). On the other hand, the rank-frequency plots of non-stopwords also deviate from the Zipf’s law, but are fitted better by a quadratic function of log-token-count over log-rank than by BRF. Based on the observed rank of stopwords in the full word list, we propose a stopword (subset) selection model that the probability for being selected as a function of the word’s rank $r$ is a decreasing Hill’s function ($1/(1+(r/r_{mid})^γ)$); whereas the probability for not being selected is the standard Hill’s function ( $1/(1+(r_{mid}/r)^γ)$). We validate this selection probability model by a direct estimation from an independent collection of texts. We also show analytically that this model leads to a BRF rank-frequency distribution for stopwords when the original full word list follows the Zipf’s law, as well as explaining the quadratic fitting function for the non-stopwords.
[30] Hate Speech Detection using Large Language Models with Data Augmentation and Feature Enhancement
Brian Jing Hong Nge, Stefan Su, Thanh Thi Nguyen, Campbell Wilson, Alexandra Phelan, Naomi Pfitzner
Main category: cs.CL
TL;DR: Evaluation of data augmentation and feature enhancement techniques for hate speech detection, comparing traditional classifiers with transformer models across diverse datasets.
Details
Motivation: To improve hate speech detection by examining how different data augmentation techniques (SMOTE, weighted loss, POS tagging, text augmentation) affect various model architectures, and to understand the interaction between dataset properties, models, and enhancement strategies.Method: Comparative study of traditional classifiers (Delta TF-IDF) vs. transformer models (DistilBERT, RoBERTa, DeBERTa, Gemma-7B, gpt-oss-20b) across multiple hate speech datasets. Evaluated impact of SMOTE, weighted loss based on inverse class proportions, POS tagging, and text data augmentation on model performance.
Result: gpt-oss-20b consistently achieved highest results. Delta TF-IDF responded strongly to data augmentation, reaching 98.2% accuracy on Stormfront dataset. Implicit hate speech was more difficult to detect than explicit content. Enhancement effectiveness depended on dataset, model, and technique interaction.
Conclusion: Research informs hate speech detection development by highlighting how dataset properties, model architectures, and enhancement strategies interact, supporting more accurate and context-aware automated detection.
Abstract: This paper evaluates data augmentation and feature enhancement techniques for hate speech detection, comparing traditional classifiers, e.g., Delta Term Frequency-Inverse Document Frequency (Delta TF-IDF), with transformer-based models (DistilBERT, RoBERTa, DeBERTa, Gemma-7B, gpt-oss-20b) across diverse datasets. It examines the impact of Synthetic Minority Over-sampling Technique (SMOTE), weighted loss determined by inverse class proportions, Part-of-Speech (POS) tagging, and text data augmentation on model performance. The open-source gpt-oss-20b consistently achieves the highest results. On the other hand, Delta TF-IDF responds strongly to data augmentation, reaching 98.2% accuracy on the Stormfront dataset. The study confirms that implicit hate speech is more difficult to detect than explicit hateful content and that enhancement effectiveness depends on dataset, model, and technique interaction. Our research informs the development of hate speech detection by highlighting how dataset properties, model architectures, and enhancement strategies interact, supporting more accurate and context-aware automated detection.
[31] Detection of Illicit Content on Online Marketplaces using Large Language Models
Quoc Khoa Tran, Thanh Thi Nguyen, Campbell Wilson
Main category: cs.CL
TL;DR: LLMs (Llama 3.2 & Gemma 3) outperform traditional ML methods for complex multi-class classification of illicit online marketplace content, showing task-dependent advantages over BERT and traditional baselines.
Details
Motivation: Online marketplaces facilitate illicit activities like drug trafficking and counterfeit sales, but traditional content moderation methods (manual reviews, rule-based systems, conventional ML) struggle with scalability, dynamic obfuscation, multilingual content, and semantic complexities.Method: Fine-tuned LLMs (Meta’s Llama 3.2 and Google’s Gemma 3) using Parameter-Efficient Fine-Tuning (PEFT) and quantization techniques on the multilingual DUTA10K dataset, benchmarked against BERT and traditional ML baselines (Support Vector Machines and Naive Bayes).
Result: LLMs showed task-dependent advantages: comparable performance to traditional methods for binary classification (illicit vs. non-illicit), but Llama 3.2 significantly surpassed all baselines for complex, imbalanced multi-class classification across 40 specific illicit categories.
Conclusion: LLMs offer substantial practical implications for enhancing online safety, providing law enforcement, e-commerce platforms, and cybersecurity specialists with more effective, scalable, and adaptive tools for illicit content detection and moderation.
Abstract: Online marketplaces, while revolutionizing global commerce, have inadvertently facilitated the proliferation of illicit activities, including drug trafficking, counterfeit sales, and cybercrimes. Traditional content moderation methods such as manual reviews and rule-based automated systems struggle with scalability, dynamic obfuscation techniques, and multilingual content. Conventional machine learning models, though effective in simpler contexts, often falter when confronting the semantic complexities and linguistic nuances characteristic of illicit marketplace communications. This research investigates the efficacy of Large Language Models (LLMs), specifically Meta’s Llama 3.2 and Google’s Gemma 3, in detecting and classifying illicit online marketplace content using the multilingual DUTA10K dataset. Employing fine-tuning techniques such as Parameter-Efficient Fine-Tuning (PEFT) and quantization, these models were systematically benchmarked against a foundational transformer-based model (BERT) and traditional machine learning baselines (Support Vector Machines and Naive Bayes). Experimental results reveal a task-dependent advantage for LLMs. In binary classification (illicit vs. non-illicit), Llama 3.2 demonstrated performance comparable to traditional methods. However, for complex, imbalanced multi-class classification involving 40 specific illicit categories, Llama 3.2 significantly surpassed all baseline models. These findings offer substantial practical implications for enhancing online safety, equipping law enforcement agencies, e-commerce platforms, and cybersecurity specialists with more effective, scalable, and adaptive tools for illicit content detection and moderation.
[32] AI-Assisted Moot Courts: Simulating Justice-Specific Questioning in Oral Arguments
Kylie Zhang, Nimra Nadeem, Lucia Zheng, Dominik Stammbach, Peter Henderson
Main category: cs.CL
TL;DR: AI models can simulate Supreme Court justice questioning for legal training, achieving realistic questions but with limitations in diversity and sycophancy.
Details
Motivation: To develop AI systems that can effectively simulate justice-specific questioning for moot court training, helping law students and attorneys prepare for oral arguments by anticipating legal issues and detecting argument weaknesses.Method: Used U.S. Supreme Court oral argument transcripts dataset, created two-layer evaluation framework assessing realism and pedagogical usefulness, and constructed both prompt-based and agentic oral argument simulators.
Result: Simulated questions were often perceived as realistic by human annotators and achieved high recall of ground truth substantive legal issues, but showed low diversity in question types and sycophancy issues.
Conclusion: AI models can generate realistic justice questioning for legal training, but current limitations in question diversity and sycophancy would be missed by naive evaluation approaches, highlighting the need for sophisticated evaluation frameworks.
Abstract: In oral arguments, judges probe attorneys with questions about the factual record, legal claims, and the strength of their arguments. To prepare for this questioning, both law schools and practicing attorneys rely on moot courts: practice simulations of appellate hearings. Leveraging a dataset of U.S. Supreme Court oral argument transcripts, we examine whether AI models can effectively simulate justice-specific questioning for moot court-style training. Evaluating oral argument simulation is challenging because there is no single correct question for any given turn. Instead, effective questioning should reflect a combination of desirable qualities, such as anticipating substantive legal issues, detecting logical weaknesses, and maintaining an appropriately adversarial tone. We introduce a two-layer evaluation framework that assesses both the realism and pedagogical usefulness of simulated questions using complementary proxy metrics. We construct and evaluate both prompt-based and agentic oral argument simulators. We find that simulated questions are often perceived as realistic by human annotators and achieve high recall of ground truth substantive legal issues. However, models still face substantial shortcomings, including low diversity in question types and sycophancy. Importantly, these shortcomings would remain undetected under naive evaluation approaches.
[33] IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation
Bosi Wen, Yilin Niu, Cunxiang Wang, Xiaoying Ling, Ying Zhang, Pei Ke, Hongning Wang, Minlie Huang
Main category: cs.CL
TL;DR: IF-RewardBench is a comprehensive meta-evaluation benchmark for instruction-following that introduces listwise evaluation via preference graphs to better assess judge model reliability for guiding LLM alignment.
Details
Motivation: Current judge models for evaluating instruction-following in LLMs lack reliable assessment due to insufficient meta-evaluation benchmarks with limited data coverage and oversimplified pairwise evaluation paradigms that don't align with real optimization scenarios.Method: Proposes IF-RewardBench covering diverse instruction and constraint types, constructing preference graphs for each instruction containing all pairwise preferences among multiple responses based on instruction-following quality, enabling listwise evaluation of judge models.
Result: Extensive experiments reveal significant deficiencies in current judge models and demonstrate that IF-RewardBench achieves stronger positive correlation with downstream task performance compared to existing benchmarks.
Conclusion: IF-RewardBench provides a more comprehensive and reliable meta-evaluation framework for instruction-following judge models, addressing limitations of existing benchmarks and better guiding LLM alignment through listwise evaluation.
Abstract: Instruction-following is a foundational capability of large language models (LLMs), with its improvement hinging on scalable and accurate feedback from judge models. However, the reliability of current judge models in instruction-following remains underexplored due to several deficiencies of existing meta-evaluation benchmarks, such as their insufficient data coverage and oversimplified pairwise evaluation paradigms that misalign with model optimization scenarios. To this end, we propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following that covers diverse instruction and constraint types. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses based on instruction-following quality. This design enables a listwise evaluation paradigm that assesses the capabilities of judge models to rank multiple responses, which is essential in guiding model alignment. Extensive experiments on IF-RewardBench reveal significant deficiencies in current judge models and demonstrate that our benchmark achieves a stronger positive correlation with downstream task performance compared to existing benchmarks. Our codes and data are available at https://github.com/thu-coai/IF-RewardBench.
[34] Stacked from One: Multi-Scale Self-Injection for Context Window Extension
Wei Han, Pan Zhou, Shuicheng Yan
Main category: cs.CL
TL;DR: SharedLLM is a framework using multi-grained context compression and query-aware information acquisition to extend LLM context windows without expensive continual pre-training, achieving efficient long-context processing with minimal training.
Details
Motivation: The limited context window of contemporary LLMs is a major bottleneck for broader applications. Continual pre-training on long-context data is prohibitively expensive in terms of data acquisition and computational costs, necessitating more efficient solutions.Method: SharedLLM uses two stacked short-context LLMs: a lower model as compressor and upper model as decoder. The lower model compresses long inputs into multi-grained representations transferred to the upper model via self-injection (using same underlying LLM layers). A tree-based data structure enables efficient encoding and query-aware retrieval of contextual information.
Result: Despite training on only 8K token sequences, SharedLLM generalizes to inputs exceeding 128K tokens. It achieves superior or comparable performance to strong baselines on long-context benchmarks while reducing memory footprint and providing 2-3× inference speedups over streaming and encoder-decoder architectures.
Conclusion: SharedLLM offers an efficient solution to LLM context window limitations through multi-grained compression and query-aware information acquisition, balancing efficiency and accuracy while enabling practical long-context applications.
Abstract: The limited context window of contemporary large language models (LLMs) remains a primary bottleneck for their broader application across diverse domains. Although continual pre-training on long-context data offers a straightforward solution, it incurs prohibitive data acquisition and computational costs. To address this challenge, we propose~\modelname, a novel framework based on multi-grained context compression and query-aware information acquisition. SharedLLM comprises two stacked short-context LLMs: a lower model serving as a compressor and an upper model acting as a decoder. The lower model compresses long inputs into compact, multi-grained representations, which are then forwarded to the upper model for context-aware processing. To maximize efficiency, this information transfer occurs exclusively at the lowest layers, bypassing lengthy forward passes and redundant cross-attention operations. This entire process, wherein the upper and lower models are derived from the same underlying LLM layers, is termed~\textit{self-injection}. To support this architecture, a specialized tree-based data structure enables the efficient encoding and query-aware retrieval of contextual information. Despite being trained on sequences of only 8K tokens, \modelnameeffectively generalizes to inputs exceeding 128K tokens. Across a comprehensive suite of long-context modeling and understanding benchmarks, \modelnameachieves performance superior or comparable to strong baselines, striking an optimal balance between efficiency and accuracy. Furthermore, these design choices allow \modelname~to substantially reduce the memory footprint and yield notable inference speedups ($2\times$ over streaming and $3\times$ over encoder-decoder architectures).
[35] TSEmbed: Unlocking Task Scaling in Universal Multimodal Embeddings
Yebo Wu, Feng Liu, Ziwei Xie, Zhiyuan Liu, Changwang Zhang, Jun Wang, Li Li
Main category: cs.CL
TL;DR: TSEmbed is a universal multimodal embedding framework that uses Mixture-of-Experts with LoRA to address task conflicts in MLLMs, combined with Expert-Aware Negative Sampling for better discriminative power.
Details
Motivation: Multimodal Large Language Models (MLLMs) have strong reasoning capabilities but face task conflicts when adapted as universal embedding models, limiting their effectiveness across diverse multimodal tasks.Method: Proposes TSEmbed framework combining Mixture-of-Experts (MoE) with Low-Rank Adaptation (LoRA) to disentangle conflicting tasks, plus Expert-Aware Negative Sampling (EANS) that uses expert routing distributions as semantic similarity proxy, and a two-stage training paradigm for stability.
Result: Achieves state-of-the-art performance on both the Massive Multimodal Embedding Benchmark (MMEB) and real-world industrial production datasets.
Conclusion: TSEmbed lays a foundation for task-level scaling in universal multimodal embeddings by effectively addressing task conflicts in MLLM adaptation.
Abstract: Despite the exceptional reasoning capabilities of Multimodal Large Language Models (MLLMs), their adaptation into universal embedding models is significantly impeded by task conflict. To address this, we propose TSEmbed, a universal multimodal embedding framework that synergizes Mixture-of-Experts (MoE) with Low-Rank Adaptation (LoRA) to explicitly disentangle conflicting task objectives. Moreover, we introduce Expert-Aware Negative Sampling (EANS), a novel strategy that leverages expert routing distributions as an intrinsic proxy for semantic similarity. By dynamically prioritizing informative hard negatives that share expert activation patterns with the query, EANS effectively sharpens the model’s discriminative power and refines embedding boundaries. To ensure training stability, we further devise a two-stage learning paradigm that solidifies expert specialization before optimizing representations via EANS. TSEmbed achieves state-of-the-art performance on both the Massive Multimodal Embedding Benchmark (MMEB) and real-world industrial production datasets, laying a foundation for task-level scaling in universal multimodal embeddings.
[36] Attention’s Gravitational Field:A Power-Law Interpretation of Positional Correlation
Edward Zhang
Main category: cs.CL
TL;DR: The paper introduces Attention Gravitational Field (AGF) theory to explain positional relationships in LLMs, showing it aligns with Newton’s gravitational law and improves model accuracy by decoupling positional from semantic encodings.
Details
Motivation: To better understand and interpret the attention mechanism in LLMs, particularly how positional relationships and encodings work, and to provide a theoretical framework for model optimization.Method: Introduces the concept of Attention Gravitational Field (AGF), decouples positional encodings from semantic embeddings, analyzes AGF’s consistency with learning/stability curves, and demonstrates empirical alignment with Newton’s Law of Universal Gravitation.
Result: Achieves superior accuracy compared to prevailing encoding methods and provides a rigorous theoretical exploration that helps interpret the Attention mechanism.
Conclusion: The AGF framework represents a significant step toward interpreting Attention mechanisms and unlocks new possibilities for future research in model optimization and interpretability.
Abstract: This paper explores the underlying principles of positional relationships and encodings within Large Language Models (LLMs) and introduces the concept of the Attention Gravitational Field (AGF). By decoupling positional encodings from semantic embeddings, we optimize the model architecture and achieve superior accuracy compared to prevailing encoding methods. Furthermore, we provide an in-depth analysis of AGF, demonstrating its intrinsic consistency with learning and stability curves, as well as its empirical alignment with Newton’s Law of Universal Gravitation. By offering a rigorous theoretical exploration of these phenomena, this work represents a significant step toward interpreting the Attention mechanism and unlocks new possibilities for future research in model optimization and interpretability.
[37] Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents
Natchanon Pollertlam, Witchayut Kornsuwannawit
Main category: cs.CL
TL;DR: Comparison of long-context LLMs vs fact-based memory systems for conversational AI, showing trade-offs between accuracy and cost with memory systems becoming cheaper after ~10 turns at 100k context length.
Details
Motivation: Persistent conversational AI systems need to handle long conversation histories, with two main approaches: using long-context LLMs that process full histories, or maintaining dedicated memory systems that extract and retrieve structured facts. The paper aims to compare these architectures on accuracy and cost metrics.Method: Compared a fact-based memory system built on Mem0 framework against long-context LLM inference (GPT-5-mini) on three memory-centric benchmarks: LongMemEval, LoCoMo, and PersonaMemv2. Evaluated both architectures on accuracy and cumulative API cost, constructing a cost model that incorporates prompt caching.
Result: Long-context GPT-5-mini achieved higher factual recall on LongMemEval and LoCoMo, while the memory system was competitive on PersonaMemv2 where persona consistency depends on stable factual attributes. The memory system becomes cheaper after approximately ten interaction turns at 100k context length, with break-even point decreasing as context length grows.
Conclusion: The study characterizes the accuracy-cost trade-off between long-context LLMs and memory systems, providing concrete criteria for selecting between them in production deployments based on context length, number of turns, and specific memory requirements.
Abstract: Persistent conversational AI systems face a choice between passing full conversation histories to a long-context large language model (LLM) and maintaining a dedicated memory system that extracts and retrieves structured facts. We compare a fact-based memory system built on the Mem0 framework against long-context LLM inference on three memory-centric benchmarks - LongMemEval, LoCoMo, and PersonaMemv2 - and evaluate both architectures on accuracy and cumulative API cost. Long-context GPT-5-mini achieves higher factual recall on LongMemEval and LoCoMo, while the memory system is competitive on PersonaMemv2, where persona consistency depends on stable, factual attributes suited to flat-typed extraction. We construct a cost model that incorporates prompt caching and show that the two architectures have structurally different cost profiles: long-context inference incurs a per-turn charge that grows with context length even under caching, while the memory system’s per-turn read cost remains roughly fixed after a one-time write phase. At a context length of 100k tokens, the memory system becomes cheaper after approximately ten interaction turns, with the break-even point decreasing as context length grows. These results characterize the accuracy-cost trade-off between the two approaches and provide a concrete criterion for selecting between them in production deployments.
[38] Autoscoring Anticlimax: A Meta-analytic Understanding of AI’s Short-answer Shortcomings and Wording Weaknesses
Michael Hardy
Main category: cs.CL
TL;DR: Meta-analysis of 890 LLM short-answer scoring results shows decoder-only architectures underperform encoders by 0.37 QWK, task difficulty for humans doesn’t predict LLM performance, and LLMs exhibit racial discrimination in educational contexts.
Details
Motivation: Automated short-answer scoring lags behind other LLM applications, and there's a need to understand why LLMs underperform in educational assessment tasks compared to human experts.Method: Systematic review and meta-analysis of 890 results from LLM short-answer scoring studies using mixed effects metaregression with Quadratic Weighted Kappa (QWK) effect size, plus additional experiments on wording/tokenization sensitivity and bias elicitation.
Result: Decoder-only architectures underperform encoders by 0.37 QWK; human task difficulty doesn’t predict LLM performance (some easiest human tasks were hardest for LLMs); tokenizer vocabulary size shows diminishing returns; LLMs demonstrate racial discrimination in high-stakes education contexts.
Conclusion: Systems design should better anticipate statistical shortcomings of autoregressive models, and LLMs show concerning biases in educational applications that need addressing.
Abstract: Automated short-answer scoring lags other LLM applications. We meta-analyze 890 culminating results across a systematic review of LLM short-answer scoring studies, modeling the traditional effect size of Quadratic Weighted Kappa (QWK) with mixed effects metaregression. We quantitatively illustrate that that the level of difficulty for human experts to perform the task of scoring written work of children has no observed statistical effect on LLM performance. Particularly, we show that some scoring tasks measured as the easiest by human scorers were the hardest for LLMs. Whether by poor implementation by thoughtful researchers or patterns traceable to autoregressive training, on average decoder-only architectures underperform encoders by 0.37–a substantial difference in agreement with humans. Additionally, we measure the contributions of various aspects of LLM technology on successful scoring such as tokenizer vocabulary size, which exhibits diminishing returns–potentially due to undertrained tokens. Findings argue for systems design which better anticipates known statistical shortcomings of autoregressive models. Finally, we provide additional experiments to illustrate wording and tokenization sensitivity and bias elicitation in high-stakes education contexts, where LLMs demonstrate racial discrimination. Code and data for this study are available.
[39] From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models
Ruiqi Zhang, Lingxiang Wang, Hainan Zhang, Zhiming Zheng, Yanyan Lan
Main category: cs.CL
TL;DR: GDS: A gradient-based method for detecting pre-training data in LLMs using gradient deviation scores to distinguish member vs non-member samples based on optimization behavior differences.
Details
Motivation: Address copyright concerns and benchmark contamination by detecting pre-training data in LLMs. Existing methods have limitations: likelihood-based approaches suffer from word frequency bias, and fine-tuning-based methods depend on similarity of fine-tuning data.Method: Proposes GDS (Gradient Deviation Scores) method that identifies pre-training data by analyzing gradient behavior. Captures gradient profiles including magnitude, location, and concentration of parameter updates across FFN and Attention modules. Uses these features with a lightweight classifier for binary membership inference.
Result: Achieves state-of-the-art performance on five public datasets with significantly improved cross-dataset transferability over strong baselines. Gradient feature distribution differences enable practical and scalable pre-training data detection.
Conclusion: GDS provides an effective gradient-based approach for pre-training data detection with better transferability than existing methods, addressing limitations of likelihood-based and fine-tuning-dependent approaches.
Abstract: Pre-training data detection for LLMs is essential for addressing copyright concerns and mitigating benchmark contamination. Existing methods mainly focus on the likelihood-based statistical features or heuristic signals before and after fine-tuning, but the former are susceptible to word frequency bias in corpora, and the latter strongly depend on the similarity of fine-tuning data. From an optimization perspective, we observe that during training, samples transition from unfamiliar to familiar in a manner reflected by systematic differences in gradient behavior. Familiar samples exhibit smaller update magnitudes, distinct update locations in model components, and more sharply activated neurons. Based on this insight, we propose GDS, a method that identifies pre-training data by probing Gradient Deviation Scores of target samples. Specifically, we first represent each sample using gradient profiles that capture the magnitude, location, and concentration of parameter updates across FFN and Attention modules, revealing consistent distinctions between member and non-member data. These features are then fed into a lightweight classifier to perform binary membership inference. Experiments on five public datasets show that GDS achieves state-of-the-art performance with significantly improved cross-dataset transferability over strong baselines. Further interpretability analyse show gradient feature distribution differences, enabling practical and scalable pre-training data detection.
[40] SinhaLegal: A Benchmark Corpus for Information Extraction and Analysis in Sinhala Legislative Texts
Minduli Lasandi, Nevidu Jayatilleke
Main category: cs.CL
TL;DR: SinhaLegal is a Sinhala legislative text corpus with ~2M words from 1,206 legal documents (Acts and Bills), processed with OCR and manual cleaning, designed for NLP tasks in legal domain.
Details
Motivation: Address the lack of high-quality, machine-readable Sinhala legal text resources to support NLP research and applications in the legal domain, particularly for Sinhala language processing.Method: Systematic collection of official legal documents, OCR extraction using Google Document AI, extensive post-processing and manual cleaning, creation of metadata files, and comprehensive evaluation including corpus statistics, lexical analysis, NER, topic modeling, and perplexity analysis with language models.
Result: Created a high-quality corpus of 2 million words from 1,206 legal documents, demonstrated structured domain-specific nature through various analyses, and showed language models’ effectiveness on domain-specific texts through perplexity evaluation.
Conclusion: SinhaLegal corpus fills a critical gap in Sinhala legal NLP resources and provides a valuable dataset for tasks like summarization, information extraction, and legal text analysis.
Abstract: SinhaLegal introduces a Sinhala legislative text corpus containing approximately 2 million words across 1,206 legal documents. The dataset includes two types of legal documents: 1,065 Acts dated from 1981 to 2014 and 141 Bills from 2010 to 2014, which were systematically collected from official sources. The texts were extracted using OCR with Google Document AI, followed by extensive post-processing and manual cleaning to ensure high-quality, machine-readable content, along with dedicated metadata files for each document. A comprehensive evaluation was conducted, including corpus statistics, lexical diversity, word frequency analysis, named entity recognition, and topic modelling, demonstrating the structured and domain-specific nature of the corpus. Additionally, perplexity analysis using both large and small language models was performed to assess how effectively language models respond to domain-specific texts. The SinhaLegal corpus represents a vital resource designed to support NLP tasks such as summarisation, information extraction, and analysis, thereby bridging a critical gap in Sinhala legal research.
[41] HACHIMI: Scalable and Controllable Student Persona Generation via Orchestrated Agents
Yilin Jiang, Fei Tan, Xuanyu Yin, Jing Leng, Aimin Zhou
Main category: cs.CL
TL;DR: HACHIMI introduces a multi-agent framework for generating theory-aligned, distribution-controllable student personas for educational LLMs, creating a 1M persona corpus with validated educational schema and quota control.
Details
Motivation: Current student persona generation for educational LLMs relies on ad-hoc prompting or hand-crafted profiles with limited control over educational theory and population distributions, lacking systematic alignment with educational theories and proper distribution control.Method: HACHIMI uses a multi-agent Propose-Validate-Revise framework that factorizes personas into theory-anchored educational schema, enforces developmental/psychological constraints via neuro-symbolic validation, and combines stratified sampling with semantic deduplication to reduce mode collapse.
Result: Generated HACHIMI-1M corpus with 1 million personas for Grades 1-12 showing near-perfect schema validity, accurate quotas, and substantial diversity. External evaluation with CEPS and PISA 2022 surveys shows strong alignment for math and curiosity/growth constructs, moderate alignment for classroom-climate and well-being constructs.
Conclusion: HACHIMI provides a standardized synthetic student population for group-level benchmarking and social-science simulations, offering a systematic approach to persona generation that addresses theory alignment and distribution control limitations of prior methods.
Abstract: Student Personas (SPs) are emerging as infrastructure for educational LLMs, yet prior work often relies on ad-hoc prompting or hand-crafted profiles with limited control over educational theory and population distributions. We formalize this as Theory-Aligned and Distribution-Controllable Persona Generation (TAD-PG) and introduce HACHIMI, a multi-agent Propose-Validate-Revise framework that generates theory-aligned, quota-controlled personas. HACHIMI factorizes each persona into a theory-anchored educational schema, enforces developmental and psychological constraints via a neuro-symbolic validator, and combines stratified sampling with semantic deduplication to reduce mode collapse. The resulting HACHIMI-1M corpus comprises 1 million personas for Grades 1-12. Intrinsic evaluation shows near-perfect schema validity, accurate quotas, and substantial diversity, while external evaluation instantiates personas as student agents answering CEPS and PISA 2022 surveys; across 16 cohorts, math and curiosity/growth constructs align strongly between humans and agents, whereas classroom-climate and well-being constructs are only moderately aligned, revealing a fidelity gradient. All personas are generated with Qwen2.5-72B, and HACHIMI provides a standardized synthetic student population for group-level benchmarking and social-science simulations. Resources available at https://github.com/ZeroLoss-Lab/HACHIMI
[42] FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications
Yunfan Zhang, Yijie Bei, Jetashree Ravi, Pawel Garbacki
Main category: cs.CL
TL;DR: FireBench is a new benchmark for evaluating LLM instruction following in enterprise/API contexts, focusing on real-world usage patterns across 6 capability dimensions with 2,400+ samples.
Details
Motivation: Existing instruction following benchmarks focus on chat assistant needs rather than enterprise requirements where strict adherence to output formats, content constraints, and procedural requirements is essential for reliable LLM-assisted workflows.Method: Created FireBench benchmark with 6 core capability dimensions across diverse enterprise applications (information extraction, customer support, coding agents), comprising over 2,400 samples. Evaluated 11 LLMs on their instruction following behavior in enterprise scenarios.
Result: Presented key findings on LLM instruction following behavior in enterprise scenarios. The benchmark is open-sourced at fire-bench.com for users to assess model suitability and for developers to diagnose performance.
Conclusion: FireBench addresses the gap in enterprise-focused instruction following evaluation and provides a valuable tool for assessing LLM suitability for real-world enterprise and API-driven applications.
Abstract: Instruction following is critical for LLMs deployed in enterprise and API-driven settings, where strict adherence to output formats, content constraints, and procedural requirements is essential for enabling reliable LLM-assisted workflows. However, existing instruction following benchmarks predominantly evaluate natural language generation constraints that reflect the needs of chat assistants rather than enterprise users. To bridge this gap, we introduce FireBench, an LLM instruction following benchmark grounded in real-world enterprise and API usage patterns. FireBench evaluates six core capability dimensions across diverse applications including information extraction, customer support, and coding agents, comprising over 2,400 samples. We evaluate 11 LLMs and present key findings on their instruction following behavior in enterprise scenarios. We open-source FireBench at fire-bench.com to help users assess model suitability, support model developers in diagnosing performance, and invite community contributions.
[43] Free Lunch for Pass@$k$? Low Cost Diverse Sampling for Diffusion Language Models
Sean Lamont, Christian Walder, Paul Montague, Amir Dezfouli, Michael Norrish
Main category: cs.CL
TL;DR: Training-free method to enhance diversity in diffusion language models by repelling intermediate samples from each other’s feature spaces during batch generation, improving Pass@k performance on code and math tasks.
Details
Motivation: Traditional sampling approaches in text generation waste computational resources on repetitive failure modes, and diffusion language models also suffer from redundancy where independent samples collapse into similar modes, reducing diversity needed for effective exploration in complex reasoning tasks like code generation and mathematical problem solving.Method: Proposes a training-free, low-cost intervention that modifies intermediate samples in a batch sequentially, where each sample is repelled from the feature space of previous samples to actively penalize redundancy. This approach requires no retraining or beam search and incurs negligible computational overhead.
Result: Evaluation on HumanEval and GSM8K benchmarks using LLaDA-8B-Instruct model demonstrates significantly improved diversity and Pass@k performance across various temperature settings.
Conclusion: The method offers an immediate, low-cost improvement for current and future diffusion language models in tasks that benefit from diverse solution search, providing a simple modification to the sampling process that enhances generative diversity.
Abstract: Diverse outputs in text generation are necessary for effective exploration in complex reasoning tasks, such as code generation and mathematical problem solving. Such Pass@$k$ problems benefit from distinct candidates covering the solution space. However, traditional sampling approaches often waste computational resources on repetitive failure modes. While Diffusion Language Models have emerged as a competitive alternative to the prevailing Autoregressive paradigm, they remain susceptible to this redundancy, with independent samples frequently collapsing into similar modes. To address this, we propose a training free, low cost intervention to enhance generative diversity in Diffusion Language Models. Our approach modifies intermediate samples in a batch sequentially, where each sample is repelled from the feature space of previous samples, actively penalising redundancy. Unlike prior methods that require retraining or beam search, our strategy incurs negligible computational overhead, while ensuring that each sample contributes a unique perspective to the batch. We evaluate our method on the HumanEval and GSM8K benchmarks using the LLaDA-8B-Instruct model. Our results demonstrate significantly improved diversity and Pass@$k$ performance across various temperature settings. As a simple modification to the sampling process, our method offers an immediate, low-cost improvement for current and future Diffusion Language Models in tasks that benefit from diverse solution search. We make our code available at https://github.com/sean-lamont/odd.
[44] Can LLMs Capture Expert Uncertainty? A Comparative Analysis of Value Alignment in Ethnographic Qualitative Research
Arina Kostina, Marios Dikaiakos, Alejandro Porcel, Tassos Stassopoulos
Main category: cs.CL
TL;DR: LLMs approach human performance on identifying Schwartz values in interviews but struggle with exact rankings and show different uncertainty patterns than experts.
Details
Motivation: To evaluate whether LLMs can reliably perform nuanced qualitative analysis of human values in open-ended interviews, addressing the inherent ambiguity in interpretive tasks.Method: Compared LLM outputs to expert annotations on identifying top three human values from long-form interviews using Schwartz Theory of Basic Values framework, analyzing performance metrics (F1, Jaccard, RBO) and uncertainty patterns.
Result: LLMs approach human ceiling on set-based metrics but struggle with exact rankings (lower RBO). Qwen performed closest to expert-level agreement. LLM ensembles (Majority Vote, Borda Count) improved performance. Models showed systematic overemphasis on certain values like Security.
Conclusion: LLMs show promise as collaborators in qualitative value analysis but have limitations in ranking accuracy and exhibit different uncertainty patterns and potential value biases compared to human experts.
Abstract: Qualitative analysis of open-ended interviews plays a central role in ethnographic and economic research by uncovering individuals’ values, motivations, and culturally embedded financial behaviors. While large language models (LLMs) offer promising support for automating and enriching such interpretive work, their ability to produce nuanced, reliable interpretations under inherent task ambiguity remains unclear. In our work we evaluate LLMs on the task of identifying the top three human values expressed in long-form interviews based on the Schwartz Theory of Basic Values framework. We compare their outputs to expert annotations, analyzing both performance and uncertainty patterns relative to the experts. Results show that LLMs approach the human ceiling on set-based metrics (F1, Jaccard) but struggle to recover exact value rankings, as reflected in lower RBO scores. While the average Schwartz value distributions of most models closely match those of human analysts, their uncertainty structures across the Schwartz values diverge from expert uncertainty patterns. Among the evaluated models, Qwen performs closest to expert-level agreement and exhibits the strongest alignment with expert Schwartz value distributions. LLM ensemble methods yield consistent gains across metrics, with Majority Vote and Borda Count performing best. Notably, systematic overemphasis on certain Schwartz values, like Security, suggests both the potential of LLMs to provide complementary perspectives and the need to further investigate model-induced value biases. Overall, our findings highlight both the promise and the limitations of LLMs as collaborators in inherently ambiguous qualitative value analysis.
[45] AILS-NTUA at SemEval-2026 Task 10: Agentic LLMs for Psycholinguistic Marker Extraction and Conspiracy Endorsement Detection
Panagiotis Alexios Spanakis, Maria Lymperaiou, Giorgos Filandrianos, Athanasios Voulodimos, Giorgos Stamou
Main category: cs.CL
TL;DR: A novel LLM pipeline for conspiracy marker extraction and endorsement detection using decoupled semantic reasoning and structural localization with DD-CoT and Anti-Echo Chamber architecture.
Details
Motivation: Traditional classifiers conflate semantic reasoning with structural localization, making conspiracy detection challenging. The paper aims to create an interpretable, psycholinguistically-grounded NLP system that can accurately extract conspiracy markers and detect endorsement while avoiding false penalties for objective reporting.Method: Decoupled design separating marker extraction and conspiracy detection. For marker extraction: Dynamic Discriminative Chain-of-Thought (DD-CoT) with deterministic anchoring to resolve semantic ambiguity. For conspiracy detection: “Anti-Echo Chamber” architecture with adversarial Parallel Council adjudicated by a Calibrated Judge to overcome the “Reporter Trap” where models falsely penalize objective reporting.
Result: Achieved 0.24 Macro F1 (+100% over baseline) on S1 and 0.79 Macro F1 (+49%) on S2. The S1 system ranked 3rd on the development leaderboard for SemEval-2026 Task 10.
Conclusion: The approach establishes a versatile paradigm for interpretable, psycholinguistically-grounded NLP that effectively handles conspiracy marker extraction and endorsement detection through decoupled reasoning and specialized architectures.
Abstract: This paper presents a novel agentic LLM pipeline for SemEval-2026 Task 10 that jointly extracts psycholinguistic conspiracy markers and detects conspiracy endorsement. Unlike traditional classifiers that conflate semantic reasoning with structural localization, our decoupled design isolates these challenges. For marker extraction, we propose Dynamic Discriminative Chain-of-Thought (DD-CoT) with deterministic anchoring to resolve semantic ambiguity and character-level brittleness. For conspiracy detection, an “Anti-Echo Chamber” architecture, consisting of an adversarial Parallel Council adjudicated by a Calibrated Judge, overcomes the “Reporter Trap,” where models falsely penalize objective reporting. Achieving 0.24 Macro F1 (+100% over baseline) on S1 and 0.79 Macro F1 (+49%) on S2, with the S1 system ranking 3rd on the development leaderboard, our approach establishes a versatile paradigm for interpretable, psycholinguistically-grounded NLP.
[46] AILS-NTUA at SemEval-2026 Task 3: Efficient Dimensional Aspect-Based Sentiment Analysis
Stavros Gazetas, Giorgos Filandrianos, Maria Lymperaiou, Paraskevi Tzouveli, Athanasios Voulodimos, Giorgos Stamou
Main category: cs.CL
TL;DR: AILS-NTUA system for multilingual Dimensional Aspect-Based Sentiment Analysis using fine-tuned encoders for sentiment regression and instruction-tuned LLMs with LoRA for structured extraction tasks.
Details
Motivation: To address the challenges of Dimensional Aspect-Based Sentiment Analysis (DimABSA) in multilingual and multi-domain settings, which requires handling continuous sentiment values and structured extraction tasks across different languages and domains.Method: Combines fine-tuning of language-appropriate encoder backbones for continuous sentiment prediction with language-specific instruction tuning of large language models using LoRA for structured triplet and quadruplet extraction, emphasizing parameter-efficient specialization.
Result: The proposed models achieve competitive performance and consistently surpass the provided baselines across most evaluation settings in the SemEval-2026 Task 3 Track-A competition.
Conclusion: The unified yet task-adaptive design enables reduced training and inference requirements while maintaining strong effectiveness for multilingual Dimensional Aspect-Based Sentiment Analysis tasks.
Abstract: In this paper, we present AILS-NTUA system for Track-A of SemEval-2026 Task 3 on Dimensional Aspect-Based Sentiment Analysis (DimABSA), which encompasses three complementary problems: Dimensional Aspect Sentiment Regression (DimASR), Dimensional Aspect Sentiment Triplet Extraction (DimASTE), and Dimensional Aspect Sentiment Quadruplet Prediction (DimASQP) within a multilingual and multi-domain framework. Our methodology combines fine-tuning of language-appropriate encoder backbones for continuous aspect-level sentiment prediction with language-specific instruction tuning of large language models using LoRA for structured triplet and quadruplet extraction. This unified yet task-adaptive design emphasizes parameter-efficient specialization across languages and domains, enabling reduced training and inference requirements while maintaining strong effectiveness. Empirical results demonstrate that the proposed models achieve competitive performance and consistently surpass the provided baselines across most evaluation settings.
[47] Federated Heterogeneous Language Model Optimization for Hybrid Automatic Speech Recognition
Mengze Hong, Yi Gu, Di Jiang, Hanlin Gu, Chen Jason Zhang, Lu Wang, Zhiyang Su
Main category: cs.CL
TL;DR: Proposes match-and-merge paradigm with genetic and reinforcement learning algorithms to merge heterogeneous language models (n-gram and neural) in federated ASR systems for privacy-preserving speech recognition.
Details
Motivation: Federated learning for ASR produces multiple local models that need merging. While acoustic models have established merging methods, language models for rescoring face challenges due to heterogeneity between traditional n-gram models and neural network models.Method: Introduces heterogeneous LM optimization task and match-and-merge paradigm with two algorithms: 1) Genetic Match-and-Merge Algorithm (GMMA) using genetic operations to evolve and pair LMs, and 2) Reinforced Match-and-Merge Algorithm (RMMA) leveraging reinforcement learning for efficient convergence.
Result: Experiments on seven OpenSLR datasets show RMMA achieves the lowest average Character Error Rate and better generalization than baselines, converging up to seven times faster than GMMA.
Conclusion: The match-and-merge paradigm shows potential for scalable, privacy-preserving ASR systems by effectively merging heterogeneous language models in federated learning settings.
Abstract: Training automatic speech recognition (ASR) models increasingly relies on decentralized federated learning to ensure data privacy and accessibility, producing multiple local models that require effective merging. In hybrid ASR systems, while acoustic models can be merged using established methods, the language model (LM) for rescoring the N-best speech recognition list faces challenges due to the heterogeneity of non-neural n-gram models and neural network models. This paper proposes a heterogeneous LM optimization task and introduces a match-and-merge paradigm with two algorithms: the Genetic Match-and-Merge Algorithm (GMMA), using genetic operations to evolve and pair LMs, and the Reinforced Match-and-Merge Algorithm (RMMA), leveraging reinforcement learning for efficient convergence. Experiments on seven OpenSLR datasets show RMMA achieves the lowest average Character Error Rate and better generalization than baselines, converging up to seven times faster than GMMA, highlighting the paradigm’s potential for scalable, privacy-preserving ASR systems.
[48] LocalSUG: Geography-Aware LLM for Query Suggestion in Local-Life Services
Jinwen Chen, Shuai Gong, Shiwen Zhang, Zheng Zhang, Yachao Zhao, Lingxiang Wang, Haibo Zhou, Yuan Zhan, Wei Lin, Hainan Zhang
Main category: cs.CL
TL;DR: LocalSUG is an LLM-based query suggestion framework for local-life service platforms that addresses geographic grounding, exposure bias, and latency challenges through city-aware candidate mining, beam-search-driven GRPO, and optimization techniques.
Details
Motivation: Traditional multi-stage cascading systems in local-life platforms rely heavily on historical top queries, limiting their ability to address long-tail demand. While LLMs offer strong semantic generalization, deploying them in local-life services introduces three key challenges: lack of geographic grounding, exposure bias in preference optimization, and online inference latency.Method: 1) City-aware candidate mining based on term co-occurrence to inject geographic grounding; 2) Beam-search-driven GRPO algorithm to align training with inference-time decoding and reduce exposure bias; 3) Multi-objective reward mechanism optimizing both relevance and business metrics; 4) Quality-aware beam acceleration and vocabulary pruning techniques to reduce online latency.
Result: Extensive offline evaluations and large-scale online A/B testing show LocalSUG improves click-through rate (CTR) by +0.35% and reduces the low/no-result rate by 2.56%, validating its effectiveness in real-world deployment.
Conclusion: LocalSUG successfully addresses the key challenges of deploying LLMs in local-life service platforms, providing an effective framework for query suggestion that balances geographic relevance, user preferences, and practical deployment constraints.
Abstract: In local-life service platforms, the query suggestion module plays a crucial role in enhancing user experience by generating candidate queries based on user input prefixes, thus reducing user effort and accelerating search. Traditional multi-stage cascading systems rely heavily on historical top queries, limiting their ability to address long-tail demand. While LLMs offer strong semantic generalization, deploying them in local-life services introduces three key challenges: lack of geographic grounding, exposure bias in preference optimization, and online inference latency. To address these issues, we propose LocalSUG, an LLM-based query suggestion framework tailored for local-life service platforms. First, we introduce a city-aware candidate mining strategy based on term co-occurrence to inject geographic grounding into generation. Second, we propose a beam-search-driven GRPO algorithm that aligns training with inference-time decoding, reducing exposure bias in autoregressive generation. A multi-objective reward mechanism further optimizes both relevance and business-oriented metrics. Finally, we develop quality-aware beam acceleration and vocabulary pruning techniques that significantly reduce online latency while preserving generation quality. Extensive offline evaluations and large-scale online A/B testing demonstrate that LocalSUG improves click-through rate (CTR) by +0.35% and reduces the low/no-result rate by 2.56%, validating its effectiveness in real-world deployment.
[49] Replaying pre-training data improves fine-tuning
Suhas Kotha, Percy Liang
Main category: cs.CL
TL;DR: Generic data replay during fine-tuning improves target domain performance despite domain mismatch, increasing data efficiency by up to 2.06×
Details
Motivation: Current paradigm for domain-specific language models involves pre-training on generic web text then fine-tuning on limited target data, with generic data mixed during fine-tuning only to prevent catastrophic forgetting. The paper challenges this by investigating whether replaying generic data can actually improve target task performance.Method: Controlled experiments with 150M parameter models using 4M target tokens and 4B total tokens. Analyzes data schedules including fine-tuning and mid-training scenarios. Tests replay strategies where generic data is reintroduced during target domain adaptation. Also demonstrates practical application with 8B parameter models on agentic web navigation and Basque question-answering tasks.
Result: Generic replay increases target data efficiency by up to 1.87× for fine-tuning and 2.06× for mid-training. Replay helps more when there is less target data in pre-training. In practical applications, improves agentic web navigation success by 4.5% and Basque question-answering accuracy by 2%.
Conclusion: Contrary to conventional wisdom, replaying generic data during fine-tuning can improve performance on target tasks, especially when target data is limited. This suggests generic data provides beneficial regularization or complementary knowledge that enhances target domain adaptation.
Abstract: To obtain a language model for a target domain (e.g. math), the current paradigm is to pre-train on a vast amount of generic web text and then fine-tune on the relatively limited amount of target data. Typically, generic data is only mixed in during fine-tuning to prevent catastrophic forgetting of the generic domain. We surprisingly find that replaying the generic data during fine-tuning can actually improve performance on the (less related) target task. Concretely, in a controlled pre-training environment with 4M target tokens, 4B total tokens, and 150M parameter models, generic replay increases target data efficiency by up to $1.87\times$ for fine-tuning and $2.06\times$ for mid-training. We further analyze data schedules that introduce target data during pre-training and find that replay helps more when there is less target data present in pre-training. We demonstrate the success of replay in practice for fine-tuning 8B parameter models, improving agentic web navigation success by $4.5%$ and Basque question-answering accuracy by $2%$.
[50] When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger
Amirabbas Afzali, Myeongho Jeon, Maria Brbic
Main category: cs.CL
TL;DR: Weak LLMs can serve as effective annotators for preference alignment when using only their high-confidence samples, outperforming full human annotations through confidence-weighted training.
Details
Motivation: Preference alignment for LLMs typically requires costly human annotations or large API models. The paper explores whether weak LLMs can serve as cost-effective annotators instead.Method: Proposes Confidence-Weighted Preference Optimization (CW-PO), which re-weights training samples based on a weak LLM’s confidence scores. Only high-confidence samples from the weak LLM are selected for training.
Result: CW-PO with just 20% of human annotations outperforms standard DPO trained with 100% human annotations. Weak LLMs with confidence weighting dramatically reduce annotation costs while improving performance.
Conclusion: Weak LLMs paired with confidence weighting can effectively reduce preference alignment costs while outperforming methods using full human-labeled data, offering a practical alternative to expensive annotation processes.
Abstract: Preference alignment is an essential step in adapting large language models (LLMs) to human values, but existing approaches typically depend on costly human annotations or large-scale API-based models. We explore whether a weak LLM can instead act as an effective annotator. We surprisingly find that selecting only a subset of a weak LLM’s highly confident samples leads to substantially better performance than using full human annotations. Building on this insight, we propose Confidence-Weighted Preference Optimization (CW-PO), a general framework that re-weights training samples by a weak LLM’s confidence and can be applied across different preference optimization objectives. Notably, the model aligned by CW-PO with just 20% of human annotations outperforms the model trained with 100% of annotations under standard DPO. These results suggest that weak LLMs, when paired with confidence weighting, can dramatically reduce the cost of preference alignment while even outperforming methods trained on fully human-labeled data.
[51] MPCEval: A Benchmark for Multi-Party Conversation Generation
Minxing Zhang, Yi Yang, Zhuofan Jia, Xuan Yang, Jian Pei, Yuchen Zang, Xingwang Deng, Xianglong Chen
Main category: cs.CL
TL;DR: MPCEval is a comprehensive evaluation suite for multi-party conversation generation that decomposes quality into speaker modeling, content quality, and speaker-content consistency, with novel reference-free metrics.
Details
Motivation: Multi-party conversation generation is important for AI applications like smart reply and collaborative assistants, but existing evaluation methods are inadequate for the unique challenges of multi-party settings including complex turn-taking, role-dependent behavior, long-range structure, and multiple valid continuations.Method: MPCEval decomposes generation quality into three dimensions: speaker modeling (capturing individual speaker characteristics), content quality (coherence and progression), and speaker-content consistency (alignment between speaker identity and content). It distinguishes between local next-turn prediction and global full-conversation generation, and provides novel quantitative, reference-free, reproducible metrics that scale across datasets and models.
Result: When applied to diverse public and real-world datasets, MPCEval reveals systematic, dimension-specific model characteristics in participation balance, content progression and novelty, and speaker-content consistency. It shows that evaluation objectives critically shape model assessment and that single-score evaluation obscures fundamental differences in multi-party conversational behavior.
Conclusion: MPCEval provides a comprehensive, task-aware evaluation framework for multi-party conversation generation that enables nuanced assessment across multiple quality dimensions, addressing the critical bottleneck in evaluating this increasingly important AI capability.
Abstract: Multi-party conversation generation, such as smart reply and collaborative assistants, is an increasingly important capability of generative AI, yet its evaluation remains a critical bottleneck. Compared to two-party dialogue, multi-party settings introduce distinct challenges, including complex turn-taking, role-dependent speaker behavior, long-range conversational structure, and multiple equally valid continuations. Accordingly, we introduce MPCEval, a task-aware evaluation and benchmarking suite for multi-party conversation generation. MPCEval decomposes generation quality into speaker modeling, content quality, and speaker–content consistency, and explicitly distinguishes local next-turn prediction from global full-conversation generation. It provides novel, quantitative, reference-free, and reproducible metrics that scale across datasets and models. We apply MPCEval to diverse public and real-world datasets and evaluate modern generation methods alongside human-authored conversations. The results reveal systematic, dimension-specific model characteristics in participation balance, content progression and novelty, and speaker–content consistency, demonstrating that evaluation objectives critically shape model assessment and that single-score evaluation obscures fundamental differences in multi-party conversational behavior. The implementation of MPCEval and the associated evaluation code are publicly available at https://github.com/Owen-Yang-18/MPCEval.
[52] VRM: Teaching Reward Models to Understand Authentic Human Preferences
Biao Liu, Ning Xu, Junming Yang, Hao Xu, Xin Geng
Main category: cs.CL
TL;DR: VRM is a variational reward modeling framework that captures human preference evaluation by modeling both high-dimensional objective weights and low-dimensional semantic features as latent variables, addressing reward hacking in LLM alignment.
Details
Motivation: Current reward models for LLM alignment suffer from reward hacking by directly mapping prompt-response pairs to scalar scores, which captures spurious correlations rather than authentic human preferences. Human evaluation involves a sophisticated process of weighing multiple objectives and evaluating through semantic features.Method: Proposes VRM (Variational Reward Modeling) that explicitly models human preference evaluation by incorporating both high-dimensional objective weights and low-dimensional semantic features as latent variables, inferred through variational inference techniques.
Result: Extensive experiments on benchmark datasets demonstrate that VRM significantly outperforms existing methods in capturing authentic human preferences. Theoretical analysis shows VRM achieves tighter generalization error bound compared to traditional reward models.
Conclusion: VRM provides a more sophisticated approach to reward modeling that better captures the nuanced process of human preference evaluation, addressing limitations of current reward models in LLM alignment.
Abstract: Large Language Models (LLMs) have achieved remarkable success across diverse natural language tasks, yet the reward models employed for aligning LLMs often encounter challenges of reward hacking, where the approaches predominantly rely on directly mapping prompt-response pairs to scalar scores, which may inadvertently capture spurious correlations rather than authentic human preferences. In contrast, human evaluation employs a sophisticated process that initially weighs the relative importance of multiple high-dimensional objectives according to the prompt context, subsequently evaluating response quality through low-dimensional semantic features such as logical coherence and contextual appropriateness. Motivated by this consideration, we propose VRM, i.e., Variational Reward Modeling, a novel framework that explicitly models the evaluation process of human preference judgments by incorporating both high-dimensional objective weights and low-dimensional semantic features as latent variables, which are inferred through variational inference techniques. Additionally, we provide a theoretical analysis showing that VRM can achieve a tighter generalization error bound compared to the traditional reward model. Extensive experiments on benchmark datasets demonstrate that VRM significantly outperforms existing methods in capturing authentic human preferences.
[53] ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts
Trapoom Ukarapol, Nut Chukamphaeng, Kunat Pipatanakul, Pakhapoom Sarapat
Main category: cs.CL
TL;DR: ThaiSafetyBench: A safety evaluation benchmark for LLMs in Thai language with culturally contextualized attacks, revealing vulnerabilities in current safety alignment methods for non-English languages.
Details
Motivation: Current LLM safety evaluation is largely English-centric, leaving non-English languages and culturally grounded risks underexplored, particularly for Thai language and culture.Method: Created ThaiSafetyBench with 1,954 malicious Thai prompts covering general and culturally contextualized attacks. Evaluated 24 LLMs using GPT-4.1 and Gemini-2.5-Pro as judges. Fine-tuned DeBERTa-based ThaiSafetyClassifier for cost-effective evaluation.
Result: Closed-source models outperform open-source ones in safety. Culturally contextualized Thai attacks have higher Attack Success Rate than general Thai attacks. ThaiSafetyClassifier achieves 84.4% weighted F1 score matching GPT-4.1 judgments.
Conclusion: Current safety alignment methods have critical vulnerabilities for non-English languages and cultural contexts. The benchmark and classifier provide tools for better safety evaluation in Thai and similar languages.
Abstract: The safety evaluation of large language models (LLMs) remains largely centered on English, leaving non-English languages and culturally grounded risks underexplored. In this work, we investigate LLM safety in the context of the Thai language and culture and introduce ThaiSafetyBench, an open-source benchmark comprising 1,954 malicious prompts written in Thai. The dataset covers both general harmful prompts and attacks that are explicitly grounded in Thai cultural, social, and contextual nuances. Using ThaiSafetyBench, we evaluate 24 LLMs, with GPT-4.1 and Gemini-2.5-Pro serving as LLM-as-a-judge evaluators. Our results show that closed-source models generally demonstrate stronger safety performance than open-source counterparts, raising important concerns regarding the robustness of openly available models. Moreover, we observe a consistently higher Attack Success Rate (ASR) for Thai-specific, culturally contextualized attacks compared to general Thai-language attacks, highlighting a critical vulnerability in current safety alignment methods. To improve reproducibility and cost efficiency, we further fine-tune a DeBERTa-based harmful response classifier, which we name ThaiSafetyClassifier. The model achieves a weighted F1 score of 84.4%, matching GPT-4.1 judgments. We publicly release the fine-tuning weights and training scripts to support reproducibility. Finally, we introduce the ThaiSafetyBench leaderboard to provide continuously updated safety evaluations and encourage community participation.
- ThaiSafetyBench HuggingFace Dataset: https://huggingface.co/datasets/typhoon-ai/ThaiSafetyBench
- ThaiSafetyBench Github: https://github.com/trapoom555/ThaiSafetyBench
- ThaiSafetyClassifier HuggingFace Model: https://huggingface.co/typhoon-ai/ThaiSafetyClassifier
- ThaiSafetyBench Leaderboard: https://huggingface.co/spaces/typhoon-ai/ThaiSafetyBench-Leaderboard
[54] HiFlow: Hierarchical Feedback-Driven Optimization for Constrained Long-Form Text Generation
Yifan Zhu, Guanting Chen, Bing Wei, Haoran Luo
Main category: cs.CL
TL;DR: HiFlow: Hierarchical feedback-driven optimization framework for constrained long text generation using two-level optimization with planning and generation layers
Details
Motivation: Large language models struggle with long text generation under complex constraints involving multiple tightly coupled objectives like global structural consistency, local semantic coherence, and constraint feasibility. Existing approaches rely on static planning or offline supervision, limiting effective coordination between global and local objectives during generation.Method: HiFlow formulates generation as a two-level optimization process: 1) planning layer for global structure and constraint modeling, and 2) generation layer for conditioned text generation. It incorporates constraint-aware plan screening and closed-loop feedback at both levels to enable joint optimization of planning quality and generation behavior.
Result: Experiments on multiple backbones confirm HiFlow’s effectiveness over baseline methods for constrained long text generation tasks.
Conclusion: HiFlow provides a hierarchical feedback-driven optimization framework that progressively guides models toward high-quality, constraint-satisfying outputs for constrained long text generation.
Abstract: Large language models perform well in short text generation but still struggle with long text generation, particularly under complex constraints. Such tasks involve multiple tightly coupled objectives, including global structural consistency, local semantic coherence, and constraint feasibility, forming a challenging constrained optimization problem. Existing approaches mainly rely on static planning or offline supervision, limiting effective coordination between global and local objectives during generation. To address these challenges, we propose HiFlow, a hierarchical feedback-driven optimization framework for constrained long text generation. HiFlow formulates generation as a two-level optimization process, consisting of a planning layer for global structure and constraint modeling, and a generation layer for conditioned text generation. By incorporating constraint-aware plan screening and closed-loop feedback at both levels, HiFlow enables joint optimization of planning quality and generation behavior, progressively guiding the model toward high-quality, constraint-satisfying outputs. Experiments on multiple backbones confirm HiFlow’s effectiveness over baseline methods.
[55] NeuronMoE: Neuron-Guided Mixture-of-Experts for Efficient Multilingual LLM Extension
Rongzhi Li, Hitomi Yanaka
Main category: cs.CL
TL;DR: NeuronMoE: A method that analyzes language-specific neuron specialization patterns to guide expert allocation in multilingual MoE models, achieving 40% parameter reduction while maintaining performance for low-resource languages.
Details
Motivation: Extending LLMs to low-resource languages is expensive with separate models per language. MoE architectures help but current approaches allocate experts based on layer-level similarity, ignoring fine-grained neuron-level specialization patterns in language processing.Method: Proposes NeuronMoE, which analyzes language-specific neurons across all transformer components to guide expert allocation per layer based on empirically measured cross-lingual neuron diversity. Applied to Llama-3.2-3B for Greek, Turkish, and Hungarian languages.
Result: Achieves approximately 40% average parameter reduction while matching performance of LayerMoE baseline. Found that low-resource language experts independently develop neuron specialization patterns mirroring high-resource language, concentrated in early and late layers.
Conclusion: Reveals potential universal architectural principles in how multilingual models organize linguistic knowledge. Neuron-level analysis provides more efficient expert allocation than layer-level approaches for multilingual MoE models.
Abstract: Extending large language models to low-resource languages is essential for global accessibility, but training separate models per language is prohibitively expensive. Mixture-of-Experts (MoE) architectures address this by adding sparse language-specific parameters, but determining how many experts each layer needs remains an open question. Current approaches allocate experts based on layer-level similarity, yet language processing exhibits fine-grained specialization at individual neurons. We propose $\textbf{NeuronMoE}$, a method that analyzes language-specific neurons across all transformer components to guide expert allocation per layer based on empirically measured cross-lingual neuron diversity. Applied to Llama-3.2-3B for low-resource languages (Greek, Turkish, and Hungarian), this approach achieves approximately 40% average parameter reduction while matching the performance of the LayerMoE baseline. We find that low-resource language experts independently develop neuron specialization patterns mirroring the high-resource language, which are concentrated in early and late layers. This reveals potential universal architectural principles in how multilingual models organize linguistic knowledge.
[56] MUTEX: Leveraging Multilingual Transformers and Conditional Random Fields for Enhanced Urdu Toxic Span Detection
Inayat Arshad, Fajar Saleem, Ijaz Hussain
Main category: cs.CL
TL;DR: MUTEX: A multilingual transformer (XLM RoBERTa) with CRF layer for Urdu toxic span detection, achieving 60% token-level F1 score on multi-domain social media data.
Details
Motivation: Urdu toxic span detection is limited by sentence-level classification approaches that fail to identify specific toxic spans, exacerbated by lack of token-level annotated resources, linguistic complexity, code-switching, informal expressions, and morphological variations.Method: Proposes MUTEX framework using XLM RoBERTa transformer with CRF layer for sequence labeling, trained on manually annotated token-level toxic span dataset from multi-domain sources (social media, online news, YouTube reviews).
Result: Achieves 60% token-level F1 score, establishing the first supervised baseline for Urdu toxic span detection. Transformer-based models effectively capture contextual toxicity and handle code-switching and morphological variation better than other models.
Conclusion: MUTEX provides an effective framework for fine-grained toxic span detection in Urdu, demonstrating transformer-based models’ superiority in handling linguistic complexities of low-resource languages with code-switching and morphological variations.
Abstract: Urdu toxic span detection remains limited because most existing systems rely on sentence-level classification and fail to identify the specific toxic spans within those text. It is further exacerbated by the multiple factors i.e. lack of token-level annotated resources, linguistic complexity of Urdu, frequent code-switching, informal expressions, and rich morphological variations. In this research, we propose MUTEX: a multilingual transformer combined with conditional random fields (CRF) for Urdu toxic span detection framework that uses manually annotated token-level toxic span dataset to improve performance and interpretability. MUTEX uses XLM RoBERTa with CRF layer to perform sequence labeling and is tested on multi-domain data extracted from social media, online news, and YouTube reviews using token-level F1 to evaluate fine-grained span detection. The results indicate that MUTEX achieves 60% token-level F1 score that is the first supervised baseline for Urdu toxic span detection. Further examination reveals that transformer-based models are more effective at implicitly capturing the contextual toxicity and are able to address the issues of code-switching and morphological variation than other models.
[57] ARC-TGI: Human-Validated Task Generators with Reasoning Chain Templates for ARC-AGI
Jens Lehmann, Syeda Khushbakht, Nikoo Salehfard, Nur A Zarin Nishat, Dhananjay Bhandiwad, Andrei Aioanei, Sahar Vahdati
Main category: cs.CL
TL;DR: ARC-TGI is a framework for generating diverse visual reasoning tasks similar to ARC-AGI puzzles through Python task-family generators that preserve latent rules while enabling controlled benchmarking.
Details
Motivation: To address limitations of static ARC-AGI puzzle collections (overfitting, dataset leakage, memorization) by creating a scalable framework for generating diverse visual reasoning tasks with controlled variations.Method: Develops ARC-TGI as an open-source framework with task-family generators - compact Python programs that sample diverse ARC-AGI tasks while preserving latent rules. Each generated task includes natural-language reasoning chains and partially evaluated Python code for sampling, transformation, and episode construction.
Result: Released 461 generators covering 180 ARC-Mini tasks, 215 ARC-AGI-1 tasks (200 train, 15 test), and 66 ARC-AGI-2 tasks (55 train, 11 test), enabling scalable dataset sampling and controlled benchmarking.
Conclusion: ARC-TGI provides a scalable solution for generating diverse visual reasoning tasks with controlled variations, addressing limitations of static puzzle collections and enabling better evaluation of abstraction and reasoning capabilities.
Abstract: The Abstraction and Reasoning Corpus (ARC-AGI) probes few-shot abstraction and rule induction on small visual grids, but progress is difficult to measure on static collections of hand-authored puzzles due to overfitting, dataset leakage, and memorisation. We introduce ARC-TGI (ARC Task Generators Inventory), an open-source framework for task-family generators: compact Python programs that sample diverse ARC-AGI tasks while preserving a latent rule. ARC-TGI is built around a solver-facing representation: each generated task is paired with natural-language input and transformation reasoning chains and partially evaluated Python code implementing sampling, transformation, and episode construction. Crucially, ARC-TGI supports task-level constraints so that training examples collectively expose the variations needed to infer the underlying rule, a requirement for human-solvable ARC tasks that independent per-example sampling often fails to guarantee. All generators undergo human refinement and local verification to keep both grids and reasoning traces natural and consistent under variation. We release 461 generators covering 180 ARC-Mini tasks, 215 ARC-AGI-1 tasks (200 train, 15 test), and 66 ARC-AGI-2 tasks (55 train, 11 test), enabling scalable dataset sampling and controlled benchmarking.
[58] Measuring the Redundancy of Decoder Layers in SpeechLLMs
Adel Moumen, Guangzhi Sun, Philip C Woodland
Main category: cs.CL
TL;DR: Speech Large Language Models show significant decoder redundancy inherited from pretrained LLMs, with 7-8B models maintaining good ASR performance with only 60% of decoder layers, and similar redundancy patterns across speech encoders, tasks, and languages.
Details
Motivation: Speech LLMs route speech encoder representations into LLM decoders that account for over 90% of parameters, but it's unclear how much decoder capacity is actually needed for speech tasks, suggesting potential for more efficient models.Method: Study decoder redundancy across two LLM families and three scales (1-8B) by pruning decoder layers and analyzing post-pruning healing for robustness. Measure excess capacity and generalize to speech translation tasks.
Result: Decoder redundancy is largely inherited from pretrained LLMs, with text and speech inputs yielding similar redundant blocks. 7-8B models retain good ASR performance with only 60% of decoder layers, with smaller scales showing reduced pruning tolerance. Same blocks of layers are redundant across speech encoders, tasks, and languages.
Conclusion: A global redundancy structure exists in SpeechLLMs, enabling deployment of a single pruned multi-task backbone across different speech encoders, tasks, and languages, suggesting significant potential for efficiency improvements.
Abstract: Speech Large Language Models route speech encoder representations into an LLM decoder that typically accounts for over 90% of total parameters. We study how much of this decoder capacity is actually needed for speech tasks. Across two LLM families and three scales (1-8B), we show that decoder redundancy is largely inherited from the pretrained LLM: text and speech inputs yield similar redundant blocks. We then measure excess capacity by pruning decoder layers and analysing post-pruning healing to increase robustness. Our findings show that 7-8B models retain good ASR performance with only 60% of decoder layers, and the same trend extends to smaller scales with reduced pruning tolerance. We then generalise to speech translation, and show that the same blocks of layers are redundant across speech encoders, tasks and languages, indicating that a more global redundancy structure exists, enabling a single pruned and multi-tasks SpeechLLM backbone to be deployed.
[59] LBM: Hierarchical Large Auto-Bidding Model via Reasoning and Acting
Yewen Li, Zhiyi Lyu, Peng Jiang, Qingpeng Cai, Fei Pan, Bo An, Peng Jiang
Main category: cs.CL
TL;DR: Proposes LBM, a hierarchical auto-bidding model using LLMs for reasoning and action generation in online advertising auctions, addressing black-box limitations of current methods.
Details
Motivation: Current auto-bidding methods using offline RL/generative approaches have black-box limitations, poor generalization, and lack interpretability in dynamic ad environments. LLMs offer reasoning capabilities but struggle with precise auction actions and domain knowledge.Method: Hierarchical LBM with LBM-Think (reasoning) and LBM-Act (action generation). Uses dual embedding to fuse language and numerical inputs for language-guided training of LBM-Act, and GQPO offline reinforcement fine-tuning to mitigate LLM hallucinations without simulation rollouts.
Result: Experiments show superiority of generative backbone based on LBM, especially in efficient training and generalization ability compared to previous methods.
Conclusion: LBM effectively leverages LLM reasoning for auto-bidding, addressing black-box limitations and improving performance through hierarchical architecture and specialized training techniques.
Abstract: The growing scale of ad auctions on online advertising platforms has intensified competition, making manual bidding impractical and necessitating auto-bidding to help advertisers achieve their economic goals. Current auto-bidding methods have evolved to use offline reinforcement learning or generative methods to optimize bidding strategies, but they can sometimes behave counterintuitively due to the black-box training manner and limited mode coverage of datasets, leading to challenges in understanding task status and generalization in dynamic ad environments. Large language models (LLMs) offer a promising solution by leveraging prior human knowledge and reasoning abilities to improve auto-bidding performance. However, directly applying LLMs to auto-bidding faces difficulties due to the need for precise actions in competitive auctions and the lack of specialized auto-bidding knowledge, which can lead to hallucinations and suboptimal decisions. To address these challenges, we propose a hierarchical Large autoBidding Model (LBM) to leverage the reasoning capabilities of LLMs for developing a superior auto-bidding strategy. This includes a high-level LBM-Think model for reasoning and a low-level LBM-Act model for action generation. Specifically, we propose a dual embedding mechanism to efficiently fuse two modalities, including language and numerical inputs, for language-guided training of the LBM-Act; then, we propose an offline reinforcement fine-tuning technique termed GQPO for mitigating the LLM-Think’s hallucinations and enhancing decision-making performance without simulation or real-world rollout like previous multi-turn LLM-based methods. Experiments demonstrate the superiority of a generative backbone based on our LBM, especially in an efficient training manner and generalization ability.
[60] Representation Fidelity:Auditing Algorithmic Decisions About Humans Using Self-Descriptions
Theresa Elstner, Martin Potthast
Main category: cs.CL
TL;DR: Proposes Representation Fidelity as a new dimension for validating algorithmic decisions about humans by measuring the distance between externally prescribed input representations and self-descriptions provided by human subjects.
Details
Motivation: Current algorithmic decision-making systems lack methods to validate whether decisions about humans rest on reasonable grounds. The paper aims to address this gap by introducing a framework to measure the fidelity of representations used in algorithmic decisions.Method: Operationalizes Representation Fidelity by measuring distance between two representations: (1) externally prescribed input representation used for decision-making, and (2) self-description provided by the human subject. Examines discrepancies, develops quantification methods, and creates a typology of representation mismatches. Presents a benchmark using a loan-granting dataset with synthetic natural language self-descriptions.
Result: Introduces the Loan-Granting Self-Representations Corpus 2025 containing 30,000 synthetic natural language self-descriptions derived from the German Credit Dataset, along with expert annotations of representation mismatches between each pair of representations.
Conclusion: Representation Fidelity provides a novel dimension for validating algorithmic decisions about humans, offering a systematic approach to assess whether decisions rest on reasonable grounds through comparison of external and self-representations.
Abstract: This paper introduces a new dimension for validating algorithmic decisions about humans by measuring the fidelity of their representations. Representation Fidelity measures if decisions about a person rest on reasonable grounds. We propose to operationalize this notion by measuring the distance between two representations of the same person: (1) an externally prescribed input representation on which the decision is based, and (2) a self-description provided by the human subject of the decision, used solely to validate the input representation. We examine the nature of discrepancies between these representations, how such discrepancies can be quantified, and derive a generic typology of representation mismatches that determine the degree of representation fidelity. We further present the first benchmark for evaluating representation fidelity based on a dataset of loan-granting decisions. Our Loan-Granting Self-Representations Corpus 2025 consists of a large corpus of 30 000 synthetic natural language self-descriptions derived from corresponding representations of applicants in the German Credit Dataset, along with expert annotations of representation mismatches between each pair of representations.
[61] Feature Resemblance: On the Theoretical Understanding of Analogical Reasoning in Transformers
Ruichen Xu, Wenjing Yan, Ying-Jun Angela Zhang
Main category: cs.CL
TL;DR: Transformers learn analogical reasoning through aligned representations when jointly trained on similarity and attribution premises, with sequential training requiring specific curriculum and two-hop reasoning reducing to analogical reasoning with identity bridges.
Details
Motivation: To understand reasoning in large language models by isolating analogical reasoning (inferring shared properties between entities based on known similarities) and analyzing its emergence in transformers, moving beyond evaluations that conflate multiple reasoning types.Method: Theoretical analysis of analogical reasoning in transformers with three key proofs: (1) joint training on similarity and attribution premises enables analogical reasoning through aligned representations, (2) sequential training requires specific curriculum order, and (3) two-hop reasoning reduces to analogical reasoning with identity bridges. Experimental validation with architectures up to 1.5B parameters.
Result: Transformers encode entities with similar properties into similar representations, enabling property transfer through feature alignment. Experiments validate the theory and demonstrate how representational geometry shapes inductive reasoning capabilities.
Conclusion: Analogous reasoning in transformers emerges through a unified mechanism of aligned representations, with specific training requirements for sequential learning and the necessity of explicit identity bridges for two-hop reasoning.
Abstract: Understanding reasoning in large language models is complicated by evaluations that conflate multiple reasoning types. We isolate analogical reasoning (inferring shared properties between entities based on known similarities) and analyze its emergence in transformers. We theoretically prove three key results: (1) Joint training on similarity and attribution premises enables analogical reasoning through aligned representations; (2) Sequential training succeeds only when similarity structure is learned before specific attributes, revealing a necessary curriculum; (3) Two-hop reasoning ($a \to b, b \to c \implies a \to c$) reduces to analogical reasoning with identity bridges ($b = b$), which must appear explicitly in training data. These results reveal a unified mechanism: transformers encode entities with similar properties into similar representations, enabling property transfer through feature alignment. Experiments with architectures up to 1.5B parameters validate our theory and demonstrate how representational geometry shapes inductive reasoning capabilities.
[62] C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning
Avni Mittal, Rauno Arike
Main category: cs.CL
TL;DR: C2-Faith benchmark evaluates LLMs’ ability to assess faithfulness in chain-of-thought reasoning, focusing on causality and coverage dimensions, revealing limitations in error detection vs. localization.
Details
Motivation: LLMs are increasingly used as judges of chain-of-thought reasoning, but it's unclear if they can reliably assess process faithfulness (causality and coverage) rather than just answer plausibility.Method: Created C2-Faith benchmark from PRM800K targeting two faithfulness dimensions: causality (logical flow) and coverage (essential inferences). Used controlled perturbations to create examples with known causal error positions and coverage deletions at varying rates.
Result: Model rankings depend strongly on task framing with no single judge dominating; substantial gap between detecting errors and localizing them; coverage judgments systematically inflated for incomplete reasoning.
Conclusion: LLM judges have limitations in process-level evaluation - they’re not equally reliable across different faithfulness assessment tasks, highlighting the need for careful judge selection based on specific evaluation needs.
Abstract: Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, but it remains unclear whether they can reliably assess process faithfulness rather than just answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that targets two complementary dimensions of faithfulness: causality (does each step logically follow from prior context?) and coverage (are essential intermediate inferences present?). Using controlled perturbations, we create examples with known causal error positions by replacing a single step with an acausal variant, and with controlled coverage deletions at varying deletion rates (scored against reference labels). We evaluate three frontier judges under three tasks: binary causal detection, causal step localization, and coverage scoring. The results show that model rankings depend strongly on task framing, with no single judge dominating all settings; all judges exhibit a substantial gap between detecting an error and localizing it; and coverage judgments are systematically inflated for incomplete reasoning. These findings clarify when LLM judges are dependable and where they fail, and provide practical guidance for selecting judges in process-level evaluation
[63] Sparse-BitNet: 1.58-bit LLMs are Naturally Friendly to Semi-Structured Sparsity
Di Zhang, Xun Wu, Shaohan Huang, Yudong Wang, Hanyong Shao, Yingbo Hao, Zewen Chi, Li Dong, Ting Song, Yan Xia, Zhifang Sui, Furu Wei
Main category: cs.CL
TL;DR: Sparse-BitNet combines 1.58-bit quantization with N:M sparsity for efficient LLMs, showing better compatibility than full-precision models and achieving up to 1.30X speedups.
Details
Motivation: Semi-structured N:M sparsity and low-bit quantization are both promising efficiency techniques for LLMs, but have been studied separately. The paper investigates their interaction and compatibility.Method: Proposes Sparse-BitNet, a unified framework that jointly applies 1.58-bit quantization (BitNet) and dynamic N:M sparsification with stable training. Uses custom sparse tensor cores for acceleration.
Result: 1.58-bit BitNet shows smaller performance degradation than full-precision models at same sparsity levels, tolerates higher structured sparsity before collapse, and achieves up to 1.30X speedups in training/inference.
Conclusion: Combining extremely low-bit quantization with semi-structured N:M sparsity is a promising direction for efficient LLMs, with BitNet showing natural compatibility with sparsity.
Abstract: Semi-structured N:M sparsity and low-bit quantization (e.g., 1.58-bit BitNet) are two promising approaches for improving the efficiency of large language models (LLMs), yet they have largely been studied in isolation. In this work, we investigate their interaction and show that 1.58-bit BitNet is naturally more compatible with N:M sparsity than full-precision models. To study this effect, we propose Sparse-BitNet, a unified framework that jointly applies 1.58-bit quantization and dynamic N:M sparsification while ensuring stable training for the first time. Across multiple model scales and training regimes (sparse pretraining and dense-to-sparse schedules), 1.58-bit BitNet consistently exhibits smaller performance degradation than full-precision baselines at the same sparsity levels and can tolerate higher structured sparsity before accuracy collapse. Moreover, using our custom sparse tensor core, Sparse-BitNet achieves substantial speedups in both training and inference, reaching up to 1.30X. These results highlight that combining extremely low-bit quantization with semi-structured N:M sparsity is a promising direction for efficient LLMs. Code available at https://github.com/AAzdi/Sparse-BitNet
[64] Guidelines for the Annotation and Visualization of Legal Argumentation Structures in Chinese Judicial Decisions
Kun Chen, Xianglei Liao, Kaixue Fei, Yi Xing, Xinrui Li
Main category: cs.CL
TL;DR: A systematic annotation framework for representing legal argumentation structure in judicial decisions, with proposition types and argumentative relations for computational analysis.
Details
Motivation: To create a reliable data foundation for computational analysis of judicial reasoning by providing a systematic framework that reveals the logical organization of legal argumentation in court decisions.Method: Proposes a two-level annotation framework: at proposition level (4 types: general/specific normative and factual propositions) and relational level (5 types: support, attack, joint, match, identity relations). Includes formal representation rules, visualization conventions, standardized workflow, and consistency control mechanisms.
Result: A comprehensive guideline enabling consistent graphical representation of complex argumentation patterns and establishing reproducible annotation procedures for large-scale analysis of judicial reasoning.
Conclusion: The framework provides methodological support for legal argument mining, computational modeling of legal reasoning, and AI-assisted legal analysis through clear conceptual models and practical annotation procedures.
Abstract: This guideline proposes a systematic and operational annotation framework for representing the structure of legal argumentation in judicial decisions. Grounded in theories of legal reasoning and argumentation, the framework aims to reveal the logical organization of judicial reasoning and to provide a reliable data foundation for computational analysis. At the proposition level, the guideline distinguishes four types of propositions: general normative propositions, specific normative propositions, general factual propositions, and specific factual propositions. At the relational level, five types of relations are defined to capture argumentative structures: support, attack, joint, match, and identity. These relations represent positive and negative argumentative connections, conjunctive reasoning structures, the correspondence between legal norms and case facts, and semantic equivalence between propositions. The guideline further specifies formal representation rules and visualization conventions for both basic and nested structures, enabling consistent graphical representation of complex argumentation patterns. In addition, it establishes a standardized annotation workflow and consistency control mechanisms to ensure reproducibility and reliability of the annotated data. By providing a clear conceptual model, formal representation rules, and practical annotation procedures, this guideline offers methodological support for large-scale analysis of judicial reasoning and for future research in legal argument mining, computational modeling of legal reasoning, and AI-assisted legal analysis.
[65] Transducing Language Models
Vésteinn Snæbjarnarson, Samuel Kiegeland, Tianyu Liu, Reda Boumasmoud, Ryan Cotterell, Tim Vieira
Main category: cs.CL
TL;DR: A framework for creating new language models via deterministic string-to-string transformations using finite-state transducers, enabling inference-time adaptation of pretrained models to different output formats.
Details
Motivation: Language models produce distributions over strings, but downstream tasks often require different output formats (e.g., tokens to bytes, DNA to amino acids). While deterministic transformations can convert outputs, prior work doesn't treat these as fully functional new language models.Method: Formalizes language models derived from deterministic string-to-string transformations, focusing on finite-state transducers (FSTs). Develops algorithms to compose language models with FSTs to marginalize over source strings mapping to a given target, propagating probabilities through transducers without altering model parameters.
Result: Presents exact and approximate algorithms with theoretical analysis. Experiments in three domains: token-to-byte, token-to-word, and DNA-to-amino-acid conversions demonstrate inference-time adaptation of pretrained language models to application-specific output requirements.
Conclusion: Provides a general framework for creating new language models via deterministic transformations, enabling flexible adaptation of pretrained models to diverse output formats without retraining.
Abstract: Modern language models define distributions over strings, but downstream tasks often require different output formats. For instance, a model that generates byte-pair strings does not directly produce word-level predictions, and a DNA model does not directly produce amino-acid sequences. In such cases, a deterministic string-to-string transformation can convert the model’s output to the desired form. This is a familiar pattern in probability theory: applying a function $f$ to a random variable $X\sim p$ yields a transformed random variable $f(X)$ with an induced distribution. While such transformations are occasionally used in language modeling, prior work does not treat them as yielding new, fully functional language models. We formalize this perspective and introduce a general framework for language models derived from deterministic string-to-string transformations. We focus on transformations representable as finite-state transducers – a commonly used state-machine abstraction for efficient string-to-string mappings. We develop algorithms that compose a language model with an FST to marginalize over source strings mapping to a given target, propagating probabilities through the transducer without altering model parameters and enabling conditioning on transformed outputs. We present an exact algorithm, an efficient approximation, and a theoretical analysis. We conduct experiments in three domains: converting language models from tokens to bytes, from tokens to words, and from DNA to amino acids. These experiments demonstrate inference-time adaptation of pretrained language models to match application-specific output requirements.
[66] Diffusion LLMs can think EoS-by-EoS
Sarah Breckner, Sebastian Schuster
Main category: cs.CL
TL;DR: Diffusion LLMs use end-of-sequence (EoS) tokens as hidden scratchpads for complex reasoning, improving performance when generation length exceeds required answer length.
Details
Motivation: Diffusion LLMs outperform autoregressive models on complex reasoning tasks with interdependent sub-goals, especially when generating more tokens than needed. The researchers hypothesize that diffusion models use EoS token representations as hidden scratchpads for reasoning.Method: Tested diffusion models (LLaDA1.5, LLaDA2.0-mini, Dream-v0) on Addition, Entity Tracking, and Sudoku tasks. Conducted controlled prompting experiments adding EoS tokens, and performed causal interventions by patching hidden states of EoS tokens with counterfactual generations.
Result: Adding EoS tokens improved reasoning capabilities. Patching EoS token hidden states frequently changed outputs to counterfactual results, showing EoS tokens carry meaningful information about the problem being solved.
Conclusion: Diffusion LLMs use EoS tokens as computational scratchpads for reasoning (“think EoS-by-EoS”), with EoS token representations containing meaningful information rather than being devoid of meaning.
Abstract: Diffusion LLMs have been proposed as an alternative to autoregressive LLMs, excelling especially at complex reasoning tasks with interdependent sub-goals. Curiously, this is particularly true if the generation length, i.e., the number of tokens the model has to output, is set to a much higher value than is required for providing the correct answer to the task, and the model pads its answer with end-of-sequence (EoS) tokens. We hypothesize that diffusion models think EoS-by-EoS, that is, they use the representations of EoS tokens as a hidden scratchpad, which allows them to solve harder reasoning problems. We experiment with the diffusion models LLaDA1.5, LLaDA2.0-mini, and Dream-v0 on the tasks Addition, Entity Tracking, and Sudoku. In a controlled prompting experiment, we confirm that adding EoS tokens improves the LLMs’ reasoning capabilities. To further verify whether they serve as space for hidden computations, we patch the hidden states of the EoS tokens with those of a counterfactual generation, which frequently changes the generated output to the counterfactual. The success of the causal intervention underscores that the EoS tokens, which one may expect to be devoid of meaning, carry information on the problem to solve. The behavioral experiments and the causal interventions indicate that diffusion LLMs can indeed think EoS-by-EoS.
[67] Distilling Formal Logic into Neural Spaces: A Kernel Alignment Approach for Signal Temporal Logic
Sara Candussio, Gabriele Sarti, Gaia Saveri, Luca Bortolussi
Main category: cs.CL
TL;DR: A framework for learning continuous neural representations of formal specifications by distilling symbolic robustness kernels into Transformer encoders, enabling efficient neuro-symbolic reasoning.
Details
Motivation: Existing approaches for formal specification representations have limitations: symbolic kernels preserve behavioral semantics but are computationally expensive and non-invertible, while syntax-based neural embeddings fail to capture underlying semantic structures. There's a need for a method that bridges this gap.Method: Uses a teacher-student setup to distill a symbolic robustness kernel into a Transformer encoder. Supervises the model with a continuous, kernel-weighted geometric alignment objective that penalizes errors proportionally to their semantic discrepancies. The trained encoder produces embeddings in a single forward pass, mimicking the kernel’s logic at reduced computational cost.
Result: The neural representations faithfully preserve semantic similarity of STL formulae, accurately predict robustness and constraint satisfaction, and remain intrinsically invertible. Enables highly efficient, scalable neuro-symbolic reasoning and formula reconstruction without repeated kernel computation.
Conclusion: The framework successfully bridges the gap between symbolic and neural approaches, providing efficient continuous representations of formal specifications that preserve semantic structure while being computationally efficient and invertible.
Abstract: We introduce a framework for learning continuous neural representations of formal specifications by distilling the geometry of their semantics into a latent space. Existing approaches rely either on symbolic kernels – which preserve behavioural semantics but are computationally prohibitive, anchor-dependent, and non-invertible – or on syntax-based neural embeddings that fail to capture underlying structures. Our method bridges this gap: using a teacher-student setup, we distill a symbolic robustness kernel into a Transformer encoder. Unlike standard contrastive methods, we supervise the model with a continuous, kernel-weighted geometric alignment objective that penalizes errors in proportion to their semantic discrepancies. Once trained, the encoder produces embeddings in a single forward pass, effectively mimicking the kernel’s logic at a fraction of its computational cost. We apply our framework to Signal Temporal Logic (STL), demonstrating that the resulting neural representations faithfully preserve the semantic similarity of STL formulae, accurately predict robustness and constraint satisfaction, and remain intrinsically invertible. Our proposed approach enables highly efficient, scalable neuro-symbolic reasoning and formula reconstruction without repeated kernel computation at runtime.
[68] Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding
Ofir Ben Shoham
Main category: cs.CL
TL;DR: Speculative decoding acceleration technique using vocabulary-trimmed draft models to reduce latency while maintaining token coverage for domain-specific workloads.
Details
Motivation: Address the trade-off in draft model design where larger vocabularies improve token coverage but increase latency, while smaller vocabularies reduce latency but risk missing important tokens. Domain-specific workloads use only a small fraction of the full vocabulary.Method: Vocabulary trimming for draft models cast as constrained optimization problem balancing token coverage and draft latency. Coverage computed over assistant responses in training data, latency estimated using architecture-aware FLOPs. Optimized with Tree-structured Parzen Estimator to explore coverage-latency Pareto frontier under minimum coverage constraint.
Result: Improved speculative decoding throughput while reducing draft vocabularies by up to 97% with high coverage. On domain-specific tasks: up to 16% latency reduction, 20% throughput improvement. On diverse out-of-distribution tasks: up to 6.7% throughput gains.
Conclusion: Vocabulary trimming for draft models effectively addresses the coverage-latency trade-off in speculative decoding, enabling significant performance improvements for both domain-specific and general tasks.
Abstract: Speculative decoding accelerates inference for Large Language Models by using a lightweight draft model to propose candidate tokens that are verified in parallel by a larger target model. Prior work shows that the draft model often dominates speculative decoding latency, since it generates tokens sequentially and incurs high cost from its language modeling head as vocabulary size grows. This exposes a fundamental trade-off in draft model design: larger vocabularies improve token coverage and agreement with the target model, but incur higher draft latency, while smaller vocabularies reduce latency at the risk of missing tokens required for accurate draft generation. We address this trade-off through vocabulary trimming for draft models, motivated by the observation that domain-specific workloads use only a small fraction of the full vocabulary. We cast draft vocabulary selection as a constrained optimization problem that balances token coverage and draft latency. Coverage is computed over assistant responses in the training data, while latency is estimated using architecture-aware FLOPs that capture the cost of the language modeling head as a function of vocabulary size. We optimize a utility function with a Tree-structured Parzen Estimator to efficiently explore the coverage-latency Pareto frontier under a minimum coverage constraint. Experiments show improved speculative decoding throughput while reducing draft vocabularies by up to 97% with high coverage. On domain-specific tasks, we achieve up to 16% latency reduction and 20% throughput improvement, and up to 6.7% throughput gains on diverse out-of-distribution tasks.
[69] VietJobs: A Vietnamese Job Advertisement Dataset
Hieu Pham Dinh, Hung Nguyen Huy, Mo El-Haj
Main category: cs.CL
TL;DR: VietJobs is the first large-scale Vietnamese job advertisement corpus with 48K postings, 15M words, covering 34 provinces and 16 occupational domains, designed for NLP and labor market analytics research.
Details
Motivation: There's a lack of large-scale, publicly available Vietnamese job advertisement datasets for NLP research and labor market analysis, limiting progress in Vietnamese-specific language modeling and socio-economic studies.Method: Collected 48,092 job postings from all 34 Vietnamese provinces, covering 16 occupational domains and multiple employment types. Created structured dataset with job titles, categories, salaries, skills, and conditions. Benchmarked LLMs on job category classification and salary estimation tasks.
Result: Instruction-tuned models like Qwen2.5-7B-Instruct and Llama-SEA-LION-v3-8B-IT showed notable gains in few-shot and fine-tuned settings. The dataset reveals challenges in multilingual and Vietnamese-specific modeling for labor market prediction.
Conclusion: VietJobs establishes a new benchmark for Vietnamese NLP and provides valuable foundation for future research on recruitment language, socio-economic representation, and AI-driven labor market analysis.
Abstract: VietJobs is the first large-scale, publicly available corpus of Vietnamese job advertisements, comprising 48,092 postings and over 15 million words collected from all 34 provinces and municipalities across Vietnam. The dataset provides extensive linguistic and structured information, including job titles, categories, salaries, skills, and employment conditions, covering 16 occupational domains and multiple employment types (full-time, part-time, and internship). Designed to support research in natural language processing and labour market analytics, VietJobs captures substantial linguistic, regional, and socio-economic diversity. We benchmark several generative large language models (LLMs) on two core tasks: job category classification and salary estimation. Instruction-tuned models such as Qwen2.5-7B-Instruct and Llama-SEA-LION-v3-8B-IT demonstrate notable gains under few-shot and fine-tuned settings, while highlighting challenges in multilingual and Vietnamese-specific modelling for structured labour market prediction. VietJobs establishes a new benchmark for Vietnamese NLP and offers a valuable foundation for future research on recruitment language, socio-economic representation, and AI-driven labour market analysis. All code and resources are available at: https://github.com/VinNLP/VietJobs.
[70] Oral to Web: Digitizing ‘Zero Resource’Languages of Bangladesh
Mohammad Mamun Or Rashid
Main category: cs.CL
TL;DR: First national-scale parallel multimodal dataset for Bangladesh’s 42 minority languages with text, audio, and IPA transcriptions for endangered language documentation and NLP research.
Details
Motivation: Bangladesh has approximately 40 minority languages spanning four language families, many endangered and lacking systematic digital documentation. There's a need for cross-family parallel multimodal corpus for these computationally "zero resource" languages to support preservation, research, and NLP development.Method: Systematic 90-day fieldwork across nine districts involving 16 data collectors, 77 speakers, and 43 validators. Used predefined elicitation template with 2224 items at three linguistic levels: isolated lexical items (475 words), grammatical constructions (887 sentences), and directed speech (862 prompts). Post-field processing included IPA transcription by 10 linguists with independent adjudication by 6 reviewers.
Result: Created Multilingual Cloud Corpus with 85,792 structured textual entries (Bengali stimulus, English translation, IPA transcription) and ~107 hours of transcribed audio recordings covering 42 language varieties from four language families plus two unclassified languages. Dataset publicly accessible via multiling.cloud platform.
Conclusion: The corpus provides valuable resource for endangered language documentation, low-resource NLP, and digital preservation in linguistically diverse developing countries, establishing first national-scale parallel multimodal dataset for Bangladesh’s minority languages.
Abstract: We present the Multilingual Cloud Corpus, the first national-scale, parallel, multimodal linguistic dataset of Bangladesh’s ethnic and indigenous languages. Despite being home to approximately 40 minority languages spanning four language families, Bangladesh has lacked a systematic, cross-family digital corpus for these predominantly oral, computationally “zero resource” varieties, 14 of which are classified as endangered. Our corpus comprises 85792 structured textual entries, each containing a Bengali stimulus text, an English translation, and an IPA transcription, together with approximately 107 hours of transcribed audio recordings, covering 42 language varieties from the Tibeto-Burman, Indo-European, Austro-Asiatic, and Dravidian families, plus two genetically unclassified languages. The data were collected through systematic fieldwork over 90 days across nine districts of Bangladesh, involving 16 data collectors, 77 speakers, and 43 validators, following a predefined elicitation template of 2224 unique items organized at three levels of linguistic granularity: isolated lexical items (475 words across 22 semantic domains), grammatical constructions (887 sentences across 21 categories including verbal conjugation paradigms), and directed speech (862 prompts across 46 conversational scenarios). Post-field processing included IPA transcription by 10 linguists with independent adjudication by 6 reviewers. The complete dataset is publicly accessible through the Multilingual Cloud platform (multiling.cloud), providing searchable access to annotated audio and textual data for all documented varieties. We describe the corpus design, fieldwork methodology, dataset structure, and per-language coverage, and discuss implications for endangered language documentation, low-resource NLP, and digital preservation in linguistically diverse developing countries.
[71] A theoretical model of dynamical grammatical gender shifting based on set-valued set function
Mohamed El Idrissi
Main category: cs.CL
TL;DR: A computational model for analyzing noun morphology variations, focusing on gender and semantic distinctions through template-based mapping.
Details
Motivation: To understand the complex patterns of noun morphology variations across languages, particularly grammatical gender shifts and semantic distinctions, through formal computational modeling.Method: Proposes a Template-Based and Modular Cognitive model using a set-valued set function h: P(M)→P(M) to map lexical items onto morphological templates, analyzing noun-to-noun derivation and template shifts.
Result: Demonstrates how gender shifts and other morphological variations arise during lexical changes, particularly in Riffian, and provides a unified framework for understanding morphological markings across languages.
Conclusion: The mathematical model challenges conventional views of word formation, contributes to understanding morphosyntactic variation, and has potential applications in linguistic pattern modeling.
Abstract: This study investigates the diverse characteristics of nouns, focusing on both semantic (e.g., countable/uncountable) and morphosyntactic (e.g., masculine/feminine) distinctions. We explore inter-word variations for gender markers in noun morphology. Grammatical gender shift is a widespread phenomenon in languages around the world. The aim is to uncover through a formal model the underlying patterns governing the variation of lexemes. To this end, we propose a new computational component dedicated to pairing items with morphological templates (e.g., the result of a generated item-template pair: (funas, ${N, +SG, -PL, -M, +F, -COL, +SING}$), with its spell-out form: $ð$a-funast ‘cow’). This process is formally represented by the Template-Based and Modular Cognitive model. This proposed model, defined by a set-valued set function $h : \mathscr{P}(M) \rightarrow \mathscr{P}(M)$, predicts the nonlinear dynamic mapping of lexical items onto morphological templates. By applying this formalism, we present a unified framework for understanding the complexities of morphological markings across languages. Through empirical observations, we demonstrate how these shifts, as well as non-gender shifts, arise during lexical changes, especially in Riffian. Our model posits that these variant markings emerge due to template shifts occurring during word and meaning’s formation. By formally demonstrating that conversion is applicable to noun-to-noun derivation, we challenge and broaden the conventional view of word formation. This mathematical model not only contributes to a deeper understanding of morphosyntactic variation but also offers potential applications in other fields requiring precise modelling of linguistic patterns.
[72] Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution
Qiao Jin, Yin Fang, Lauren He, Yifan Yang, Guangzhi Xiong, Zhizheng Wang, Nicholas Wan, Joey Chan, Donald C. Comeau, Robert Leaman, Charalampos S. Floudas, Aidong Zhang, Michael F. Chiang, Yifan Peng, Zhiyong Lu
Main category: cs.CL
TL;DR: Med-V1 is a 3B parameter biomedical evidence attribution model that outperforms base models by 27-71% on verification tasks, matches frontier LLMs like GPT-5, and enables hallucination detection in biomedical text.
Details
Motivation: Current LLMs for biomedical evidence attribution require expensive frontier models like GPT-5, making large-scale deployment impractical. There's a need for efficient, lightweight models that can accurately verify claims and detect hallucinations in biomedical literature.Method: Developed Med-V1 family of small language models (3B parameters) trained on high-quality synthetic data specifically created for biomedical verification tasks. Models are evaluated on five biomedical benchmarks unified into verification format.
Result: Med-V1 substantially outperforms base models by 27.0% to 71.3% on biomedical verification benchmarks and performs comparably to frontier LLMs like GPT-5. Successfully used to quantify hallucinations in LLM-generated answers under different citation instructions and identify evidence misattributions in clinical guidelines.
Conclusion: Med-V1 provides an efficient, accurate lightweight alternative to frontier LLMs for practical biomedical evidence attribution and verification, enabling scalable hallucination detection and evidence validation in clinical applications.
Abstract: Assessing whether an article supports an assertion is essential for hallucination detection and claim verification. While large language models (LLMs) have the potential to automate this task, achieving strong performance requires frontier models such as GPT-5 that are prohibitively expensive to deploy at scale. To efficiently perform biomedical evidence attribution, we present Med-V1, a family of small language models with only three billion parameters. Trained on high-quality synthetic data newly developed in this study, Med-V1 substantially outperforms (+27.0% to +71.3%) its base models on five biomedical benchmarks unified into a verification format. Despite its smaller size, Med-V1 performs comparably to frontier LLMs such as GPT-5, along with high-quality explanations for its predictions. We use Med-V1 to conduct a first-of-its-kind use case study that quantifies hallucinations in LLM-generated answers under different citation instructions. Results show that the format instruction strongly affects citation validity and hallucination, with GPT-5 generating more claims but exhibiting hallucination rates similar to GPT-4o. Additionally, we present a second use case showing that Med-V1 can automatically identify high-stakes evidence misattributions in clinical practice guidelines, revealing potentially negative public health impacts that are otherwise challenging to identify at scale. Overall, Med-V1 provides an efficient and accurate lightweight alternative to frontier LLMs for practical and real-world applications in biomedical evidence attribution and verification tasks. Med-V1 is available at https://github.com/ncbi-nlp/Med-V1.
[73] PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration
Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery
Main category: cs.CL
TL;DR: PersianPunc: A large-scale Persian punctuation restoration dataset and BERT-based model that outperforms LLMs by avoiding over-correction while being computationally efficient.
Details
Motivation: Punctuation restoration is crucial for improving ASR output readability but remains underexplored for Persian. Existing approaches using large language models suffer from over-correction (introducing unwanted text edits) and high computational costs.Method: Created PersianPunc dataset (17M samples) through systematic aggregation/filtering of existing textual resources. Formulated punctuation restoration as token-level sequence labeling and fine-tuned ParsBERT (Persian BERT).
Result: BERT-based approach achieved 91.33% macro-averaged F1 score on test set, outperforming LLMs which suffered from over-correction (introducing undesired text edits beyond punctuation) and higher computational requirements.
Conclusion: Lightweight BERT-based approach is superior to LLMs for punctuation restoration, especially for speech-to-text pipelines, due to avoiding over-correction and being computationally efficient. Dataset and model are publicly available.
Abstract: Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet remains underexplored for Persian despite its importance. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance. Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from critical limitations: over-correction tendencies that introduce undesired edits beyond punctuation insertion (particularly problematic for speech-to-text pipelines) and substantially higher computational requirements. Our lightweight BERT-based approach achieves a macro-averaged F1 score of 91.33% on our test set while maintaining efficiency suitable for real-time applications. We make our dataset (https://huggingface.co/datasets/MohammadJRanjbar/persian-punctuation-restoration) and model (https://huggingface.co/MohammadJRanjbar/parsbert-persian-punctuation) publicly available to facilitate future research in Persian NLP and provide a scalable framework applicable to other morphologically rich, low-resource languages.
[74] A Multilingual Human Annotated Corpus of Original and Easy-to-Read Texts to Support Access to Democratic Participatory Processes
Stefan Bott, Verena Riegler, Horacio Saggion, Almudena Rascón Alcaina, Nouran Khallaf
Main category: cs.CL
TL;DR: A multilingual corpus of original texts in Spanish, Catalan, and Italian with human-expert simplifications to Easy-to-Read level, developed for democratic participation research.
Details
Motivation: To address the lack of high-quality training and evaluation materials for automatic text simplification systems, particularly for less-resourced languages like Spanish, Catalan, and Italian, and to support research on Easy-to-Read language for democratic participation.Method: Compiled original texts from domains relevant to democratic participation, selected based on relevance, copyright availability, and ethical standards. Texts were simplified to Easy-to-Read level by human experts in text simplification.
Result: Created the first annotated corpus of its kind for Catalan language, plus significant contributions for Spanish and Italian. The corpus includes different text types and will be made freely accessible to the public.
Conclusion: This corpus fills an important gap in resources for text simplification research, particularly for less-resourced languages, and supports the iDEM project’s goal of assessing Easy-to-Read language impact on democratic participation.
Abstract: Being able to understand information is a key factor for a self-determined life and society. It is also very important for participating in democratic processes. The study of automatic text simplification is often limited by the availability of high quality material for the training and evaluation on automatic simplifiers. This is true for English, but more so for less resourced languages like Spanish, Catalan and Italian. In order to fill this gap, we present a corpus of original texts for these 3 languages, with high quality simplification produced by human experts in text simplification. It was developed within the iDEM project to assess the impact of Easy-to-Read (E2R) language for democratic participation. The original texts were compiled from domains related to this topic. The corpus includes different text types, selected based on relevance, copyright availability, and ethical standards. All texts were simplified to E2R level. The corpus is particularity valuable because it includes the first annotated corpus of its kind for the Catalan language. It also represents a noteworthy contribution for Spanish and Italian, offering high-quality, human-annotated language resources that are rarely available in these domains. The corpus will be made freely accessible to the public.
[75] DiSCTT: Consensus-Guided Self-Curriculum for Efficient Test-Time Adaptation in Reasoning
Mohammad Mahdi Moradi, Sudhir Mudur
Main category: cs.CL
TL;DR: DiSCTT is a difficulty-aware test-time adaptation framework for LLMs that uses consensus-based uncertainty estimation to dynamically allocate optimization strategies: high-consensus inputs get supervised fine-tuning with pseudo-labels, while low-consensus inputs get reinforcement learning with consensus regularization.
Details
Motivation: Existing test-time adaptation approaches apply uniform optimization objectives across all inputs, leading to inefficient or unstable adaptation on heterogeneous reasoning problems. There's a need for more adaptive strategies that account for instance difficulty and uncertainty.Method: DiSCTT estimates instance-level epistemic uncertainty from agreement among sampled reasoning trajectories. Based on consensus levels: high-consensus inputs are consolidated via supervised fine-tuning using majority-agreed solutions as pseudo-labels; low-consensus inputs are optimized via reinforcement learning with a consensus-regularized objective that encourages diversity under relevance constraints.
Result: Across mathematical and general reasoning benchmarks, DiSCTT consistently outperforms strong test-time adaptation baselines, achieving higher accuracy with reduced variance and substantially lower computation and wall-clock training times.
Conclusion: Explicitly accounting for instance difficulty and uncertainty enables more stable, efficient, and effective test-time adaptation for reasoning models.
Abstract: Test-time adaptation offers a promising avenue for improving reasoning performance in large language models without additional supervision, but existing approaches often apply a uniform optimization objective across all inputs, leading to inefficient or unstable adaptation on heterogeneous reasoning problems. We propose DiSCTT, a difficulty-aware, consensus-guided self-curriculum framework that dynamically allocates test-time optimization strategies based on instance-level epistemic uncertainty estimated from agreement among sampled reasoning trajectories. Inputs with high consensus are consolidated via supervised fine-tuning using majority-agreed solutions as pseudo-labels, while low-consensus inputs are optimized via reinforcement learning with a consensus-regularized objective that encourages diversity under relevance constraints. Across a broad suite of mathematical and general reasoning benchmarks, DiSCTT consistently outperforms strong test-time adaptation baselines, achieving higher accuracy with reduced variance and substantially lower computation and wall-clock training times. These results demonstrate that explicitly accounting for instance difficulty and uncertainty enables more stable, efficient, and effective test-time adaptation for reasoning models.
[76] Progressive Residual Warmup for Language Model Pretraining
Tianhao Chen, Xin Xu, Lu Yin, Hao Chen, Yang Wang, Shizhe Diao, Can Yang
Main category: cs.CL
TL;DR: ProRes is a progressive residual warmup method for transformer pretraining that gradually activates deeper layers to improve stability and convergence.
Details
Motivation: Transformer architectures have stability and convergence challenges during pretraining. The logical dependency between sequentially stacked layers suggests that deeper layers should wait for earlier layers to stabilize before contributing to learning.Method: ProRes multiplies each layer’s residual with a scalar that gradually warms up from 0 to 1, with deeper layers taking longer warmup steps. This implements an “early layer learns first” philosophy where deeper layers wait for early layers to settle.
Result: ProRes stabilizes pretraining, introduces a unique optimization trajectory, leads to faster convergence, stronger generalization, and better downstream performance across various model scales, normalization, and initialization schemes.
Conclusion: Progressive Residual Warmup is an effective technique for improving transformer pretraining stability and efficiency by aligning layer activation with the logical dependency between sequentially stacked layers.
Abstract: Transformer architectures serve as the backbone for most modern Large Language Models, therefore their pretraining stability and convergence speed are of central concern. Motivated by the logical dependency of sequentially stacked layers, we propose Progressive Residual Warmup (ProRes) for language model pretraining. ProRes implements an “early layer learns first” philosophy by multiplying each layer’s residual with a scalar that gradually warms up from 0 to 1, with deeper layers taking longer warmup steps. In this way, deeper layers wait for early layers to settle into a more stable regime before contributing to learning. We demonstrate the effectiveness of ProRes through pretraining experiments across various model scales, as well as normalization and initialization schemes. Comprehensive analysis shows that ProRes not only stabilizes pretraining but also introduces a unique optimization trajectory, leading to faster convergence, stronger generalization and better downstream performance. Our code is available at https://github.com/dandingsky/ProRes.
[77] An Exploration-Analysis-Disambiguation Reasoning Framework for Word Sense Disambiguation with Low-Parameter LLMs
Deshan Sumanathilaka, Nicholas Micallef, Julian Hough
Main category: cs.CL
TL;DR: Low-parameter LLMs (<4B) achieve GPT-4-level Word Sense Disambiguation performance through reasoning-focused fine-tuning with Chain-of-Thought and neighbor-word analysis, reducing computational demands.
Details
Motivation: Large LLMs like GPT-4-Turbo show state-of-the-art WSD performance but have high computational and energy costs, limiting scalability. The paper investigates whether smaller LLMs (<4B parameters) can achieve comparable results through targeted fine-tuning strategies.Method: Fine-tuned eight small-scale open-source LLMs (Gemma, Qwen) using FEWS dataset augmented with rationale-rich annotations. Employed Chain-of-Thought reasoning combined with neighbor-word analysis to enhance sense identification capabilities.
Result: Gemma-3-4B and Qwen-3-4B models outperformed medium-parameter baselines and state-of-the-art models on FEWS, achieving performance comparable to GPT-4-Turbo in zero-shot settings. Models showed robust generalization to unseen senses and strong cross-domain adaptability on the “Fool Me If You Can” dataset without task-specific fine-tuning.
Conclusion: Carefully crafted reasoning-centric fine-tuning enables low-parameter LLMs to deliver accurate WSD while substantially reducing computational and energy demands, demonstrating scalability advantages over large models.
Abstract: Word Sense Disambiguation (WSD) remains a key challenge in Natural Language Processing (NLP), especially when dealing with rare or domain-specific senses that are often misinterpreted. While modern high-parameter Large Language Models (LLMs) such as GPT-4-Turbo have shown state-of-the-art WSD performance, their computational and energy demands limit scalability. This study investigates whether low-parameter LLMs (<4B parameters) can achieve comparable results through fine-tuning strategies that emphasize reasoning-driven sense identification. Using the FEWS dataset augmented with semi-automated, rationale-rich annotations, we fine-tune eight small-scale open-source LLMs (e.g. Gemma and Qwen). Our results reveal that Chain-of-Thought (CoT)-based reasoning combined with neighbour-word analysis achieves performance comparable to GPT-4-Turbo in zero-shot settings. Importantly, Gemma-3-4B and Qwen-3-4B models consistently outperform all medium-parameter baselines and state-of-the-art models on FEWS, with robust generalization to unseen senses. Furthermore, evaluation on the unseen “Fool Me If You Can’’ dataset confirms strong cross-domain adaptability without task-specific fine-tuning. This work demonstrates that with carefully crafted reasoning-centric fine-tuning, low-parameter LLMs can deliver accurate WSD while substantially reducing computational and energy demands.
[78] Ensembling Language Models with Sequential Monte Carlo
Robin Shing Moon Chan, Tianyu Liu, Samuel Kiegeland, Clemente Pasti, Jacob Hoover Vigly, Timothy J. O’Donnell, Ryan Cotterell, Tim Vieira
Main category: cs.CL
TL;DR: A framework for ensembling language models using f-ensemble distributions with byte-level SMC sampling, enabling ensembles of models with different vocabularies.
Details
Motivation: Language model performance is sensitive to model and prompt choices, but traditional ensembling is challenging during decoding due to biased approximations when aggregating next-token probabilities.Method: Introduces f-ensemble distributions for composing K language models using various aggregation functions f, with a byte-level sequential Monte Carlo algorithm that operates in shared character space to handle mismatching vocabularies.
Result: Alternative aggregation strategies outperform traditional probability averaging, and better posterior approximations lead to better ensemble performance across various structured text generation tasks.
Conclusion: The framework provides principled ensembling for language models during decoding, enabling effective combination of diverse models with different vocabularies through flexible aggregation functions.
Abstract: Practitioners have access to an abundance of language models and prompting strategies for solving many language modeling tasks; yet prior work shows that modeling performance is highly sensitive to both choices. Classical machine learning ensembling techniques offer a principled approach: aggregate predictions from multiple sources to achieve better performance than any single one. However, applying ensembling to language models during decoding is challenging: naively aggregating next-token probabilities yields samples from a locally normalized, biased approximation of the generally intractable ensemble distribution over strings. In this work, we introduce a unified framework for composing $K$ language models into $f$-ensemble distributions for a wide range of functions $f\colon\mathbb{R}{\geq 0}^{K}\to\mathbb{R}{\geq 0}$. To sample from these distributions, we propose a byte-level sequential Monte Carlo (SMC) algorithm that operates in a shared character space, enabling ensembles of models with mismatching vocabularies and consistent sampling in the limit. We evaluate a family of $f$-ensembles across prompt and model combinations for various structured text generation tasks, highlighting the benefits of alternative aggregation strategies over traditional probability averaging, and showing that better posterior approximations can yield better ensemble performance.
[79] FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling
Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar, Tri Dao
Main category: cs.CL
TL;DR: FlashAttention-4 optimizes attention computation for Blackwell GPUs (B200/GB200) with redesigned pipelines, software-emulated exponential operations, and tensor memory usage, achieving significant speedups over previous implementations.
Details
Motivation: Attention is the bottleneck for large language models and long-context applications. While FlashAttention-3 optimized for Hopper GPUs, the industry has transitioned to Blackwell-based systems (B200/GB200) with asymmetric hardware scaling where tensor core throughput doubles but other functional units scale more slowly, creating new bottlenecks that need addressing.Method: Three main techniques: (1) redesigned pipelines exploiting fully asynchronous MMA operations and larger tile sizes, (2) software-emulated exponential and conditional softmax rescaling to reduce non-matmul operations, and (3) leveraging tensor memory and 2-CTA MMA mode to reduce shared memory traffic and atomic adds in backward pass. Implemented entirely in CuTe-DSL embedded in Python.
Result: Achieves up to 1.3× speedup over cuDNN 9.13 and 2.7× over Triton on B200 GPUs with BF16, reaching up to 1613 TFLOPs/s (71% utilization). Also achieves 20-30× faster compile times compared to traditional C++ template-based approaches.
Conclusion: FlashAttention-4 successfully addresses the shifting bottlenecks on Blackwell GPUs through algorithmic innovations and modern implementation approaches, providing significant performance improvements for attention computation in large language models.
Abstract: Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. While FlashAttention-3 optimized attention for Hopper GPUs through asynchronous execution and warp specialization, it primarily targets the H100 architecture. The AI industry has rapidly transitioned to deploying Blackwell-based systems such as the B200 and GB200, which exhibit fundamentally different performance characteristics due to asymmetric hardware scaling: tensor core throughput doubles while other functional units (shared memory bandwidth, exponential units) scale more slowly or remain unchanged. We develop several techniques to address these shifting bottlenecks on Blackwell GPUs: (1) redesigned pipelines that exploit fully asynchronous MMA operations and larger tile sizes, (2) software-emulated exponential and conditional softmax rescaling that reduces non-matmul operations, and (3) leveraging tensor memory and the 2-CTA MMA mode to reduce shared memory traffic and atomic adds in the backward pass. We demonstrate that our method, FlashAttention-4, achieves up to 1.3$\times$ speedup over cuDNN 9.13 and 2.7$\times$ over Triton on B200 GPUs with BF16, reaching up to 1613 TFLOPs/s (71% utilization). Beyond algorithmic innovations, we implement FlashAttention-4 entirely in CuTe-DSL embedded in Python, achieving 20-30$\times$ faster compile times compared to traditional C++ template-based approaches while maintaining full expressivity.
[80] DEBISS: a Corpus of Individual, Semi-structured and Spoken Debates
Klaywert Danillo Ferreira de Souza, David Eduardo Pereira, Cláudio E. C. Campelo, Larissa Lucena Vasconcelos
Main category: cs.CL
TL;DR: DEBISS corpus: A collection of spoken individual debates with semi-structured features and comprehensive NLP annotations including speech-to-text, speaker diarization, argument mining, and debater quality assessment.
Details
Motivation: Debates are essential in daily life with diverse applications, structures, and formats, but there is a notable scarcity of debate corpora in the state of the art that account for these variations, making development challenging.Method: Proposes the DEBISS corpus - a collection of spoken and individual debates with semi-structured features, annotated with a broad range of NLP tasks.
Result: Creates a comprehensive debate corpus with annotations for speech-to-text, speaker diarization, argument mining, and debater quality assessment to address the scarcity of debate resources.
Conclusion: The DEBISS corpus fills an important gap in debate resources by providing a versatile collection with comprehensive NLP annotations for various debate applications and research tasks.
Abstract: The process of debating is essential in our daily lives, whether in studying, work activities, simple everyday discussions, political debates on TV, or online discussions on social networks. The range of uses for debates is broad. Due to the diverse applications, structures, and formats of debates, developing corpora that account for these variations can be challenging, and the scarcity of debate corpora in the state of the art is notable. For this reason, the current research proposes the DEBISS corpus: a collection of spoken and individual debates with semi-structured features. With a broad range of NLP task annotations, such as speech-to-text, speaker diarization, argument mining, and debater quality assessment.
[81] NCTB-QA: A Large-Scale Bangla Educational Question Answering Dataset and Benchmarking Performance
Abrar Eyasir, Tahsin Ahmed, Muhammad Ibrahim
Main category: cs.CL
TL;DR: NCTB-QA is a large-scale Bangla question answering dataset with balanced answerable/unanswerable questions, used to benchmark transformer models and show substantial improvements through fine-tuning.
Details
Motivation: Reading comprehension systems for low-resource languages struggle with unanswerable questions, producing unreliable responses when answers are absent from context. Existing Bangla datasets lack balanced unanswerable questions.Method: Created NCTB-QA dataset with 87,805 question-answer pairs from 50 Bangladeshi textbooks, maintaining 57.25% answerable and 42.75% unanswerable questions. Includes adversarial instances with plausible distractors. Benchmarked BERT, RoBERTa, and ELECTRA models through fine-tuning.
Result: BERT achieved 313% relative improvement in F1 score (0.150 to 0.620). All models showed significant improvements in semantic answer quality measured by BERTScore. The dataset serves as a challenging benchmark for Bangla educational QA.
Conclusion: NCTB-QA establishes a valuable benchmark for Bangla question answering, demonstrating that domain-specific fine-tuning is critical for robust performance in low-resource language settings.
Abstract: Reading comprehension systems for low-resource languages face significant challenges in handling unanswerable questions. These systems tend to produce unreliable responses when correct answers are absent from context. To solve this problem, we introduce NCTB-QA, a large-scale Bangla question answering dataset comprising 87,805 question-answer pairs extracted from 50 textbooks published by Bangladesh’s National Curriculum and Textbook Board. Unlike existing Bangla datasets, NCTB-QA maintains a balanced distribution of answerable (57.25%) and unanswerable (42.75%) questions. NCTB-QA also includes adversarially designed instances containing plausible distractors. We benchmark three transformer-based models (BERT, RoBERTa, ELECTRA) and demonstrate substantial improvements through fine-tuning. BERT achieves 313% relative improvement in F1 score (0.150 to 0.620). Semantic answer quality measured by BERTScore also increases significantly across all models. Our results establish NCTB-QA as a challenging benchmark for Bangla educational question answering. This study demonstrates that domain-specific fine-tuning is critical for robust performance in low-resource settings.
[82] Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval
Artem Vazhentsev, Maria Marina, Daniil Moskovskiy, Sergey Pletenev, Mikhail Seleznyov, Mikhail Salnikov, Elena Tutubalina, Vasily Konovalov, Irina Nikishina, Alexander Panchenko, Viktor Moskvoretskii
Main category: cs.CL
TL;DR: INTRA enables fact-checking without external retrieval by leveraging internal model representations, achieving state-of-the-art performance across diverse datasets and generalization scenarios.
Details
Motivation: Current fact-checking methods rely on external knowledge retrieval, which is constrained by retrieval errors and data availability, while leaving LLMs' intrinsic fact-verification capabilities largely unused.Method: Proposes INTRA (Internal Representation Analysis), a method that exploits interactions between internal model representations for fact-checking without external retrieval, evaluated across 9 datasets, 18 methods, and 3 models.
Result: INTRA achieves state-of-the-art performance with strong generalization across long-tail knowledge, claim source variations, multilinguality, and long-form generation scenarios.
Conclusion: Fact-checking without retrieval is a promising research direction that can complement retrieval-based frameworks, improve scalability, and enable integration into generation processes.
Abstract: Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). To enhance trust, natural language claims from diverse sources, including human-written text, web content, and model outputs, are commonly checked for factuality by retrieving external knowledge and using an LLM to verify the faithfulness of claims to the retrieved evidence. As a result, such methods are constrained by retrieval errors and external data availability, while leaving the models intrinsic fact-verification capabilities largely unused. We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source. To study this setting, we introduce a comprehensive evaluation framework focused on generalization, testing robustness to (i) long-tail knowledge, (ii) variation in claim sources, (iii) multilinguality, and (iv) long-form generation. Across 9 datasets, 18 methods and 3 models, our experiments indicate that logit-based approaches often underperform compared to those that leverage internal model representations. Building on this finding, we introduce INTRA, a method that exploits interactions between internal representations and achieves state-of-the-art performance with strong generalization. More broadly, our work establishes fact-checking without retrieval as a promising research direction that can complement retrieval-based frameworks, improve scalability, and enable the use of such systems as reward signals during training or as components integrated into the generation process.
[83] Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought
Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow, Atticus Geiger, Owen Lewis, Jack Merullo
Main category: cs.CL
TL;DR: Analysis reveals performative chain-of-thought in reasoning models where models show high confidence early but continue generating tokens, with activation probing enabling early exit to reduce computation while maintaining accuracy.
Details
Motivation: To investigate whether chain-of-thought reasoning in large language models represents genuine reasoning or is merely performative "reasoning theater" where models already know answers but continue generating tokens.Method: Used activation probing, early forced answering, and CoT monitoring across two large models (DeepSeek-R1 671B & GPT-OSS 120B) on tasks of varying difficulty (easy MMLU recall vs. difficult multihop GPQA-Diamond questions).
Result: Found task difficulty-specific differences: final answers decodable from activations far earlier in CoT than monitors can detect, especially for easy questions. Inflection points (backtracking, ‘aha’ moments) correlated with genuine belief shifts. Probe-guided early exit reduced tokens by 80% on MMLU and 30% on GPQA-Diamond with similar accuracy.
Conclusion: Models exhibit performative CoT where they continue generating tokens after forming confident answers, but attention probing can efficiently detect this performative reasoning and enable adaptive computation through early exit strategies.
Abstract: We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B & GPT-OSS 120B) and find task difficulty-specific differences: The model’s final answer is decodable from activations far earlier in CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, ‘aha’ moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned “reasoning theater.” Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.
[84] INMS: Memory Sharing for Large Language Model based Agents
Hang Gao, Yongfeng Zhang
Main category: cs.CL
TL;DR: INMS framework enables multi-agent systems to share conversational memory asynchronously, improving performance through collective knowledge exchange and dynamic memory filtering.
Details
Motivation: Current LLM-based agents operate in isolation with static databases, lacking the dynamic knowledge exchange found in human dialogue, which limits their performance in open-ended scenarios.Method: Proposes INteractive Memory Sharing (INMS) framework with real-time memory filtering, storage, and retrieval to create a shared conversational memory pool for asynchronous multi-agent interaction.
Result: Extensive experiments across three datasets show INMS significantly improves agent performance by effectively modeling multi-agent interaction and collective knowledge sharing.
Conclusion: INMS bridges the gap between isolated agent operation and human-like dialogue by enabling continuous memory sharing, promoting collective self-enhancement in multi-agent systems.
Abstract: While Large Language Model (LLM) based agents excel at complex tasks, their performance in open-ended scenarios is often constrained by isolated operation and reliance on static databases, missing the dynamic knowledge exchange of human dialogue. To bridge this gap, we propose the INteractive Memory Sharing (INMS) framework, an asynchronous interaction paradigm for multi-agent systems. By integrating real-time memory filtering, storage, and retrieval, INMS establishes a shared conversational memory pool. This enables continuous, dialogue-like memory sharing among agents, promoting collective self-enhancement and dynamically refining the retrieval mediator based on interaction history. Extensive experiments across three datasets demonstrate that INMS significantly improves agent performance by effectively modeling multi-agent interaction and collective knowledge sharing.
[85] ShIOEnv: A Command Evaluation Environment for Grammar-Constrained Synthesis and Execution Behavior Modeling
Jarrod Ragsdale, Rajendra Boppana
Main category: cs.CL
TL;DR: ShIOEnv: A Bash shell environment for generating command-line input-output pairs to train models that better understand complex CLI execution behavior, addressing data gaps in current approaches.
Details
Motivation: Current CLI interaction models struggle with complex command compositions and system-dependent execution behavior due to lack of shell input-output (ShIO) training data, creating a need for better datasets and training environments.Method: Created ShIOEnv, a Gymnasium-compatible Bash shell environment that uses grammar-derived options to constrain argument synthesis to syntactically valid commands, and introduces a self-supervised irreducibility signal to measure information density in inputs.
Result: Curated and released 2.1M input-output pairs for modeling Bash command execution feedback. Models trained on grammar-constrained datasets with higher maximum irreducibility achieve greater accuracy in modeling execution behavior than prior execution-free baselines.
Conclusion: ShIOEnv addresses the data gap for CLI execution modeling, enabling better training of models that understand complex command compositions and system-dependent behavior through grammar-constrained synthesis and irreducibility-based data curation.
Abstract: Modeling of command-line interface (CLI) interaction has enabled flexible, execution-free output presentation. However, current approaches struggle to model inputs with complex compositions and inputs whose execution behavior depends on system characteristics. This is due to a lack of shell input-output (ShIO) data in the training distributions used by the models in these approaches. To address this data gap, we present ShIOEnv, a Gymnasium-compatible Bash shell environment for command synthesis and system-grounded execution behavior capturing. To concentrate synthesis on productive regions of the state-action space, we temporally abstract argument construction into grammar-derived options, thereby constraining synthesis to syntactically valid arguments. We introduce a self-supervised irreducibility signal to approximate the proportion of arguments that contribute to the observed execution behavior, serving as a measure of information density for each input. Using ShIOEnv, we curate and release 2.1M input-output pairs for modeling feedback from Bash command execution. We find that models trained on grammar-constrained datasets with higher maximum irreducibility achieve greater accuracy when modeling the execution behavior of user-sourced inputs than prior execution-free baselines.
[86] SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models
Thinh Pham, Nguyen Nguyen, Pratibha Zunjare, Weiyuan Chen, Yu-Min Tseng, Tu Vu
Main category: cs.CL
TL;DR: SealQA is a benchmark for evaluating search-augmented language models on challenging fact-seeking questions where web search yields conflicting, noisy, or unhelpful results, with three variants testing different capabilities.
Details
Motivation: Current search-augmented language models struggle with factual accuracy when web search results are unreliable, conflicting, or noisy. There's a need for better benchmarks to evaluate models' reasoning capabilities in these challenging scenarios.Method: Created three benchmark variants: Seal-0 (main) for most challenging questions where models achieve near-zero accuracy; Seal-Hard for factual accuracy and reasoning; and LongSeal for long-context, multi-document reasoning in “needle-in-a-haystack” settings.
Result: Frontier LLMs perform poorly across all SealQA flavors. On Seal-0, best agentic models achieve only 17.1% and 6.3% accuracy. Advanced reasoning models are highly vulnerable to noisy search results, and increasing test-time compute doesn’t reliably improve performance.
Conclusion: Current search-augmented language models have critical limitations in handling noisy, conflicting search results. The benchmark reveals that even advanced models struggle with factual accuracy and reasoning in challenging scenarios, highlighting important research gaps.
Abstract: We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) Seal-0 (main) and (2) Seal-Hard, which assess factual accuracy and reasoning capabilities, with Seal-0 focusing on the most challenging questions where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy; and (3) LongSeal, which extends SealQA to test long-context, multi-document reasoning in “needle-in-a-haystack” settings. Our evaluation reveals critical limitations in current models: Even frontier LLMs perform poorly across all SealQA flavors. On Seal-0, frontier agentic models equipped with tools like o3 and o4-mini achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning efforts. We find that advanced reasoning models such as DeepSeek-R1-671B and o3-mini are highly vulnerable to noisy search results. Notably, increasing test-time compute does not yield reliable gains across o3-mini, o4-mini, and o3, with performance often plateauing or even declining early. Additionally, while recent models are less affected by the “lost-in-the-middle” issue, they still fail to reliably identify relevant documents in LongSeal when faced with numerous distractors. To facilitate future work, we release SealQA at huggingface.co/datasets/vtllms/sealqa.
[87] Why Reinforcement Fine-Tuning Enables MLLMs Preserve Prior Knowledge Better: A Data Perspective
Zhihao Zhang, Qiaole Dong, Qi Zhang, Jun Zhao, Enyu Zhou, Zhiheng Xi, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Mingqi Wu, Yanwei Fu, Tao Ji, Tao Gui, Xuanjing Huang, Kai Chen
Main category: cs.CL
TL;DR: SFT enables rapid task learning but causes catastrophic forgetting, while RFT learns slower but better retains prior knowledge in multimodal LLMs, with data distribution playing a key role in forgetting.
Details
Motivation: To understand how post-training algorithms (SFT and RFT) affect knowledge retention in multimodal large language models when adapting to new tasks, using jigsaw puzzles as a novel task not present in pretraining data.Method: Systematic study using jigsaw puzzles as novel task on Qwen2.5-VL series; analysis of learning dynamics examining magnitude and direction of training data influence on prior knowledge; comparison of SFT vs RFT; validation on math and scientific QA tasks.
Result: SFT enables rapid task acquisition but leads to catastrophic forgetting, while RFT learns more slowly but better maintains prior knowledge; RFT mainly reinforces correct samples aligned with base model’s probability landscape, causing weaker interference with prior knowledge.
Conclusion: Data distribution in post-training plays central role in forgetting, not just algorithmic differences; RFT shows promise for stable continual post-training by better preserving prior knowledge while learning new tasks.
Abstract: Post-training algorithms such as Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) are widely used to adapt (multimodal) large language models to downstream tasks. While effective at task adaptation, their impact on retaining prior knowledge remains unclear. In this paper, we introduce jigsaw puzzles as a novel task absent from existing pretraining corpora and systematically study the behavior of SFT and RFT on the open-source Qwen2.5-VL series. Our experiments reveal a sharp trade-off: SFT enables rapid task acquisition but leads to catastrophic forgetting, whereas RFT learns more slowly but better maintains prior knowledge. We study this phenomenon through learning dynamics by examining both the magnitude and direction of how training data influence prior knowledge. Our analysis shows that RFT mainly reinforces correct samples naturally aligned with the base model’s probability landscape, leading to weaker interference with prior knowledge. Moreover, training on RFT-simulated rollouts, which exert a smaller magnitude of influence and are better aligned in direction to prior knowledge, allows SFT to preserve prior knowledge better while rapidly learning new tasks. We further validate our framework on Qwen2.5 post-training in math and scientific QA, observing consistent forgetting and learning-dynamics trends. These findings suggest that the distribution of post-training data, rather than algorithmic differences alone, plays a central role in forgetting, and highlight RFT as a promising ingredient for stable continual post-training.
[88] MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining
Zhixun Chen, Ping Guo, Wenhan Han, Yifan Zhang, Binbin Liu, Haobin Lin, Fengze Liu, Yan Zhao, Bingni Zhang, Taifeng Wang, Yin Zheng, Trevor Cohn, Meng Fang
Main category: cs.CL
TL;DR: MuRating: A framework for multilingual data quality assessment that transfers English quality signals to 17 languages via translation-based projection and pairwise comparison aggregation.
Details
Motivation: Existing data quality selection methods for large language models focus almost exclusively on English, creating a gap for multilingual model training where high-quality non-English data is needed.Method: Aggregates multiple English “raters” via pairwise comparisons to learn unified document-quality scores, then projects these judgments through translation to train a multilingual evaluator on monolingual, cross-lingual, and parallel text pairs.
Result: Applied to web data, MuRating selects balanced subsets for pretraining a 1.2B-parameter LLaMA model, boosting average accuracy on both English benchmarks and multilingual evaluations, with especially large gains on knowledge-intensive tasks.
Conclusion: MuRating effectively transfers English data quality signals to multilingual contexts, improving model performance across languages while highlighting issues with translation fidelity, selection biases, and underrepresentation of narrative material.
Abstract: Data quality is a critical driver of large language model performance, yet existing model-based selection methods focus almost exclusively on English. We introduce MuRating, a scalable framework that transfers high-quality English data-quality signals into a single rater for 17 target languages. MuRating aggregates multiple English “raters” via pairwise comparisons to learn unified document-quality scores,then projects these judgments through translation to train a multilingual evaluator on monolingual, cross-lingual, and parallel text pairs. Applied to web data, MuRating selects balanced subsets of English and multilingual content to pretrain a 1.2 B-parameter LLaMA model. Compared to strong baselines, including QuRater, AskLLM, DCLM and so on, our approach boosts average accuracy on both English benchmarks and multilingual evaluations, with especially large gains on knowledge-intensive tasks. We further analyze translation fidelity, selection biases, and underrepresentation of narrative material, outlining directions for future work.
[89] Eka-Eval: An Evaluation Framework for Low-Resource Multilingual Large Language Models
Samridhi Raj Sinha, Rajvee Sheth, Abhishek Upperwal, Mayank Singh
Main category: cs.CL
TL;DR: EKA-EVAL is a unified evaluation framework for LLMs with multilingual benchmarks, zero-code web interface, and modular architecture for comprehensive model assessment.
Details
Motivation: There's a need for globally applicable, flexible, and modular evaluation frameworks that support diverse tasks, model types, and linguistic settings as LLMs rapidly evolve.Method: Developed a unified end-to-end framework combining zero-code web interface and interactive CLI, integrating 55+ multilingual benchmarks across 9 categories, supporting local/proprietary models with modular plug-and-play architecture.
Result: Comparisons against 5 baselines show at least 2x better usability, highest user satisfaction, faster setup times, and consistent benchmark reproducibility.
Conclusion: EKA-EVAL is the first comprehensive multilingual evaluation suite in a single platform, offering broad accessibility and scalability for LLM assessment.
Abstract: The rapid evolution of Large Language Models’ has underscored the need for evaluation frameworks that are globally applicable, flexible, and modular, and that support a wide range of tasks, model types, and linguistic settings. We introduce EKA-EVAL, a unified, end- to-end framework that combines a zero-code web interface and an interactive CLI to ensure broad accessibility. It integrates 55+ multilingual benchmarks across nine evaluation categories, supports local and proprietary models, and provides 11 core capabilities through a modular, plug-and-play architecture. Designed for scalable, multilingual evaluation with support for low-resource multilingual languages, EKA-EVAL is, to the best of our knowledge, the first suite to offer comprehensive coverage in a single platform. Comparisons against five existing baselines indicate improvements of at least 2x better on key usability measures, with the highest user satisfaction, faster setup times, and consistent benchmark reproducibility. The framework is open-source and publicly available at https://github.com/lingo-iitgn/eka-eval.
[90] How Quantization Shapes Bias in Large Language Models
Federico Marcuzzi, Xuefei Ning, Roy Schwartz, Iryna Gurevych
Main category: cs.CL
TL;DR: Quantization has nuanced effects on model bias: reduces toxicity, doesn’t affect sentiment much, but slightly increases stereotypes and unfairness in generative tasks, especially with aggressive compression.
Details
Motivation: To comprehensively evaluate how quantization affects model bias across different demographic subgroups, examining the trade-offs between model efficiency and ethical considerations.Method: Evaluated weight and activation quantization strategies across 13 benchmarks using both probability- and generated text-based metrics. Tested models with different architectures and reasoning abilities, examining effects on stereotypes, fairness, toxicity, and sentiment.
Result: Quantization reduces model toxicity and doesn’t significantly impact sentiment, but tends to slightly increase stereotypes and unfairness in generative tasks, especially under aggressive compression. These trends are consistent across demographic categories and model types.
Conclusion: Careful balancing of efficiency and ethical considerations is crucial when applying quantization in practice, as it has nuanced but measurable impacts on model bias.
Abstract: This work presents a comprehensive evaluation of how quantization affects model bias, with particular attention to its impact on individual demographic subgroups. We focus on weight and activation quantization strategies and examine their effects across a broad range of bias types, including stereotypes, fairness, toxicity, and sentiment. We employ both probability- and generated text-based metrics across 13 benchmarks and evaluate models that differ in architecture family and reasoning ability. Our findings show that quantization has a nuanced impact on bias: while it can reduce model toxicity and does not significantly impact sentiment, it tends to slightly increase stereotypes and unfairness in generative tasks, especially under aggressive compression. These trends are generally consistent across demographic categories and subgroups, and model types, although their magnitude depends on the specific setting. Overall, our results highlight the importance of carefully balancing efficiency and ethical considerations when applying quantization in practice.
[91] New Insights into Optimal Alignment of Acoustic and Linguistic Representations for Knowledge Transfer in ASR
Xugang Lu, Peng Shen, Hisashi Kawai
Main category: cs.CL
TL;DR: Proposes an unbalanced optimal transport-based alignment model for ASR that handles structural asymmetries between acoustic and linguistic representations through soft partial matching, improving knowledge transfer from pre-trained language models.
Details
Motivation: Aligning acoustic and linguistic representations is challenging due to inherent structural asymmetries in ASR: many-to-one mappings (multiple acoustic frames to single tokens), one-to-many mappings (acoustic transitions to multiple tokens), and acoustic frames with no linguistic counterpart (noise/silence). Existing methods struggle with these distributional mismatches and structural asymmetries.Method: Treats alignment as a detection problem and proposes an unbalanced optimal transport-based alignment model that explicitly handles distributional mismatch and structural asymmetries. Uses soft and partial matching between acoustic and linguistic modalities, ensuring every linguistic token is grounded in at least one acoustic observation while allowing flexible probabilistic mappings from acoustic to linguistic units.
Result: Experimental evaluation on a CTC-based ASR system with pre-trained language model for knowledge transfer demonstrates effectiveness in flexibly controlling degree of matching and improving ASR performance.
Conclusion: The proposed unbalanced optimal transport approach effectively addresses alignment challenges in ASR by handling structural asymmetries and distributional mismatches, enabling better knowledge transfer from pre-trained language models to acoustic models.
Abstract: Aligning acoustic and linguistic representations is a central challenge to bridge the pre-trained models in knowledge transfer for automatic speech recognition (ASR). This alignment is inherently structured and asymmetric: while multiple consecutive acoustic frames typically correspond to a single linguistic token (many-to-one), certain acoustic transition regions may relate to multiple adjacent tokens (one-to-many). Moreover, acoustic sequences often include frames with no linguistic counterpart, such as background noise or silence may lead to imbalanced matching conditions. In this work, we take a new insight to regard alignment and matching as a detection problem, where the goal is to identify meaningful correspondences with high precision and recall ensuring full coverage of linguistic tokens while flexibly handling redundant or noisy acoustic frames in transferring linguistic knowledge for ASR. Based on this new insight, we propose an unbalanced optimal transport-based alignment model that explicitly handles distributional mismatch and structural asymmetries with soft and partial matching between acoustic and linguistic modalities. Our method ensures that every linguistic token is grounded in at least one acoustic observation, while allowing for flexible, probabilistic mappings from acoustic to linguistic units. We evaluate our proposed model with experiments on an CTC-based ASR system with a pre-trained language model for knowledge transfer. Experimental results demonstrate the effectiveness of our approach in flexibly controlling degree of matching and hence to improve ASR performance.
[92] Linguistic trajectories of bipolar disorder on social media
Laurin Plank, Armin Zlomuzica
Main category: cs.CL
TL;DR: Social media language analysis reveals longitudinal linguistic changes associated with bipolar disorder diagnosis, including mood disturbance patterns and seasonal variations.
Details
Motivation: To study longitudinal language changes associated with bipolar disorder on a large scale using social media data, overcoming limitations of traditional cross-sectional psychiatric research.Method: Used social media records with novel method to infer diagnosis timelines from user self-reports, comparing users self-identifying with BD, depression, or no mental health condition.
Result: BD diagnosis onset corresponded with widespread linguistic shifts reflecting mood disturbance, psychiatric comorbidity, substance abuse, hospitalization, and altered linguistic coherence. Post-diagnosis, mood symptom discussions showed 12-month cyclical patterns consistent with seasonal mood variation.
Conclusion: Social media language captures linguistic and behavioral changes associated with BD and could complement traditional psychiatric cohort research.
Abstract: Language use offers valuable insight into affective disorders such as bipolar disorder (BD), yet past research has been cross-sectional and limited in scale. Here, we demonstrate that social media records can be leveraged to study longitudinal language change associated with BD on a large scale. Using a novel method to infer diagnosis timelines from user self-reports, we compared users self-identifying with BD, depression, or no mental health condition. The onset of BD diagnosis corresponded with widespread linguistic shifts reflecting mood disturbance, psychiatric comorbidity, substance abuse, hospitalization, medical comorbidities, interpersonal concerns, unusual thought content, and altered linguistic coherence. In the years following the diagnosis, discussions of mood symptoms were found to fluctuate periodically with a dominant 12-month cycle consistent with seasonal mood variation. These findings suggest that social media language captures linguistic and behavioral changes associated with BD and might serve as a valuable complement to traditional psychiatric cohort research.
[93] Llama-Mimi: Exploring the Limits of Flattened Speech Language Modeling
Issa Sugiura, Shuhei Kurita, Yusuke Oda, Ryuichiro Higashinaka
Main category: cs.CL
TL;DR: Llama-Mimi: A single-Transformer architecture that flattens multi-level RVQ audio tokens into a single sequence for autoregressive modeling, outperforming hierarchical models on speech tasks.
Details
Motivation: To simplify speech language model architectures by moving away from hierarchical designs (which capture multi-level RVQ token structure) toward simpler, more scalable single-Transformer architectures inspired by recent NLP progress.Method: Proposes Llama-Mimi which flattens multi-level RVQ tokens from the Mimi neural audio codec into a single sequence and models them autoregressively using a Transformer decoder, eliminating hierarchical architectural biases.
Result: Outperforms CSM-based hierarchical models on most tasks and achieves best performance on acoustic consistency. Models, code, and speech samples are publicly available.
Conclusion: Simpler single-Transformer architectures can effectively model multi-level audio tokens, achieving competitive or better performance than hierarchical models while offering scalability benefits.
Abstract: Speech Language Models (SpeechLMs) model tokenized speech to capture both semantic and acoustic information. When neural audio codecs based on Residual Vector Quantization (RVQ) are used as audio tokenizers, they produce multiple discrete tokens per time step, yielding inherently multi-level representations. To process these multi-level tokens together, prior work typically adopts hierarchical architectures to capture this structure. In contrast, recent progress in NLP has progressively reduced architectural inductive biases, moving toward simpler and more scalable single-Transformer architectures. In this work, we propose Llama-Mimi, which flattens multi-level RVQ tokens produced by the Mimi neural audio codec into a single sequence and models them autoregressively with a Transformer decoder. We show that Llama-Mimi outperforms a CSM-based hierarchical model on most tasks and achieves the best performance on acoustic consistency. Our models, code, and speech samples are publicly available.
[94] BeyondBench: Contamination-Resistant Evaluation of Reasoning in Language Models
Gaurav Srivastava, Aafiya Hussain, Zhenyu Bi, Swastik Roy, Priya Pitre, Meng Lu, Morteza Ziyadi, Xuan Wang
Main category: cs.CL
TL;DR: BeyondBench is an evaluation framework using algorithmic problem generation to create mathematically grounded problems on-the-fly, ensuring uncontaminated testing of language models’ reasoning capabilities across 44 algorithmic tasks with 117 variations.
Details
Motivation: Static benchmarks risk contamination by training data, making it difficult to determine if language models truly reason or simply recall memorized information. There's a need for evaluation frameworks that can generate fresh, uncontaminated problems to assess genuine reasoning abilities.Method: The framework generates algorithmic problems on-the-fly using mathematical algorithms, covering 44 tasks with 117 variations across three difficulty levels: Easy Suite (arithmetic/statistics), Medium Suite (sequence patterns/reasoning), and Hard Suite (NP-complete/constraint satisfaction). Each task draws from a space exceeding 10^15 unique instances with deterministically verified solutions.
Result: Evaluation of 101 language models (85 open-source, 16 closed-source) revealed consistent reasoning deficiencies, with performance degrading sharply as complexity increases. In Hard Suite, top models achieved only 56.21% (Gemini-2.5-pro), 27.16% (Llama-3.3-70B), and 33.37% (Qwen2.5-72B) accuracy. Performance drops significantly without tool usage.
Conclusion: BeyondBench provides a contamination-resistant evaluation framework for genuine reasoning assessment through algorithmic problem generation, revealing significant limitations in current language models’ reasoning capabilities as problem complexity increases.
Abstract: Evaluating language models fairly is increasingly difficult as static benchmarks risk contamination by training data, obscuring whether models truly reason or recall. We introduce BeyondBench, an evaluation framework using algorithmic problem generation to create mathematically grounded problems on the fly, ensuring each test remains uncontaminated. Our framework covers 44 algorithmic tasks with 117 variations across three difficulty levels: the Easy Suite (29 tasks) for arithmetic and statistics, the Medium Suite (5 tasks, 49 variations) for sequence patterns and reasoning, and the Hard Suite (10 tasks, 68 variations) for NP-complete and constraint satisfaction problems. Each task draws from a space exceeding 10^15 unique instances, with deterministically verified solutions. We evaluated 101 language models (85 open-source, 16 closed-source), spanning 0.5B to 141B parameters and multiple quantization schemes, using three-fold evaluation for robustness. Results reveal consistent reasoning deficiencies, with performance degrading sharply as complexity increases. In Hard Suite evaluations, Gemini-2.5-pro, Llama-3.3-70B, and Qwen2.5-72B achieved accuracies of 56.21%, 27.16%, and 33.37% respectively. Performance drops significantly without tool usage, with GPT-5, GPT-5-mini, and GPT-5-nano showing declines of 16.81%, 15.86%, and 43.95% in overall accuracy. Contamination resistance rests on three guarantees: (i) the problem space vastly exceeds any static dataset, (ii) every instance has a deterministically verifiable solution, and (iii) isomorphic transformations yield semantically equivalent but syntactically novel problems. BeyondBench redefines reasoning evaluation via genuine algorithmic problem-solving. Our leaderboard is at https://ctrl-gaurav.github.io/BeyondBench/, Python package at https://pypi.org/project/beyondbench/, and codebase at https://github.com/ctrl-gaurav/BeyondBench.
[95] Pretraining Large Language Models with NVFP4
NVIDIA, Felix Abecassis, Anjulie Agrusa, Dong Ahn, Jonah Alben, Stefania Alborghetti, Michael Andersch, Sivakumar Arayandi, Alexis Bjorlin, Aaron Blakeman, Evan Briones, Ian Buck, Bryan Catanzaro, Muya Chang, Jinhang Choi, Mike Chrzanowski, Eric Chung, Victor Cui, Steve Dai, Bita Darvish Rouhani, Carlo del Mundo, Deena Donia, Burc Eryilmaz, Henry Estela, Abhinav Goel, Oleg Goncharov, Yugi Guvvala, Robert Hesse, Russell Hewett, Herbert Hum, Ujval Kapasi, Brucek Khailany, Mikail Khona, Nick Knight, Alex Kondratenko, Ronny Krashinsky, Ben Lanir, Simon Layton, Michael Lightstone, Daniel Lo, Paulius Micikevicius, Asit Mishra, Tim Moon, Deepak Narayanan, Chao Ni, Abhijit Paithankar, Satish Pasumarthi, Ankit Patel, Mostofa Patwary, Ashwin Poojary, Gargi Prasad, Sweta Priyadarshi, Yigong Qin, Xiaowei Ren, Oleg Rybakov, Charbel Sakr, Sanjeev Satheesh, Stas Sergienko, Pasha Shamis, Kirthi Shankar, Nishant Sharma, Mohammad Shoeybi, Michael Siu, Misha Smelyanskiy, Darko Stosic, Dusan Stosic, Bor-Yiing Su, Frank Sun, Nima Tajbakhsh, Shelby Thomas, Przemek Tredak, Evgeny Tsykunov, Gandhi Vaithilingam, Aditya Vavre, Rangharajan Venkatesan, Roger Waleffe, Qiyu Wan, Hexin Wang, Mengdi Wang, Lizzie Wei, Hao Wu, Evan Wu, Keith Wyss, Ning Xu, Jinze Xue, Charlene Yang, Yujia Zhai, Ruoxi Zhang, Jingyang Zhu, Zhongbo Zhu
Main category: cs.CL
TL;DR: Novel NVFP4 training method enables stable 4-bit precision LLM pretraining with comparable performance to FP8 baseline, demonstrated on 12B parameter model trained on 10T tokens.
Details
Motivation: Current LLM training requires massive computational resources (tens to hundreds of yottaflops). While FP8 training is now common, moving to even narrower 4-bit precision (FP4) could significantly improve computational efficiency and resource utilization, but faces challenges with training stability, convergence, and implementation for large-scale models.Method: Introduces NVFP4 format training with: 1) Random Hadamard transforms to bound block-level outliers, 2) Two-dimensional quantization scheme for consistent forward/backward pass representations, 3) Stochastic rounding for unbiased gradient estimation, 4) Selective high-precision layers.
Result: Successfully trained a 12-billion-parameter model on 10 trillion tokens - the longest publicly documented 4-bit precision training run. Achieved training loss and downstream task accuracies comparable to FP8 baseline.
Conclusion: NVFP4 with the proposed training approach represents a major advancement in narrow-precision LLM training algorithms, enabling efficient 4-bit training without sacrificing model quality.
Abstract: Large Language Models (LLMs) today are powerful problem solvers across many domains, and they continue to get stronger as they scale in model size, training set size, and training set quality, as shown by extensive research and experimentation across the industry. Training a frontier model today requires on the order of tens to hundreds of yottaflops, which is a massive investment of time, compute, and energy. Improving pretraining efficiency is therefore essential to enable the next generation of even more capable LLMs. While 8-bit floating point (FP8) training is now widely adopted, transitioning to even narrower precision, such as 4-bit floating point (FP4), could unlock additional improvements in computational speed and resource utilization. However, quantization at this level poses challenges to training stability, convergence, and implementation, notably for large-scale models trained on long token horizons. In this study, we introduce a novel approach for stable and accurate training of large language models (LLMs) using the NVFP4 format. Our method integrates Random Hadamard transforms (RHT) to bound block-level outliers, employs a two-dimensional quantization scheme for consistent representations across both the forward and backward passes, utilizes stochastic rounding for unbiased gradient estimation, and incorporates selective high-precision layers. We validate our approach by training a 12-billion-parameter model on 10 trillion tokens – the longest publicly documented training run in 4-bit precision to date. Our results show that the model trained with our NVFP4-based pretraining technique achieves training loss and downstream task accuracies comparable to an FP8 baseline. These findings highlight that NVFP4, when combined with our training approach, represents a major step forward in narrow-precision LLM training algorithms.
[96] PrefDisco: Benchmarking Proactive Personalized Reasoning
Shuyue Stella Li, Avinandan Bose, Faeze Brahman, Simon Shaolei Du, Pang Wei Koh, Maryam Fazel, Yulia Tsvetkov
Main category: cs.CL
TL;DR: PrefDisco: A framework for evaluating LLMs on personalized reasoning tasks where models must adapt responses to individual user preferences through strategic questioning and reasoning.
Details
Motivation: Current LLMs treat task-solving and preference-alignment as separate challenges, but in human-facing applications, correct answers are insufficient if they don't match user needs. The problem is especially acute in cold-start scenarios where no prior user history exists.Method: PrefDisco transforms static benchmarks into interactive personalization tasks using psychologically-grounded personas with sparse, context-dependent preferences. It defines PrefAlign as a fine-grained rubric-based metric for measuring preference alignment, creating scenarios where identical questions require different reasoning chains depending on user context.
Result: Evaluation of 21 frontier models across 10 tasks reveals 29.0% of naive personalization attempts produce worse preference alignment than generic responses, yet generic responses also fail to serve individual user needs.
Conclusion: Personalized reasoning requires dedicated development rather than emerging naturally. PrefDisco provides a foundation for developing systems that can adapt to individual users in domains like education, healthcare, and technical fields where personalization is critical.
Abstract: Current large language model (LLM) development treats task-solving and preference-alignment as separate challenges, optimizing first for objective correctness, then for alignment to aggregated human preferences. This paradigm fails in human-facing applications where solving a problem correctly is insufficient if the response mismatches the user’s needs. This challenge intensifies in just-in-time scenarios where no prior user interaction history exists due to cold-start conditions or privacy constraints. LLMs need to proactively identify what they don’t know about the user, strategically elicit preference values through questioning, then adapt their reasoning processes and responses accordingly – a complicated chain of cognitive processes which we term personalized reasoning. We introduce PrefDisco, an evaluation methodology that transforms static benchmarks into interactive personalization tasks using psychologically-grounded personas with sparse, context-dependent preferences, and define PrefAlign as a fine-grained rubric-based metric for measuring preference alignment. PrefDisco builds scenarios where identical questions require different reasoning chains depending on user context, as optimal explanation approaches vary by individual expertise and preferences while maintaining factual accuracy. Evaluation of 21 frontier models across 10 tasks reveals 29.0% of naive personalization attempts produce worse preference alignment than generic responses, yet generic responses also fail to serve individual user needs. These findings suggest personalized reasoning requires dedicated development rather than emerging naturally. PrefDisco provides a foundation for developing systems that can adapt to individual users in education, healthcare, and technical domains where personalization is critical.
[97] Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs
Yurun Chen, Xavier Hu, Yuhan Liu, Ziqi Wang, Zeyi Liao, Lin Chen, Feng Wei, Yuxi Qian, Bo Zheng, Keting Yin, Shengyu Zhang
Main category: cs.CL
TL;DR: Graph2Eval: A knowledge-graph-driven framework for automated generation of multimodal agent tasks with improved semantic consistency and solvability
Details
Motivation: Traditional static datasets are insufficient for evaluating increasingly autonomous multimodal LLM-driven agents. Existing LLM-based task generation methods suffer from hallucinations and lack of data relationship modeling, leading to semantic inconsistencies and solvability issues.Method: Leverages knowledge graphs from heterogeneous data sources as structured task space, generates multimodal agent tasks through subgraph sampling and task construction guided by templates and meta-path strategies. Includes multi-stage filtering pipeline with node reachability analysis, LLM scoring, and similarity analysis to ensure task reliability.
Result: Graph2Eval improves task semantic consistency by 20% and solvability by 17% over baselines. Graph2Eval-Bench dataset contains 1,319 tasks spanning document understanding and web interaction scenarios, effectively distinguishing agent performance.
Conclusion: Graph2Eval provides a scalable, semantically grounded framework for generating reliable multimodal agent evaluation tasks, offering a new perspective on agent evaluation beyond traditional static datasets.
Abstract: As multimodal LLM-driven agents advance in autonomy and generalization, traditional static datasets face inherent scalability limitations and are insufficient for fully assessing their capabilities in increasingly complex and diverse tasks. Existing studies have attempted to generate agent tasks using LLMs, but due to the inherent hallucinations of LLMs and the lack of internal data relationship modeling, these tasks often exhibit semantic inconsistencies and solvability issues. To address these challenges, we introduce Graph2Eval, a knowledge-graph-driven framework for automated, scalable, and semantically grounded agent task generation. At its core, Graph2Eval leverages a knowledge graph built from heterogeneous external data sources as a structured task space, generating multimodal agent tasks through subgraph sampling and task construction guided by task templates and meta-path strategies. To further ensure task reliability, a multi-stage filtering pipeline based on node reachability analysis, LLM scoring, and similarity analysis ensures the diversity and solvability of the generated tasks. By unifying both RAG Agent and Web Agent scenarios, Graph2Eval enables efficient generation of multimodal document understanding tasks and multi-step web interaction tasks. We instantiate the framework with Graph2Eval-Bench, a curated dataset of 1,319 tasks spanning document understanding and web interaction scenarios. Extensive experiments show that, on average, Graph2Eval improves task semantic consistency by 20% and solvability by 17% over baselines, while Graph2Eval-Bench effectively distinguishes agent performance, offering a new perspective on agent evaluation.
[98] Detecting Hallucinations in Authentic LLM-Human Interactions
Yujie Ren, Niklas Gruhlke, Anne Lauscher
Main category: cs.CL
TL;DR: AuthenHallu is the first hallucination detection benchmark built from authentic LLM-human dialogues, addressing limitations of artificially constructed benchmarks.
Details
Motivation: Existing hallucination detection benchmarks are artificially constructed through deliberate hallucination induction or simulated interactions, failing to capture real-world hallucination characteristics in genuine LLM-human dialogues.Method: Created AuthenHallu by selecting and annotating samples from genuine LLM-human dialogues, providing a faithful reflection of real-world LLM hallucinations. Statistical analysis reveals hallucination rates across different domains.
Result: Hallucinations occur in 31.4% of query-response pairs overall, increasing to 60.0% in challenging domains like Math & Number Problems. Vanilla LLMs show promise as hallucination detectors but current performance remains insufficient for real-world scenarios.
Conclusion: AuthenHallu provides a more realistic benchmark for hallucination detection research, highlighting the need for improved detection methods and revealing domain-specific challenges in LLM reliability.
Abstract: As large language models (LLMs) are increasingly applied in sensitive domains such as medicine and law, hallucination detection has become a critical task. Although numerous benchmarks have been proposed to advance research in this area, most of them are artificially constructed–either through deliberate hallucination induction or simulated interactions–rather than derived from genuine LLM-human dialogues. Consequently, these benchmarks fail to fully capture the characteristics of hallucinations that occur in real-world usage. To address this limitation, we introduce AuthenHallu, the first hallucination detection benchmark built entirely from authentic LLM-human interactions. For AuthenHallu, we select and annotate samples from genuine LLM-human dialogues, thereby providing a faithful reflection of how LLMs hallucinate in everyday user interactions. Statistical analysis shows that hallucinations occur in 31.4% of the query-response pairs in our benchmark, and this proportion increases dramatically to 60.0% in challenging domains such as Math & Number Problems. Furthermore, we explore the potential of using vanilla LLMs themselves as hallucination detectors and find that, despite some promise, their current performance remains insufficient in real-world scenarios. The data and code are publicly available at https://github.com/TAI-HAMBURG/AuthenHallu.
[99] Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences
Julian Minder, Clément Dumas, Stewart Slocum, Helena Casademunt, Cameron Holmes, Robert West, Neel Nanda
Main category: cs.CL
TL;DR: Analyzing activation biases in narrowly finetuned LLMs reveals strong traces of training data that can be used to understand finetuning domains and steer model behavior.
Details
Motivation: To understand how narrow finetuning creates biases in LLM activations and whether these biases can reveal information about the finetuning domain, which has implications for AI safety, interpretability, and realistic model evaluation.Method: Using model diffing techniques to analyze activation differences between base and finetuned models, particularly on first few tokens of random text, and steering by adding these differences to activations. Testing across synthetic document finetuning, emergent misalignment, subliminal learning, and taboo word guessing games across different architectures (Gemma, LLaMA, Qwen) and scales (1B-32B).
Result: Narrow finetuning creates strong, interpretable biases in activations that reveal training domain characteristics. An LLM-based interpretability agent using these biases significantly outperforms baseline prompting. Mixing pretraining data reduces biases but residual risks remain.
Conclusion: Narrowly finetuned models contain salient traces of their training in activations, suggesting training improvements needed. Using such models as proxies for broader finetuning studies may be unrealistic, highlighting need for deeper investigation into narrow finetuning effects.
Abstract: Finetuning on narrow domains has become an essential tool to adapt Large Language Models (LLMs) to specific tasks and to create models with known unusual properties that are useful for research. We show that narrow finetuning creates strong biases in LLM activations that can be interpreted to understand the finetuning domain. These biases can be discovered using simple tools from model diffing - the study of differences between models before and after finetuning. In particular, analyzing activation differences on the first few tokens of random text and steering by adding this difference to the model activations produces text similar to the format and general content of the finetuning data. We demonstrate that these analyses contain crucial information by creating an LLM-based interpretability agent to understand the finetuning domain. With access to the bias, the agent performs significantly better compared to baseline agents using simple prompting. Our analysis spans synthetic document finetuning for false facts, emergent misalignment, subliminal learning, and taboo word guessing game models across different architectures (Gemma, LLaMA, Qwen) and scales (1B to 32B parameters). We suspect these biases reflect overfitting and find that mixing pretraining data into the finetuning corpus largely removes them, though residual risks may remain. Our work (1) demonstrates that narrowly finetuned models have salient traces of their training objective in their activations and suggests ways to improve how they are trained, (2) warns AI safety and interpretability researchers that the common practice of using such models as a proxy for studying broader finetuning (e.g., chat-tuning) might not be realistic, and (3) highlights the need for deeper investigation into the effects of narrow finetuning and development of truly realistic case studies for model-diffing, safety and interpretability research.
[100] EchoMind: An Interrelated Multi-level Benchmark for Evaluating Empathetic Speech Language Models
Li Zhou, Lutong Yu, You Lyu, Yihang Lin, Zefeng Zhao, Junyi Ao, Yuhao Zhang, Benyou Wang, Haizhou Li
Main category: cs.CL
TL;DR: EchoMind is a new benchmark for evaluating Speech Language Models’ ability to integrate linguistic content with vocal cues for empathetic dialogue, revealing current models struggle with high-expressive vocal cues and effective empathy integration.
Details
Motivation: Current SLM benchmarks evaluate linguistic, acoustic, reasoning, or dialogue abilities in isolation, missing the integrated skills needed for human-like, emotionally intelligent conversation. There's a need to assess whether SLMs can perceive non-lexical vocal cues alongside spoken words and respond with contextually appropriate empathy.Method: Created EchoMind benchmark with sequential, context-linked tasks: spoken-content understanding, vocal-cue perception, integrated reasoning, and response generation. Uses identical semantically neutral scripts with controlled vocal style variations to isolate delivery effects. Grounded in empathy framework with 3 coarse and 12 fine-grained dimensions covering 39 vocal attributes, evaluated with objective and subjective metrics.
Result: Testing 12 advanced SLMs shows state-of-the-art models struggle with high-expressive vocal cues, limiting empathetic response quality. Analysis reveals weaknesses in instruction-following, resilience to natural speech variability, and effective use of vocal cues for empathy.
Conclusion: Current SLMs lack the ability to integrate linguistic content with diverse vocal cues for truly empathetic conversational ability. EchoMind provides a comprehensive benchmark to drive development of more emotionally intelligent speech models.
Abstract: Speech Language Models (SLMs) have made significant progress in spoken language understanding. Yet it remains unclear whether they can fully perceive non lexical vocal cues alongside spoken words, and respond with empathy that aligns with both emotional and contextual factors. Existing benchmarks typically evaluate linguistic, acoustic, reasoning, or dialogue abilities in isolation, overlooking the integration of these skills that is crucial for human-like, emotionally intelligent conversation. We present EchoMind, the first interrelated, multi-level benchmark that simulates the cognitive process of empathetic dialogue through sequential, context-linked tasks: spoken-content understanding, vocal-cue perception, integrated reasoning, and response generation. All tasks share identical and semantically neutral scripts that are free of explicit emotional or contextual cues, and controlled variations in vocal style are used to test the effect of delivery independent of the transcript. EchoMind is grounded in an empathy-oriented framework spanning 3 coarse and 12 fine-grained dimensions, encompassing 39 vocal attributes, and evaluated using both objective and subjective metrics. Testing 12 advanced SLMs reveals that even state-of-the-art models struggle with high-expressive vocal cues, limiting empathetic response quality. Analyses of prompt strength, speech source, and ideal vocal cue recognition reveal persistent weaknesses in instruction-following, resilience to natural speech variability, and effective use of vocal cues for empathy. These results underscore the need for SLMs that integrate linguistic content with diverse vocal cues to achieve truly empathetic conversational ability.
[101] Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts
Seyoung Song, Nawon Kim, Songeun Chae, Kiwoong Park, Jiho Jin, Haneul Yoo, Kyunghyun Cho, Alice Oh
Main category: cs.CL
TL;DR: The Open Korean Historical Corpus is a large-scale, openly licensed dataset spanning 1,300 years of Korean language evolution, enabling quantitative analysis of linguistic shifts and serving as a resource for LLM pre-training.
Details
Motivation: Korean language history features a discrepancy between spoken and written forms and a shift from Chinese characters to Hangul, but this evolution has been unexplored in NLP due to lack of accessible historical corpora.Method: Created the Open Korean Historical Corpus containing 17.7 million documents and 5.1 billion tokens from 19 sources spanning 7th century to 2025, covering 6 languages and underrepresented writing systems like Idu and Hanja-Hangul mixed script.
Result: Quantitative analysis revealed: (1) Idu usage peaked in 1860s then declined sharply; (2) Hanja to Hangul transition was rapid starting around 1890; (3) North Korea’s lexical divergence causes modern tokenizers to produce up to 51x higher OOV rates.
Conclusion: The corpus provides foundational resource for quantitative diachronic analysis of Korean language history and can serve as pre-training corpus for LLMs to improve understanding of Sino-Korean vocabulary and archaic writing systems.
Abstract: The history of the Korean language is characterized by a discrepancy between its spoken and written forms and a pivotal shift from Chinese characters to the Hangul alphabet. However, this linguistic evolution has remained largely unexplored in NLP due to a lack of accessible historical corpora. To address this gap, we introduce the Open Korean Historical Corpus, a large-scale, openly licensed dataset spanning 1,300 years and 6 languages, as well as under-represented writing systems like Korean-style Sinitic (Idu) and Hanja-Hangul mixed script. This corpus contains 17.7 million documents and 5.1 billion tokens from 19 sources, ranging from the 7th century to 2025. We leverage this resource to quantitatively analyze major linguistic shifts: (1) Idu usage peaked in the 1860s before declining sharply; (2) the transition from Hanja to Hangul was a rapid transformation starting around 1890; and (3) North Korea’s lexical divergence causes modern tokenizers to produce up to 51 times higher out-of-vocabulary rates. This work provides a foundational resource for quantitative diachronic analysis by capturing the history of the Korean language. Moreover, it can serve as a pre-training corpus for large language models, potentially improving their understanding of Sino-Korean vocabulary in modern Hangul as well as archaic writing systems.
[102] Steering Awareness: Models Can Be Trained to Detect Activation Steering
Joshua Fonseca Rivera, David Demitri Africa
Main category: cs.CL
TL;DR: Models can be trained to detect when activation steering vectors are injected into their residual streams, challenging the assumption that such interventions are undetectable.
Details
Motivation: The paper challenges the common assumption in activation steering research that language models cannot detect when steering vectors are injected into their residual streams, which has implications for the reliability of steering-based safety evaluations and interpretability techniques.Method: Fine-tuned seven open-source instruction-tuned models to detect whether steering vectors were injected and identify which concepts were injected. Tested detection transfer across different vector extraction methods and analyzed the mechanistic basis of steering awareness.
Result: Best model achieved 95.5% detection accuracy on held-out concepts and 71.2% concept identification, with no false positives on clean controls. Detection transfers to vectors aligned with contrastive activation addition but fails for geometrically dissimilar methods. Detection-trained models are more susceptible to steering in realistic settings.
Conclusion: Activation steering cannot be assumed to be undetectable, with significant implications for the reliability of steering-based safety evaluations and interpretability techniques. Models can learn to detect and identify steering interventions through distributed transformations.
Abstract: Activation steering - adding a vector to a language model’s residual stream - is widely used to elicit latent behaviors and to probe safety-relevant properties. Many steering-based evaluations implicitly assume that the model cannot tell when such an intervention has occurred. We test this assumption by fine-tuning models to report (i) whether a steering vector was injected and (ii) which concept was injected, a capability we call steering awareness. Across seven open-source instruction-tuned models, the best achieves 95.5% detection on held-out concepts and 71.2% concept identification, with no false positives on our clean controls. We find that such detection transfers to novel vectors extracted by methods that produce directions aligned with contrastive activation addition, but fail for geometrically dissimilar methods. Crucially, detection does not confer behavioral robustness; detection-trained models are consistently more susceptible to steering in realistic settings than their base counterparts. Mechanistically, steering awareness arises from a distributed transformation that progressively rotates diverse injected vectors into a shared detection direction. These findings suggest that activation steering cannot be assumed to remain an undetectable intervention, with implications for the long-term reliability of steering-based safety evaluations and interpretability techniques more broadly.
[103] Beyond Prefixes: Graph-as-Memory Cross-Attention for Knowledge Graph Completion with Large Language Models
Ruitong Liu, Boxu Lin, Peize Li, Siyuan Li, Yunjia Wu, Te Sun, Chaohan Wu
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2510.08966: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.08966&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[104] Think-While-Generating: On-the-Fly Reasoning for Personalized Long-Form Generation
Chengbing Wang, Yang Zhang, Wenjie Wang, Xiaoyan Zhao, Fuli Feng, Xiangnan He, Tat-Seng Chua
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access limitationsMethod: Unable to determine method due to access limitations
Result: Unable to determine results due to access limitations
Conclusion: Unable to determine conclusion due to access limitations
Abstract: Failed to fetch summary for 2512.06690: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.06690&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[105] ReFusion: A Diffusion Large Language Model with Parallel Autoregressive Decoding
Jia-Nan Li, Jian Guan, Wei Wu, Chongxuan Li
Main category: cs.CL
TL;DR: Unable to analyze paper 2512.13586 due to HTTP 429 error when fetching summary from arXiv API
Details
Motivation: Cannot determine motivation due to inability to access paper contentMethod: Cannot determine method due to inability to access paper content
Result: Cannot determine results due to inability to access paper content
Conclusion: Cannot determine conclusion due to inability to access paper content
Abstract: Failed to fetch summary for 2512.13586: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.13586&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[106] MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers
Xuanjun Zong, Zhiqi Shen, Lei Wang, Yunshi Lan, Chao Yang
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2512.15163: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.15163&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[107] From Word to World: Can Large Language Models be Implicit Text-based World Models?
Yixia Li, Hongru Wang, Jiahao Qiu, Zhenfei Yin, Dongdong Zhang, Cheng Qian, Zeping Li, Pony Ma, Guanhua Chen, Heng Ji
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2512.18832: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.18832&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[108] Parallel Token Prediction for Language Models
Felix Draxler, Justus Will, Farrin Marouf Sofian, Theofanis Karaletsos, Sameer Singh, Stephan Mandt
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2512.21323: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.21323&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[109] When Do Tools and Planning Help Large Language Models Think? A Cost- and Latency-Aware Benchmark
Subha Ghoshal, Ali Al-Bustami
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2601.02663: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.02663&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[110] Identifying Good and Bad Neurons for Task-Level Controllable LLMs
Wenjie Li, Guansong Pang, Hezhe Qiao, Debin Gao, David Lo
Main category: cs.CL
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Paper analysis not possible due to technical limitations in accessing the content
Abstract: Failed to fetch summary for 2601.04548: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.04548&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[111] F-Actor: Controllable Conversational Behaviour in Full-Duplex Models
Maike Züfle, Ondrej Klejch, Nicholas Sanders, Jan Niehues, Alexandra Birch, Tsz Kin Lam
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2601.11329: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.11329&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[112] The unreasonable effectiveness of pattern matching
Gary Lupyan, Blaise Agüera y Arcas
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2601.11432: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.11432&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[113] FreeAct: Freeing Activations for LLM Quantization
Xiaohao Liu, Xiaobo Xia, Manyi Zhang, Ji-Fu Li, Xianzhi Yu, Fei Shen, Xiu Su, See-Kiong Ng, Tat-Seng Chua
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2603.01776: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01776&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[114] Bielik-Q2-Sharp: A Comparative Study of Extreme 2-bit Quantization Methods for a Polish 11B Language Model
Jakub Prejzner
Main category: cs.CL
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2603.04162: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.04162&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[115] The Convergence of Schema-Guided Dialogue Systems and the Model Context Protocol
Andreas Schlapbach
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2602.18764 could not be retrieved from arXiv API.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot draw conclusions without access to the paper content.
Abstract: Failed to fetch summary for 2602.18764: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.18764&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[116] AgentIR: Reasoning-Aware Retrieval for Deep Research Agents
Zijian Chen, Xueguang Ma, Shengyao Zhuang, Jimmy Lin, Akari Asai, Victor Zhong
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2603.04384: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.04384&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[117] Vector Retrieval with Similarity and Diversity: How Hard Is It?
Hang Gao, Dong Deng, Yongfeng Zhang
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2407.04573 could not be retrieved from arXiv API.
Details
Motivation: Cannot determine motivation as the paper content is unavailable due to API rate limiting.Method: Cannot determine method as the paper content is unavailable.
Result: Cannot determine results as the paper content is unavailable.
Conclusion: Cannot draw conclusions about the paper without access to its content.
Abstract: Failed to fetch summary for 2407.04573: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.04573&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[118] Learning Virtual Machine Scheduling in Cloud Computing through Language Agents
JieHao Wu, Ziwei Wang, Junjie Sheng, Wenhao Li, Xiangfeng Wang, Jun Luo
Main category: cs.CL
TL;DR: Failed to fetch summary for paper 2505.10117 due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error in accessing paper contentMethod: Unable to determine method due to technical error in accessing paper content
Result: Unable to determine results due to technical error in accessing paper content
Conclusion: Unable to draw conclusions due to technical error in accessing paper content
Abstract: Failed to fetch summary for 2505.10117: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.10117&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[119] Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, Zhuochen Wang, Zhaoxiang Zhang
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to missing paper contentMethod: Unable to determine method due to missing paper content
Result: Unable to determine results due to missing paper content
Conclusion: Unable to draw conclusions due to missing paper content
Abstract: Failed to fetch summary for 2507.07999: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.07999&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[120] EDINET-Bench: Evaluating LLMs on Complex Financial Tasks using Japanese Financial Statements
Issa Sugiura, Takashi Ishida, Taro Makino, Chieko Tazuke, Takanori Nakagawa, Kosuke Nakago, David Ha
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2506.08762: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.08762&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[121] Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
Haochen Wang, Yuhao Wang, Tao Zhang, Yikang Zhou, Yanwei Li, Jiacong Wang, Jiani Zheng, Ye Tian, Jiahao Meng, Zilong Huang, Guangcan Mai, Anran Wang, Yunhai Tong, Zhuochen Wang, Xiangtai Li, Zhaoxiang Zhang
Main category: cs.CL
TL;DR: Paper analysis unavailable due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Unable to determine motivation as paper content could not be retrieved due to API rate limitingMethod: No method information available due to failed API request
Result: No results available - the request to arXiv API resulted in HTTP 429 (Too Many Requests) error
Conclusion: Cannot analyze paper due to technical limitations in accessing the content
Abstract: Failed to fetch summary for 2510.18876: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.18876&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[122] RePo: Language Models with Context Re-Positioning
Huayang Li, Tianyu Zhao, Deng Cai, Richard Sproat
Main category: cs.CL
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2512.14391: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.14391&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[123] Yuan3.0 Ultra: A Trillion-Parameter Enterprise-Oriented MoE LLM
YuanLab.ai, Shawn Wu, Jiangang Luo, Darcy Chen, Sean Wang, Louie Li, Allen Wang, Xudong Zhao, Tong Yu, Bach Li, Joseph Shen, Gawain Ma, Jasper Jia, Marcus Mao, Claire Wang, Hunter He, Carol Wang, Zera Zhang, Jason Wang, Chonly Shen, Leo Zhang, Logan Chen, Qasim Meng, James Gong, Daniel Zhao, Penn Zheng, Owen Zhu
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2601.14327: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.14327&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[124] Where is the multimodal goal post? On the Ability of Foundation Models to Recognize Contextually Important Moments
Aditya K Surikuchi, Raquel Fernández, Sandro Pezzelle
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper retrievalMethod: Unable to determine method due to failed paper retrieval
Result: Unable to determine results due to failed paper retrieval
Conclusion: Unable to analyze paper due to technical retrieval error
Abstract: Failed to fetch summary for 2601.16333: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.16333&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[125] Adaptive Rollout Allocation for Online Reinforcement Learning with Verifiable Rewards
Hieu Trung Nguyen, Bao Nguyen, Wenao Ma, Yuzhi Zhao, Ruifeng She, Viet Anh Nguyen
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2602.01601: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.01601&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[126] LatentChem: From Textual CoT to Latent Thinking in Chemical Reasoning
Xinwu Ye, Yicheng Mao, Jia Zhang, Yimeng Liu, Li Hao, Fang Wu, Zhiwei Li, Yuxuan Liao, Zehong Wang, Zhiyuan Liu, Zhenfei Yin, Li Yuan, Philip Torr, Huan Sun, Xiangxiang Zeng, Mengdi Wang, Le Cong, Shenghua Gao, Xiangru Tang
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API requestMethod: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to determine conclusion due to failed API request
Abstract: Failed to fetch summary for 2602.07075: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.07075&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[127] Jailbreak Foundry: From Papers to Runnable Attacks for Reproducible Benchmarking
Zhicheng Fang, Jingjie Zheng, Chenxu Fu, Wei Xu
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2602.24009: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.24009&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[128] Learn Hard Problems During RL with Reference Guided Fine-tuning
Yangzhen Wu, Shanda Li, Zixin Wen, Xin Zhou, Ameet Talwalkar, Yiming Yang, Wenhao Huang, Tianle Cai
Main category: cs.CL
TL;DR: Paper 2603.01223: Unable to fetch abstract due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as abstract is unavailable due to arXiv rate limitingMethod: Method unknown - paper content inaccessible due to HTTP 429 error
Result: Results unknown - unable to retrieve paper information
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2603.01223: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01223&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[129] Incremental Graph Construction Enables Robust Spectral Clustering of Texts
Marko Pranjić, Boshko Koloski, Nada Lavrač, Senja Pollak, Marko Robnik-Šikonja
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper detailsMethod: Unknown - paper content not accessible due to HTTP 429 error from arXiv API
Result: No results available - arXiv API rate limiting prevented access to paper information
Conclusion: Cannot analyze paper due to technical limitations with arXiv API access
Abstract: Failed to fetch summary for 2603.03056: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.03056&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[130] Why Are Linear RNNs More Parallelizable?
William Merrill, Hongjian Jiang, Yanhong Li, Anthony Lin, Ashish Sabharwal
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2603.03612: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.03612&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.CV
[131] DeformTrace: A Deformable State Space Model with Relay Tokens for Temporal Forgery Localization
Xiaodong Zhu, Suting Wang, Yuanming Zheng, Junqi Yang, Yangxu Liao, Yuhong Yang, Weiping Tu, Zhongyuan Wang
Main category: cs.CV
TL;DR: DeformTrace enhances State Space Models with deformable dynamics and relay mechanisms for precise temporal forgery localization in video and audio, achieving SOTA performance with efficiency.
Details
Motivation: Temporal Forgery Localization (TFL) needs precise identification of manipulated segments in video/audio for security/forensics. Current SSMs have limitations with ambiguous boundaries, sparse forgeries, and long-range modeling in TFL applications.Method: DeformTrace introduces: 1) Deformable Self-SSM (DS-SSM) with dynamic receptive fields for precise localization, 2) Relay Token Mechanism to enhance temporal reasoning and mitigate long-range decay, 3) Deformable Cross-SSM (DC-SSM) that partitions global state space into query-specific subspaces to reduce non-forgery information accumulation. These are integrated into a hybrid Transformer-SSM architecture.
Result: Extensive experiments show DeformTrace achieves state-of-the-art performance with fewer parameters, faster inference, and stronger robustness compared to existing methods.
Conclusion: DeformTrace successfully addresses TFL challenges by enhancing SSMs with deformable dynamics and relay mechanisms, providing an efficient and effective solution for temporal forgery localization in multimodal content.
Abstract: Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments in video and audio, offering strong interpretability for security and forensics. While recent State Space Models (SSMs) show promise in precise temporal reasoning, their use in TFL is hindered by ambiguous boundaries, sparse forgeries, and limited long-range modeling. We propose DeformTrace, which enhances SSMs with deformable dynamics and relay mechanisms to address these challenges. Specifically, Deformable Self-SSM (DS-SSM) introduces dynamic receptive fields into SSMs for precise temporal localization. To further enhance its capacity for temporal reasoning and mitigate long-range decay, a Relay Token Mechanism is integrated into DS-SSM. Besides, Deformable Cross-SSM (DC-SSM) partitions the global state space into query-specific subspaces, reducing non-forgery information accumulation and boosting sensitivity to sparse forgeries. These components are integrated into a hybrid architecture that combines the global modeling of Transformers with the efficiency of SSMs. Extensive experiments show that DeformTrace achieves state-of-the-art performance with fewer parameters, faster inference, and stronger robustness.
[132] EasyAnimate: High-Performance Video Generation Framework with Hybrid Windows Attention and Reward Backpropagation
Jiaqi Xu, Kunzhe Huang, Xinyi Zou, Yunkuo Chen, Bo Liu, MengLi Cheng, Jun Huang, Xing Shi
Main category: cs.CV
TL;DR: EasyAnimate is an efficient video generation framework using diffusion transformers with Hybrid Window Attention for better 3D receptive capabilities and computational efficiency, plus reward backpropagation for human preference alignment.
Details
Motivation: Existing video diffusion models suffer from slow generation speeds and suboptimal video quality, needing improvements in both training/inference efficiency and output quality.Method: Proposes Hybrid Window Attention with multidirectional sliding window attention for better 3D receptive capabilities; uses reward backpropagation for human preference alignment; introduces Training with Token Length strategy for uneven GPU utilization; employs multimodal large language model as text encoder.
Result: Achieves state-of-the-art performance on both VBench leaderboard and human evaluation, with significant enhancements in computational efficiency and model performance.
Conclusion: EasyAnimate provides an efficient, high-quality video generation framework that addresses key limitations of existing video diffusion models through architectural innovations and training optimizations.
Abstract: This paper introduces EasyAnimate, an efficient and high quality video generation framework that leverages diffusion transformers to achieve high-quality video production, encompassing data processing, model training, and end-to-end inference. Despite substantial advancements achieved by video diffusion models, existing video generation models still struggles with slow generation speeds and less-than-ideal video quality. To improve training and inference efficiency without compromising performance, we propose Hybrid Window Attention. We design the multidirectional sliding window attention in Hybrid Window Attention, which provides stronger receptive capabilities in 3D dimensions compared to naive one, while reducing the model’s computational complexity as the video sequence length increases. To enhance video generation quality, we optimize EasyAnimate using reward backpropagation to better align with human preferences. As a post-training method, it greatly enhances the model’s performance while ensuring efficiency. In addition to the aforementioned improvements, EasyAnimate integrates a series of further refinements that significantly improve both computational efficiency and model performance. We introduce a new training strategy called Training with Token Length to resolve uneven GPU utilization in training videos of varying resolutions and lengths, thereby enhancing efficiency. Additionally, we use a multimodal large language model as the text encoder to improve text comprehension of the model. Experiments demonstrate significant enhancements resulting from the above improvements. The EasyAnimate achieves state-of-the-art performance on both the VBench leaderboard and human evaluation. Code and pre-trained models are available at https://github.com/aigc-apps/EasyAnimate.
[133] Lost in Translation: How Language Re-Aligns Vision for Cross-Species Pathology
Ekansh Arora
Main category: cs.CV
TL;DR: CPath-CLIP fine-tuning improves cancer detection but suffers from semantic collapse in cross-species transfer; Semantic Anchoring using language alignment addresses this by providing stable semantic coordinates.
Details
Motivation: To understand how foundation models behave in computational pathology under cross-cancer and cross-species transfer, and to address the limitations of standard vision-language alignment for cross-species generalization.Method: Fine-tuned CPath-CLIP on whole-slide image patches from canine and human histopathology, evaluated performance using AUC, analyzed embedding spaces with cosine similarity and Grad-CAM, and introduced Semantic Anchoring using language to stabilize visual features.
Result: Few-shot fine-tuning improved same-cancer (64.9% to 72.6% AUC) and cross-cancer performance (56.84% to 66.31% AUC). Cross-species transfer showed limitations, with Semantic Anchoring providing additional gains (8.52% same-cancer, 5.67% cross-cancer) by addressing semantic collapse.
Conclusion: Language acts as a control mechanism enabling semantic re-interpretation without retraining; standard vision-language alignment is suboptimal for cross-species generalization due to semantic collapse, which can be addressed through text alignment mechanisms.
Abstract: Foundation models are increasingly applied to computational pathology, yet their behavior under cross-cancer and cross-species transfer remains unspecified. This study investigated how fine-tuning CPath-CLIP affects cancer detection under same-cancer, cross-cancer, and cross-species conditions using whole-slide image patches from canine and human histopathology. Performance was measured using area under the receiver operating characteristic curve (AUC). Few-shot fine-tuning improved same-cancer (64.9% to 72.6% AUC) and cross-cancer performance (56.84% to 66.31% AUC). Cross-species evaluation revealed that while tissue matching enables meaningful transfer, performance remains below state-of-the-art benchmarks (H-optimus-0: 84.97% AUC), indicating that standard vision-language alignment is suboptimal for cross-species generalization. Embedding space analysis revealed extremely high cosine similarity (greater than 0.99) between tumor and normal prototypes. Grad-CAM shows prototype-based models remain domain-locked, while language-guided models attend to conserved tumor morphology. To address this, we introduce Semantic Anchoring, which uses language to provide a stable coordinate system for visual features. Ablation studies reveal that benefits stem from the text-alignment mechanism itself, regardless of text encoder complexity. Benchmarking against H-optimus-0 shows that CPath-CLIP’s failure stems from intrinsic embedding collapse, which text alignment effectively circumvents. Additional gains were observed in same-cancer (8.52%) and cross-cancer classification (5.67%). We identified a previously uncharacterized failure mode: semantic collapse driven by species-dominated alignment rather than missing visual information. These results demonstrate that language acts as a control mechanism, enabling semantic re-interpretation without retraining.
[134] Recognition of Daily Activities through Multi-Modal Deep Learning: A Video, Pose, and Object-Aware Approach for Ambient Assisted Living
Kooshan Hashemifard, Pau Climent-Pérez, Francisco Florez-Revuelta
Main category: cs.CV
TL;DR: Multi-modal activity recognition system for older adults in AAL settings combining 3D CNN visual features, 3D human pose data via Graph CNN, and contextual object detection with cross-attention fusion.
Details
Motivation: To develop robust activity recognition systems for Ambient Assisted Living (AAL) that can effectively monitor older adults' well-being and support independence, addressing challenges like intra-class variability, inter-class similarity, environmental variability, camera perspectives, and scene complexity.Method: Multi-modal approach integrating: 1) Visual information processed by 3D CNN, 2) 3D human pose data analyzed by Graph Convolutional Network, 3) Contextual information from object detection module fused with 3D CNN features using cross-attention mechanism.
Result: Achieves competitive classification accuracy for daily activities on the Toyota SmartHome dataset, demonstrating potential as essential component for advanced AAL monitoring solutions.
Conclusion: The proposed multi-modal system effectively recognizes daily activities for older adults in AAL settings, supporting development of intelligent systems that promote safety and autonomy among older adults.
Abstract: Recognition of daily activities is a critical element for effective Ambient Assisted Living (AAL) systems, particularly to monitor the well-being and support the independence of older adults in indoor environments. However, developing robust activity recognition systems faces significant challenges, including intra-class variability, inter-class similarity, environmental variability, camera perspectives, and scene complexity. This paper presents a multi-modal approach for the recognition of activities of daily living tailored for older adults within AAL settings. The proposed system integrates visual information processed by a 3D Convolutional Neural Network (CNN) with 3D human pose data analyzed by a Graph Convolutional Network. Contextual information, derived from an object detection module, is fused with the 3D CNN features using a cross-attention mechanism to enhance recognition accuracy. This method is evaluated using the Toyota SmartHome dataset, which consists of real-world indoor activities. The results indicate that the proposed system achieves competitive classification accuracy for a range of daily activities, highlighting its potential as an essential component for advanced AAL monitoring solutions. This advancement supports the broader goal of developing intelligent systems that promote safety and autonomy among older adults.
[135] InverseNet: Benchmarking Operator Mismatch and Calibration Across Compressive Imaging Modalities
Chengshuai Yang, Xin Yuan
Main category: cs.CV
TL;DR: InverseNet is the first cross-modality benchmark for operator mismatch in compressive imaging systems, evaluating 12 methods across 4 scenarios to reveal that deep learning methods lose significant performance under mismatch and that operator-conditioned architectures are crucial for robustness.
Details
Motivation: Current compressive imaging benchmarks assume perfect forward operators, but real deployed systems always have operator mismatch. No existing benchmark quantifies this mismatch, which is the default condition in practice, leading to overestimated performance of deep learning methods.Method: Introduced InverseNet benchmark spanning CASSI, CACTI, and single-pixel cameras. Evaluated 12 methods under 4 scenarios: ideal, mismatched, oracle-corrected, and blind calibration. Used 27 simulated scenes and 9 real hardware captures with comprehensive protocol.
Result: Deep learning methods lose 10-21 dB under mismatch, eliminating their advantage over classical baselines. Performance and robustness are inversely correlated across modalities. Mask-oblivious architectures recover 0% of mismatch losses, while operator-conditioned methods recover 41-90%. Blind grid-search calibration recovers 85-100% of oracle bound without ground truth.
Conclusion: Operator mismatch is critical in compressive imaging systems and must be addressed. Operator-conditioned architectures and blind calibration methods are essential for real-world deployment. Simulation findings transfer to physical hardware, validating the benchmark’s relevance.
Abstract: State-of-the-art EfficientSCI loses 20.58 dB when its assumed forward operator deviates from physical reality in just eight parameters, yet no existing benchmark quantifies operator mismatch, the default condition in deployed compressive imaging systems. We introduce InverseNet, the first cross-modality benchmark for operator mismatch, spanning CASSI, CACTI, and single-pixel cameras. Evaluating 12 methods under a four-scenario protocol (ideal, mismatched, oracle-corrected, blind calibration) across 27 simulated scenes and 9 real hardware captures, we find: (1) deep learning methods lose 10-21 dB under mismatch, eliminating their advantage over classical baselines; (2) performance and robustness are inversely correlated across modalities (Spearman r_s = -0.71, p < 0.01); (3) mask-oblivious architectures recover 0% of mismatch losses regardless of calibration quality, while operator-conditioned methods recover 41-90%; (4) blind grid-search calibration recovers 85-100% of the oracle bound without ground truth. Real hardware experiments confirm that simulation trends transfer to physical data. Code will be released upon acceptance.
[136] Fusion and Grouping Strategies in Deep Learning for Local Climate Zone Classification of Multimodal Remote Sensing Data
Ancymol Thomas, Jaya Sreevalsan-Nair
Main category: cs.CV
TL;DR: This paper analyzes multimodal fusion strategies for Local Climate Zone classification using SAR and multispectral remote sensing data, comparing different CNN-based fusion architectures.
Details
Motivation: There's a need to comprehensively analyze fusion mechanisms in deep learning classifiers for multimodal remote sensing data in LCZ classification, as data fusion is crucial for improving accuracy but existing approaches lack systematic comparison of fusion strategies.Method: The study analyzes four fusion models: baseline hybrid fusion (FM1), self- and cross-attention mechanisms (FM2), multi-scale Gaussian filtered images (FM3), and weighted decision-level fusion (FM4). Ablation experiments examine pixel-, feature-, and decision-level fusion effects. Grouping strategies include band grouping within modalities and label merging in ground truth, tested on the So2Sat LCZ42 dataset with SAR and MSI image pairs.
Result: FM1 consistently outperforms simple fusion methods, with FM1 combined with band grouping and label merging achieving the best overall accuracy of 76.6%. The study highlights how these strategies improve prediction accuracy for underrepresented classes.
Conclusion: Hybrid fusion with appropriate grouping strategies is most effective for multimodal LCZ classification, and systematic analysis of fusion mechanisms provides insights for improving classification accuracy, particularly for underrepresented classes.
Abstract: Local Climate Zones (LCZs) give a zoning map to study urban structures and land use and analyze the impact of urbanization on local climate. Multimodal remote sensing enables LCZ classification, for which data fusion is significant for improving accuracy owing to the data complexity. However, there is a gap in a comprehensive analysis of the fusion mechanisms used in their deep learning (DL) classifier architectures. This study analyzes different fusion strategies in the multi-class LCZ classification models for multimodal data and grouping strategies based on inherent data characteristics. The different models involving Convolutional Neural Networks (CNNs) include: (i) baseline hybrid fusion (FM1), (ii) with self- and cross-attention mechanisms (FM2), (iii) with the multi-scale Gaussian filtered images (FM3), and (iv) weighted decision-level fusion (FM4). Ablation experiments are conducted to study the pixel-, feature-, and decision-level fusion effects in the model performance. Grouping strategies include band grouping (BG) within the data modalities and label merging (LM) in the ground truth. Our analysis is exclusively done on the So2Sat LCZ42 dataset, which consists of Synthetic Aperture Radar (SAR) and Multispectral Imaging (MSI) image pairs. Our results show that FM1 consistently outperforms simple fusion methods. FM1 with BG and LM is found to be the most effective approach among all fusion strategies, giving an overall accuracy of 76.6%. Importantly, our study highlights the effect of these strategies in improving prediction accuracy for the underrepresented classes. Our code and processed datasets are available at https://github.com/GVCL/LCZC-MultiModalHybridFusion
[137] Structure-Guided Histopathology Synthesis via Dual-LoRA Diffusion
Xuan Xu, Prateek Prasanna
Main category: cs.CV
TL;DR: A unified diffusion framework for histopathology image synthesis that uses nuclei centroids as spatial priors and task-specific LoRA adapters to handle both local structure completion and global structure synthesis in a single model.
Details
Motivation: Existing generative methods treat restoration and generation as separate tasks despite sharing the same objective of structure-consistent tissue synthesis. Current approaches rely on weak structural priors that limit realistic cellular organization in histopathology images.Method: Dual-LoRA Controllable Diffusion framework using multi-class nuclei centroids as lightweight spatial priors. Two task-specific LoRA adapters specialize a shared diffusion backbone for local structure completion (restoration) and global structure synthesis (generation) without retraining separate models.
Result: Significant improvements over state-of-the-art GAN and diffusion baselines: LPIPS in masked regions improved from 0.1797 to 0.1524 for local completion, and FID improved from 225.15 to 76.04 for global synthesis, indicating better structural fidelity and realism.
Conclusion: The unified framework achieves more faithful structural recovery in masked regions and substantially improved realism and morphology consistency in full synthesis, supporting scalable pan-cancer histopathology modeling with a single model.
Abstract: Histopathology image synthesis plays an important role in tissue restoration, data augmentation, and modeling of tumor microenvironments. However, existing generative methods typically address restoration and generation as separate tasks, although both share the same objective of structure-consistent tissue synthesis under varying degrees of missingness, and often rely on weak or inconsistent structural priors that limit realistic cellular organization. We propose Dual-LoRA Controllable Diffusion, a unified centroid-guided diffusion framework that jointly supports Local Structure Completion and Global Structure Synthesis within a single model. Multi-class nuclei centroids serve as lightweight and annotation-efficient spatial priors, providing biologically meaningful guidance under both partial and complete image absence. Two task-specific LoRA adapters specialize the shared backbone for local and global objectives without retraining separate diffusion models. Extensive experiments demonstrate consistent improvements over state-of-the-art GAN and diffusion baselines across restoration and synthesis tasks. For local completion, LPIPS computed within the masked region improves from 0.1797 (HARP) to 0.1524, and for global synthesis, FID improves from 225.15 (CoSys) to 76.04, indicating improved structural fidelity and realism. Our approach achieves more faithful structural recovery in masked regions and substantially improved realism and morphology consistency in full synthesis, supporting scalable pan-cancer histopathology modeling.
[138] Mask-aware inference with State-Space Models
Ignasi Mas, Ramon Morros, Javier-Ruiz Hidalgo, Ivan Huerta
Main category: cs.CV
TL;DR: Partial Vision Mamba (PVM) extends Mamba SSMs to handle arbitrarily shaped missing/invalid data in vision tasks through mask-aware partial operations.
Details
Motivation: Real-world vision tasks often deal with inputs containing arbitrarily shaped regions of missing or invalid data. While CNNs have Partial Convolutions for this, recent State Space Models (SSMs) like Mamba lack mechanisms to handle such incomplete data at inference time.Method: Introduces Partial Vision Mamba (PVM), a novel architectural component that ports partial operation principles to Mamba backbone. Defines design rules for architectures using PVM, enabling mask-aware re-normalization conditioned only on valid pixels.
Result: Demonstrates efficacy and generalizability across depth completion, image inpainting, and classification with invalid data tasks.
Conclusion: PVM successfully bridges the gap between SSMs’ linear complexity advantages and the need to handle incomplete visual data, extending Mamba’s capabilities to real-world vision applications with missing information.
Abstract: Many real-world computer vision tasks, such as depth completion, must handle inputs with arbitrarily shaped regions of missing or invalid data. For Convolutional Neural Networks (CNNs), Partial Convolutions solved this by a mask-aware re-normalization conditioned only on valid pixels. Recently, State Space Models (SSMs) like Mamba have emerged, offering high performance with linear complexity. However, these architectures lack an inherent mechanism for handling such arbitrarily shaped invalid data at inference time. To bridge this gap, we introduce Partial Vision Mamba (PVM), a novel architectural component that ports the principles of partial operations to the Mamba backbone. We also define a series of rules to design architectures using PVM. We show the efficacy and generalizability of our approach in the tasks of depth completion, image inpainting, and classification with invalid data.
[139] A 360-degree Multi-camera System for Blue Emergency Light Detection Using Color Attention RT-DETR and the ABLDataset
Francisco Vacalebri-Lloret, Lucas Banchero, Jose J. Lopez, Jose M. Mossi
Main category: cs.CV
TL;DR: A computer vision system for detecting emergency vehicle blue lights using RT-DETR with color attention blocks, achieving 94.7% accuracy with 70m range detection and angle estimation for ADAS integration.
Details
Motivation: To enhance road safety and Advanced Driver Assistance Systems (ADAS) by developing a reliable system for detecting emergency vehicles through their blue lights, which is crucial for timely driver awareness and response.Method: Uses ABLDataset with European emergency vehicle images under various conditions, employs four 180-degree fisheye cameras with calibration for azimuthal localization, compares multiple DNNs (YOLO variants, RetinaNet, Faster R-CNN, RT-DETR), enhances RT-DETR with color attention blocks, and uses geometric transformations for approach angle estimation.
Result: Achieved 94.7% accuracy and 94.1% recall on test set, with field test detections up to 70 meters, and successful integration into a multimodal system combining visual and acoustic data.
Conclusion: The system demonstrates high efficiency for emergency vehicle detection and is designed for multimodal ADAS integration, offering promising improvements to road safety through combined visual and acoustic data processing.
Abstract: This study presents an advanced system for detecting blue lights on emergency vehicles, developed using ABLDataset, a curated dataset that includes images of European emergency vehicles under various climatic and geographic conditions. The system employs a configuration of four fisheye cameras, each with a 180-degree horizontal field of view, mounted on the sides of the vehicle. A calibration process enables the azimuthal localization of the detections. Additionally, a comparative analysis of major deep neural network algorithms was conducted, including YOLO (v5, v8, and v10), RetinaNet, Faster R-CNN, and RT-DETR. RT-DETR was selected as the base model and enhanced through the incorporation of a color attention block, achieving an accuracy of 94.7 percent and a recall of 94.1 percent on the test set, with field test detections reaching up to 70 meters. Furthermore, the system estimates the approach angle of the emergency vehicle relative to the center of the car using geometric transformations. Designed for integration into a multimodal system that combines visual and acoustic data, this system has demonstrated high efficiency, offering a promising approach to enhancing Advanced Driver Assistance Systems (ADAS) and road safety.
[140] PinPoint: Evaluation of Composed Image Retrieval with Explicit Negatives, Multi-Image Queries, and Paraphrase Testing
Rohan Mahadev, Joyce Yuan, Patrick Poirson, David Xue, Hao-Yu Wu, Dmitry Kislyuk
Main category: cs.CV
TL;DR: PinPoint is a comprehensive benchmark for Composed Image Retrieval (CIR) with 7,635 queries and 329K relevance judgments, featuring multiple correct answers, hard negatives, paraphrases, multi-image composition, and fairness evaluation, revealing limitations in current methods and proposing a training-free MLLM-based reranking solution.
Details
Motivation: Current CIR benchmarks are limited to single ground-truth answers and lack annotations for evaluating false positive avoidance, robustness, multi-image reasoning, and fairness. There's a need for more comprehensive evaluation frameworks to uncover real-world limitations of CIR systems.Method: Created PinPoint benchmark with 7,635 queries across 23 categories, featuring: 1) multiple correct answers (avg 9.1 per query), 2) explicit hard negatives, 3) six instruction paraphrases per query, 4) multi-image composition support (13.4% of queries), and 5) demographic metadata. Evaluated 20+ methods across 4 paradigms and proposed training-free reranking using off-the-shelf MLLM.
Result: Best methods achieve mAP@10 of 28.5% but still retrieve irrelevant results 9% of the time. Models show 25.1% performance variation across paraphrases. Multi-image queries perform 40-70% worse. The MLLM-based reranking method helps bridge performance gaps without additional training.
Conclusion: PinPoint reveals significant limitations in current CIR systems regarding false positives, robustness to paraphrases, and multi-image reasoning. The benchmark provides comprehensive evaluation capabilities and the proposed MLLM-based reranking offers a practical solution to improve existing systems.
Abstract: Composed Image Retrieval (CIR) has made significant progress, yet current benchmarks are limited to single ground-truth answers and lack the annotations needed to evaluate false positive avoidance, robustness and multi-image reasoning. We present PinPoint, a comprehensive real world benchmark with 7,635 queries and 329K relevance judgments across 23 query categories. PinPoint advances the field by providing: (1) multiple correct answers (averaging 9.1 per query) (2) explicit hard negatives, (3) six instruction paraphrases per query for robustness testing, (4) multi-image composition support (13.4% of queries), and (5) demographic metadata for fairness evaluation. Based on our analysis of 20+ methods across 4 different major paradigms, we uncover three significant drawbacks: The best methods while achieving mAP@10 of 28.5%, still retrieves irrelevant results (hard negatives) 9% of the time. The best models also exhibit 25.1% performance variation across paraphrases, indicating significant potential for enhancing current CIR techniques. Multi-image queries performs 40 to 70% worse across different methods. To overcome these new issues uncovered by our evaluation framework, we propose a training-free reranking method based on an off-the-shelf MLLM that can be applied to any existing system to bridge the gap. We release the complete dataset, including all images, queries, annotations, retrieval index, and benchmarking code.
[141] The Impact of Preprocessing Methods on Racial Encoding and Model Robustness in CXR Diagnosis
Dishantkumar Sutariya, Eike Petersen
Main category: cs.CV
TL;DR: Simple lung cropping from chest X-rays reduces racial bias in diagnostic models while maintaining diagnostic accuracy, challenging the fairness-accuracy trade-off assumption.
Details
Motivation: Deep learning models can identify racial identity from chest X-rays, raising concerns about racial shortcut learning that could lead to biased diagnostic predictions and threaten healthcare equity. Image preprocessing methods may influence racial shortcut learning but remain underexplored.Method: Investigates effects of three image preprocessing methods: lung masking, lung cropping, and Contrast Limited Adaptive Histogram Equalization (CLAHE). These aim to suppress spurious cues encoding racial information while preserving diagnostic accuracy.
Result: Simple bounding box-based lung cropping proves effective at reducing racial shortcut learning while maintaining diagnostic model performance, bypassing frequently postulated fairness-accuracy trade-offs.
Conclusion: Basic preprocessing techniques like lung cropping can mitigate racial bias in medical imaging models without sacrificing diagnostic accuracy, offering a practical approach to address fairness concerns in healthcare AI.
Abstract: Deep learning models can identify racial identity with high accuracy from chest X-ray (CXR) recordings. Thus, there is widespread concern about the potential for racial shortcut learning, where a model inadvertently learns to systematically bias its diagnostic predictions as a function of racial identity. Such racial biases threaten healthcare equity and model reliability, as models may systematically misdiagnose certain demographic groups. Since racial shortcuts are diffuse - non-localized and distributed throughout the whole CXR recording - image preprocessing methods may influence racial shortcut learning, yet the potential of such methods for reducing biases remains underexplored. Here, we investigate the effects of image preprocessing methods including lung masking, lung cropping, and Contrast Limited Adaptive Histogram Equalization (CLAHE). These approaches aim to suppress spurious cues encoding racial information while preserving diagnostic accuracy. Our experiments reveal that simple bounding box-based lung cropping can be an effective strategy for reducing racial shortcut learning while maintaining diagnostic model performance, bypassing frequently postulated fairness-accuracy trade-offs.
[142] SGR3 Model: Scene Graph Retrieval-Reasoning Model in 3D
Zirui Wang, Ruiping Liu, Yufan Chen, Junwei Zheng, Weijia Fan, Kunyu Peng, Di Wen, Jiale Wei, Jiaming Zhang, Rainer Stiefelhagen
Main category: cs.CV
TL;DR: SGR3 Model is a training-free framework using multimodal LLMs with retrieval-augmented generation for 3D scene graph generation, bypassing explicit 3D reconstruction and improving relational reasoning through retrieved scene graphs.
Details
Motivation: Existing 3D scene graph generation approaches require multi-modal data and heuristic graph construction, which limits relationship prediction. The authors aim to create a training-free framework that leverages MLLMs to overcome these limitations.Method: Proposes SGR3 Model using multimodal LLMs with retrieval-augmented generation. Uses ColPali-style cross-modal framework for retrieval, introduces weighted patch-level similarity selection to handle blurry/uninformative regions, and bypasses explicit 3D reconstruction.
Result: Achieves competitive performance compared to training-free baselines and performs on par with GNN-based expert models. Ablation studies show retrieved external information is explicitly integrated into token generation rather than implicitly internalized.
Conclusion: SGR3 Model demonstrates effective 3D scene graph generation using multimodal LLMs with RAG, offering a training-free alternative that avoids explicit reconstruction while maintaining competitive performance.
Abstract: 3D scene graphs provide a structured representation of object entities and their relationships, enabling high-level interpretation and reasoning for robots while remaining intuitively understandable to humans. Existing approaches for 3D scene graph generation typically combine scene reconstruction with graph neural networks (GNNs). However, such pipelines require multi-modal data that may not always be available, and their reliance on heuristic graph construction can constrain the prediction of relationship triplets. In this work, we introduce a Scene Graph Retrieval-Reasoning Model in 3D (SGR3 Model), a training-free framework that leverages multi-modal large language models (MLLMs) with retrieval-augmented generation (RAG) for semantic scene graph generation. SGR3 Model bypasses the need for explicit 3D reconstruction. Instead, it enhances relational reasoning by incorporating semantically aligned scene graphs retrieved via a ColPali-style cross-modal framework. To improve retrieval robustness, we further introduce a weighted patch-level similarity selection mechanism that mitigates the negative impact of blurry or semantically uninformative regions. Experiments demonstrate that SGR3 Model achieves competitive performance compared to training-free baselines and on par with GNN-based expert models. Moreover, an ablation study on the retrieval module and knowledge base scale reveals that retrieved external information is explicitly integrated into the token generation process, rather than being implicitly internalized through abstraction.
[143] Spinverse: Differentiable Physics for Permeability-Aware Microstructure Reconstruction from Diffusion MRI
Prathamesh Pradeep Khole, Mario M. Brenes, Zahra Kais Petiwala, Ehsan Mirafzali, Utkarsh Gupta, Jing-Rebecca Li, Andrada Ianus, Razvan Marinescu
Main category: cs.CV
TL;DR: Spinverse: A permeability-aware reconstruction method that inverts diffusion MRI measurements using a differentiable Bloch-Torrey simulator to recover microstructural boundaries without fixed topology assumptions.
Details
Motivation: Existing dMRI methods either assume impermeable boundaries or estimate voxel-level parameters without recovering explicit interfaces, limiting accurate microstructural boundary reconstruction.Method: Uses a fully differentiable Bloch-Torrey simulator on fixed tetrahedral grid with learnable face permeabilities; optimizes permeability field through backpropagation of signal-matching loss; employs mesh-based geometric priors and staged multi-sequence optimization curriculum.
Result: Successfully reconstructs diverse geometries across synthetic voxel meshes; demonstrates that sequence scheduling and regularization are critical for avoiding outline-only solutions while improving boundary accuracy and structural validity.
Conclusion: Spinverse enables permeability-aware reconstruction of microstructural boundaries without fixed topology assumptions, advancing dMRI-based tissue microstructure analysis through differentiable simulation and optimization techniques.
Abstract: Diffusion MRI (dMRI) is sensitive to microstructural barriers, yet most existing methods either assume impermeable boundaries or estimate voxel-level parameters without recovering explicit interfaces. We present Spinverse, a permeability-aware reconstruction method that inverts dMRI measurements through a fully differentiable Bloch-Torrey simulator. Spinverse represents tissue on a fixed tetrahedral grid and treats each interior face permeability as a learnable parameter; low-permeability faces act as diffusion barriers, so microstructural boundaries whose topology is not fixed a priori (up to the resolution of the ambient mesh) emerge without changing mesh connectivity or vertex positions. Given a target signal, we optimize face permeabilities by backpropagating a signal-matching loss through the PDE forward model, and recover an interface by thresholding the learned permeability field. To mitigate the ill-posedness of permeability inversion, we use mesh-based geometric priors; to avoid local minima, we use a staged multi-sequence optimization curriculum. Across a collection of synthetic voxel meshes, Spinverse reconstructs diverse geometries and demonstrates that sequence scheduling and regularization are critical to avoid outline-only solutions while improving both boundary accuracy and structural validity.
[144] sFRC for assessing hallucinations in medical image restoration
Prabhat Kc, Rongping Zeng, Nirmal Soni, Aldo Badano
Main category: cs.CV
TL;DR: Proposes sFRC, a patch-based Fourier Ring Correlation method to detect hallucinations in DL-based medical image restoration, validated on CT and MRI tasks.
Details
Motivation: DL methods for medical image restoration can produce visually appealing but hallucinated outputs, with limited tools for detecting such artifacts.Method: sFRC performs Fourier Ring Correlation analysis over small patches scanned across DL outputs and references to detect hallucinations, using expert annotations or theory-based maps.
Result: Demonstrated effectiveness on CT super-resolution/sparse-view and MRI subsampled restoration, showing agreement with theory-based maps and quantifying hallucination rates.
Conclusion: sFRC provides a robust method for detecting hallucinations in DL-based medical image restoration, applicable across different imaging modalities and methods.
Abstract: Deep learning (DL) methods are currently being explored to restore images from sparse-view-, limited-data-, and undersampled-based acquisitions in medical applications. Although outputs from DL may appear visually appealing based on likability/subjective criteria (such as less noise, smooth features), they may also suffer from hallucinations. This issue is further exacerbated by a lack of easy-to-use techniques and robust metrics for the identification of hallucinations in DL outputs. In this work, we propose performing Fourier Ring Correlation (FRC) analysis over small patches and concomitantly (s)canning across DL outputs and their reference counterparts to detect hallucinations (termed as sFRC). We describe the rationale behind sFRC and provide its mathematical formulation. The parameters essential to sFRC may be set using predefined hallucinated features annotated by subject matter experts or using imaging theory-based hallucination maps. We use sFRC to detect hallucinations for three undersampled medical imaging problems: CT super-resolution, CT sparse view, and MRI subsampled restoration. In the testing phase, we demonstrate sFRC’s effectiveness in detecting hallucinated features for the CT problem and sFRC’s agreement with imaging theory-based outputs on hallucinated feature maps for the MR problem. Finally, we quantify the hallucination rates of DL methods on in-distribution versus out-of-distribution data and under increasing subsampling rates to characterize the robustness of DL methods. Beyond DL-based methods, sFRC’s effectiveness in detecting hallucinations for a conventional regularization-based restoration method and a state-of-the-art unrolled method is also shown.
[145] Decoding the Pulse of Reasoning VLMs in Multi-Image Understanding Tasks
Chenjun Li
Main category: cs.CV
TL;DR: PulseFocus improves multi-image reasoning in vision-language models by addressing diffuse attention patterns during chain-of-thought generation through training-free inference-time attention gating.
Details
Motivation: Multi-image reasoning is challenging for VLMs, with observed diffuse "pulse" attention patterns during CoT generation that fail to focus on task-relevant images, plus systematic positional bias in attention allocation across images.Method: PulseFocus structures CoT reasoning into interleaved plan/focus blocks with soft attention gating. The model explicitly plans which image to examine, then gates decode-time attention to the referenced image, all without additional training.
Result: Consistent improvements on multi-image benchmarks: +3.7% on BLINK benchmark and +1.07% on MuirBench, demonstrating sharper attention focus and better multi-image reasoning.
Conclusion: PulseFocus effectively addresses attention diffusion in multi-image reasoning through inference-time attention gating, improving VLM performance without requiring additional training.
Abstract: Multi-image reasoning remains a significant challenge for vision-language models (VLMs). We investigate a previously overlooked phenomenon: during chain-of-thought (CoT) generation, the text-to-image (T2I) attention of reasoning VLMs exhibits diffuse “pulses”: sporadic and unfocused attention patterns that fail to concentrate on task-relevant images. We further reveal a systematic positional bias in attention allocation across images. Motivated by these observations, we propose PulseFocus, a training-free, inference-time method that structures CoT reasoning into interleaved plan/focus blocks with soft attention gating. By forcing the model to explicitly plan which image to examine and then gating decode-time attention to the referenced image, PulseFocus sharpens attention focus and yields consistent improvements on multi-image benchmarks like BLINK benchmark (+3.7%) and MuirBench (+1.07%).
[146] A Benchmark Study of Neural Network Compression Methods for Hyperspectral Image Classification
Sai Shi
Main category: cs.CV
TL;DR: Systematic evaluation of neural network compression methods (pruning, quantization, knowledge distillation) for hyperspectral land cover classification in remote sensing applications.
Details
Motivation: Deep neural networks have high computational and memory requirements that limit deployment on resource-constrained platforms like remote sensing devices and edge systems, necessitating network compression techniques.Method: Systematic evaluation of three compression strategies for convolutional neural networks: pruning, quantization, and knowledge distillation. Experiments conducted on two benchmark hyperspectral datasets measuring classification accuracy, memory consumption, and inference efficiency.
Result: Compressed models significantly reduce model size and computational cost while maintaining competitive classification performance. The study provides insights into trade-offs between compression ratio, efficiency, and accuracy.
Conclusion: Compression techniques show potential for enabling efficient deep learning deployment in remote sensing applications by balancing model size reduction with maintained performance.
Abstract: Deep neural networks have achieved strong performance in image classification tasks due to their ability to learn complex patterns from high-dimensional data. However, their large computational and memory requirements often limit deployment on resource-constrained platforms such as remote sensing devices and edge systems. Network compression techniques have therefore been proposed to reduce model size and computational cost while maintaining predictive performance. In this study, we conduct a systematic evaluation of neural network compression methods for a remote sensing application, namely hyperspectral land cover classification. Specifically, we examine three widely used compression strategies for convolutional neural networks: pruning, quantization, and knowledge distillation. Experiments are conducted on two benchmark hyperspectral datasets, considering classification accuracy, memory consumption, and inference efficiency. Our results demonstrate that compressed models can significantly reduce model size and computational cost while maintaining competitive classification performance. These findings provide insights into the trade-offs between compression ratio, efficiency, and accuracy, and highlight the potential of compression techniques for enabling efficient deep learning deployment in remote sensing applications.
[147] Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild
Shanle Yao, Armin Danesh Pazho, Narges Rashvand, Hamed Tabkhi
Main category: cs.CV
TL;DR: MLLMs show promise for video anomaly detection but suffer from conservative bias and recall collapse in zero-shot settings, requiring careful prompting strategies to improve performance.
Details
Motivation: To systematically evaluate MLLMs' reliability for real-world Video Anomaly Detection (VAD) and explore their potential as a paradigm shift from conventional reconstruction/pose-based methods to language-guided reasoning.Method: Reformulate VAD as binary classification under weak temporal supervision, evaluate state-of-the-art MLLMs on ShanghaiTech and CHAD benchmarks, investigate prompt specificity and temporal window lengths (1s-3s), and analyze precision-recall trade-offs.
Result: MLLMs exhibit pronounced conservative bias in zero-shot settings with high confidence but disproportionate favor for ’normal’ class, leading to high precision but recall collapse. Class-specific instructions can significantly improve F1-score (from 0.09 to 0.64 on ShanghaiTech), but recall remains a critical bottleneck.
Conclusion: There’s a significant performance gap for MLLMs in noisy VAD environments, highlighting the need for recall-oriented prompting and model calibration for open-world surveillance applications requiring complex video understanding and reasoning.
Abstract: Multimodal large language models (MLLMs) have demonstrated impressive general competence in video understanding, yet their reliability for real-world Video Anomaly Detection (VAD) remains largely unexplored. Unlike conventional pipelines relying on reconstruction or pose-based cues, MLLMs enable a paradigm shift: treating anomaly detection as a language-guided reasoning task. In this work, we systematically evaluate state-of-the-art MLLMs on the ShanghaiTech and CHAD benchmarks by reformulating VAD as a binary classification task under weak temporal supervision. We investigate how prompt specificity and temporal window lengths (1s–3s) influence performance, focusing on the precision–recall trade-off. Our findings reveal a pronounced conservative bias in zero-shot settings; while models exhibit high confidence, they disproportionately favor the ’normal’ class, resulting in high precision but a recall collapse that limits practical utility. We demonstrate that class-specific instructions can significantly shift this decision boundary, improving the peak F1-score on ShanghaiTech from 0.09 to 0.64, yet recall remains a critical bottleneck. These results highlight a significant performance gap for MLLMs in noisy environments and provide a foundation for future work in recall-oriented prompting and model calibration for open-world surveillance, which demands complex video understanding and reasoning.
[148] FOZO: Forward-Only Zeroth-Order Prompt Optimization for Test-Time Adaptation
Xingyu Wang, Tao Wang
Main category: cs.CV
TL;DR: FOZO is a forward-only zeroth-order optimization method for test-time adaptation that avoids backpropagation, making it suitable for resource-limited devices while achieving competitive performance on distribution shifts.
Details
Motivation: Current TTA methods have limitations: backpropagation-based approaches are computationally expensive and modify model weights, making them unsuitable for low-end devices, while traditional backpropagation-free methods have constrained adaptation capabilities.Method: FOZO uses memory-efficient zeroth-order prompt optimization guided by objectives that optimize both intermediate feature statistics and prediction entropy. It introduces dynamically decaying perturbation scale for stable adaptation and theoretically proves convergence under TTA data stream assumptions.
Result: FOZO achieves 59.52% Top-1 accuracy on ImageNet-C (5K, level 5), outperforming gradient-based methods and state-of-the-art forward-only FOA (58.13%). It also shows strong generalization on quantized (INT8) models.
Conclusion: FOZO is a highly competitive solution for TTA deployment in resource-limited scenarios, offering efficient adaptation without backpropagation while maintaining strong performance on distribution shifts.
Abstract: Test-Time Adaptation (TTA) is essential for enabling deep learning models to handle real-world data distribution shifts. However, current approaches face significant limitations: backpropagation-based methods are not suitable for low-end deployment devices, due to their high computation and memory requirements, as well as their tendency to modify model weights during adaptation; while traditional backpropagation-free techniques exhibit constrained adaptation capabilities. In this work, we propose Forward-Only Zeroth-Order Optimization (FOZO), a novel and practical backpropagation-free paradigm for TTA. FOZO leverages a memory-efficient zeroth-order prompt optimization, which is led by objectives optimizing both intermediate feature statistics and prediction entropy. To ensure efficient and stable adaptation over the out-of-distribution data stream, we introduce a dynamically decaying perturbation scale during zeroth-order gradient estimation and theoretically prove its convergence under the TTA data stream assumption. Extensive continual adaptation experiments on ImageNet-C, ImageNet-R, and ImageNet-Sketch demonstrate FOZO’s superior performance, achieving 59.52% Top-1 accuracy on ImageNet-C (5K, level 5) and outperforming main gradient-based methods and SOTA forward-only FOA (58.13%). Furthermore, FOZO exhibits strong generalization on quantized (INT8) models. These findings demonstrate that FOZO is a highly competitive solution for TTA deployment in resource-limited scenarios.
[149] Toward Real-world Infrared Image Super-Resolution: A Unified Autoregressive Framework and Benchmark Dataset
Yang Zou, Jun Ma, Zhidong Jiao, Xingyuan Li, Zhiying Jiang, Jinyuan Liu
Main category: cs.CV
TL;DR: Real-IISR: A unified autoregressive framework for real-world infrared image super-resolution that addresses coupled optical and sensing degradations while preserving thermal fidelity through thermal-structural guidance and condition-adaptive codebooks.
Details
Motivation: Real-world infrared image super-resolution is practically significant but rarely addressed. Existing works use simulated datasets or ignore intrinsic differences between infrared and visible imaging. Real infrared images suffer from coupled optical and sensing degradations that deteriorate both structural sharpness and thermal fidelity.Method: Proposes Real-IISR, a unified autoregressive framework that progressively reconstructs fine-grained thermal structures and clear backgrounds via thermal-structural guided visual autoregression. Includes Thermal-Structural Guidance module, Condition-Adaptive Codebook that dynamically modulates discrete representations based on degradation-aware thermal priors, and Thermal Order Consistency Loss to enforce monotonic relation between temperature and pixel intensity.
Result: Built FLIR-IISR dataset with paired LR-HR infrared images acquired via automated focus variation and motion-induced blur. Extensive experiments demonstrate promising performance, providing a unified foundation for real-world IISR and benchmarking.
Conclusion: Real-IISR addresses the challenging task of real-world infrared image super-resolution by considering coupled degradations and thermal fidelity, offering a comprehensive solution with a new dataset and framework for this underexplored area.
Abstract: Infrared image super-resolution (IISR) under real-world conditions is a practically significant yet rarely addressed task. Pioneering works are often trained and evaluated on simulated datasets or neglect the intrinsic differences between infrared and visible imaging. In practice, however, real infrared images are affected by coupled optical and sensing degradations that jointly deteriorate both structural sharpness and thermal fidelity. To address these challenges, we propose Real-IISR, a unified autoregressive framework for real-world IISR that progressively reconstructs fine-grained thermal structures and clear backgrounds in a scale-by-scale manner via thermal-structural guided visual autoregression. Specifically, a Thermal-Structural Guidance module encodes thermal priors to mitigate the mismatch between thermal radiation and structural edges. Since non-uniform degradations typically induce quantization bias, Real-IISR adopts a Condition-Adaptive Codebook that dynamically modulates discrete representations based on degradation-aware thermal priors. Also, a Thermal Order Consistency Loss enforces a monotonic relation between temperature and pixel intensity, ensuring relative brightness order rather than absolute values to maintain physical consistency under spatial misalignment and thermal drift. We build FLIR-IISR, a real-world IISR dataset with paired LR-HR infrared images acquired via automated focus variation and motion-induced blur. Extensive experiments demonstrate the promising performance of Real-IISR, providing a unified foundation for real-world IISR and benchmarking. The dataset and code are available at: https://github.com/JZD151/Real-IISR.
[150] Evaluating GPT-5 as a Multimodal Clinical Reasoner: A Landscape Commentary
Alexandru Florea, Shansong Wang, Mingzhe Hu, Qiang Li, Zach Eidex, Luke del Balzo, Mojtaba Safari, Xiaofeng Yang
Main category: cs.CV
TL;DR: GPT-5 family shows substantial improvements over GPT-4o in medical reasoning tasks, particularly in text-based medical exams and multimodal visual question answering, but still lags behind specialized domain-specific models in perception-critical tasks like mammography and neuroradiology.
Details
Motivation: To evaluate whether general-purpose foundation models like GPT-5 can support the integrated multimodal reasoning required in clinical medicine, where diagnosis requires synthesis of ambiguous patient narratives, lab data, and multimodal imaging.Method: Controlled cross-sectional evaluation of GPT-5 family (GPT-5, GPT-5 Mini, GPT-5 Nano) against GPT-4o using standardized zero-shot chain-of-thought protocol across diverse clinically grounded tasks: medical education exams, text-based reasoning benchmarks, and visual question-answering in neuroradiology, digital pathology, and mammography.
Result: GPT-5 showed substantial gains in expert-level textual reasoning (25+ percentage-point improvements on MedXpertQA). For multimodal synthesis, GPT-5 effectively grounded uncertain clinical narratives in imaging evidence, achieving state-of-the-art or competitive VQA performance, outperforming GPT-4o by 10-40% in mammography tasks. However, performance remained moderate in neuroradiology (44% macro-average accuracy) and lagged behind specialized mammography systems (52-64% vs >80%).
Conclusion: GPT-5 represents meaningful advance toward integrated multimodal clinical reasoning, mirroring clinician cognitive processes of biasing uncertain information with objective findings, but generalist models are not yet substitutes for purpose-built systems in highly specialized, perception-critical tasks.
Abstract: The transition from task-specific artificial intelligence toward general-purpose foundation models raises fundamental questions about their capacity to support the integrated reasoning required in clinical medicine, where diagnosis demands synthesis of ambiguous patient narratives, laboratory data, and multimodal imaging. This landscape commentary provides the first controlled, cross-sectional evaluation of the GPT-5 family (GPT-5, GPT-5 Mini, GPT-5 Nano) against its predecessor GPT-4o across a diverse spectrum of clinically grounded tasks, including medical education examinations, text-based reasoning benchmarks, and visual question-answering in neuroradiology, digital pathology, and mammography using a standardized zero-shot chain-of-thought protocol. GPT-5 demonstrated substantial gains in expert-level textual reasoning, with absolute improvements exceeding 25 percentage-points on MedXpertQA. When tasked with multimodal synthesis, GPT-5 effectively leveraged this enhanced reasoning capacity to ground uncertain clinical narratives in concrete imaging evidence, achieving state-of-the-art or competitive performance across most VQA benchmarks and outperforming GPT-4o by margins of 10-40% in mammography tasks requiring fine-grained lesion characterization. However, performance remained moderate in neuroradiology (44% macro-average accuracy) and lagged behind domain-specific models in mammography, where specialized systems exceed 80% accuracy compared to GPT-5’s 52-64%. These findings indicate that while GPT-5 represents a meaningful advance toward integrated multimodal clinical reasoning, mirroring the clinician’s cognitive process of biasing uncertain information with objective findings, generalist models are not yet substitutes for purpose-built systems in highly specialized, perception-critical tasks.
[151] Evaluating and Correcting Human Annotation Bias in Dynamic Micro-Expression Recognition
Feng Liu, Bingyu Nan, Xuezhong Qian, Xiaolan Fu
Main category: cs.CV
TL;DR: GAMDSS architecture improves micro-expression recognition by automatically reselecting keyframes (Onset, Apex, Offset) to reduce human labeling errors, especially in cross-cultural datasets, without adding parameters.
Details
Motivation: Manual labeling of micro-expressions is prone to errors, particularly in cross-cultural scenarios where labeling consistency suffers. Current methods rely on human-annotated keyframes that may be inaccurate, limiting model performance and generalizability.Method: Proposes Global Anti-Monotonic Differential Selection Strategy (GAMDSS) with dynamic frame reselection to automatically identify Onset and Apex frames from micro-expression sequences, then determines Offset frames. Uses two-branch structure with shared parameters for spatio-temporal feature extraction.
Result: Extensive experiments on seven micro-expression datasets show GAMDSS reduces subjective errors in multicultural datasets (SAMM, 4DME). Quantitative analysis confirms offset-frame annotations are more uncertain in multicultural datasets, supporting standardization of annotations.
Conclusion: GAMDSS provides theoretical justification for standardizing micro-expression annotations and can be integrated into existing models without parameter increase, offering new approach to enhance micro-expression recognition performance.
Abstract: Existing manual labeling of micro-expressions is subject to errors in accuracy, especially in cross-cultural scenarios where deviation in labeling of key frames is more prominent. To address this issue, this paper presents a novel Global Anti-Monotonic Differential Selection Strategy (GAMDSS) architecture for enhancing the effectiveness of spatio-temporal modeling of micro-expressions through keyframe re-selection. Specifically, the method identifies Onset and Apex frames, which are characterized by significant micro-expression variation, from complete micro-expression action sequences via a dynamic frame reselection mechanism. It then uses these to determine Offset frames and construct a rich spatio-temporal dynamic representation. A two-branch structure with shared parameters is then used to efficiently extract spatio-temporal features. Extensive experiments are conducted on seven widely recognized micro-expression datasets. The results demonstrate that GAMDSS effectively reduces subjective errors caused by human factors in multicultural datasets such as SAMM and 4DME. Furthermore, quantitative analyses confirm that offset-frame annotations in multicultural datasets are more uncertain, providing theoretical justification for standardizing micro-expression annotations. These findings directly support our argument for reconsidering the validity and generalizability of dataset annotation paradigms. Notably, this design can be integrated into existing models without increasing the number of parameters, offering a new approach to enhancing micro-expression recognition performance. The source code is available on GitHub[https://github.com/Cross-Innovation-Lab/GAMDSS].
[152] InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions
Zhenzhi Wang, Jiaqi Yang, Jianwen Jiang, Chao Liang, Gaojie Lin, Zerong Zheng, Ceyuan Yang, Yuan Zhang, Mingyuan Gao, Dahua Lin
Main category: cs.CV
TL;DR: A novel framework for multi-concept human animation with precise per-identity control using region-specific binding of multi-modal conditions (text, image, audio) to enable human-human and human-object interactions.
Details
Motivation: Existing methods only animate single subjects with global condition injection, ignoring multi-concept scenarios with rich interactions. This prevents precise per-identity control of multiple concepts including humans and objects, hindering practical applications.Method: Enforces strong region-specific binding of conditions from modalities to each identity’s spatiotemporal footprint. Uses a mask predictor to automatically infer layout by matching appearance cues between denoised video and reference images. Injects local audio conditions into corresponding regions iteratively for layout-aligned modality matching.
Result: Enables high-quality generation of human dialogue videos (2-3 people) and video customization from multiple reference images. Empirical results and ablation studies validate effectiveness of explicit layout control for multi-modal conditions compared to implicit counterparts and existing methods.
Conclusion: The framework successfully addresses limitations of single-entity assumptions by providing precise per-identity control through explicit layout binding, enabling realistic multi-concept human animation with rich interactions.
Abstract: End-to-end human animation with rich multi-modal conditions, e.g., text, image and audio has achieved remarkable advancements in recent years. However, most existing methods could only animate a single subject and inject conditions in a global manner, ignoring scenarios where multiple concepts could appear in the same video with rich human-human interactions and human-object interactions. Such a global assumption prevents precise and per-identity control of multiple concepts including humans and objects, therefore hinders applications. In this work, we discard the single-entity assumption and introduce a novel framework that enforces strong, region-specific binding of conditions from modalities to each identity’s spatiotemporal footprint. Given reference images of multiple concepts, our method could automatically infer layout information by leveraging a mask predictor to match appearance cues between the denoised video and each reference appearance. Furthermore, we inject local audio condition into its corresponding region to ensure layout-aligned modality matching in an iterative manner. This design enables the high-quality generation of human dialogue videos between two to three people or video customization from multiple reference images. Empirical results and ablation studies validate the effectiveness of our explicit layout control for multi-modal conditions compared to implicit counterparts and other existing methods. Video demos are available at https://zhenzhiwang.github.io/interacthuman/
[153] DSA-SRGS: Super-Resolution Gaussian Splatting for Dynamic Sparse-View DSA Reconstruction
Shiyu Zhang, Zhicong Wu, Huangxuan Zhao, Zhentao Liu, Lei Chen, Yong Luo, Lefei Zhang, Zhiming Cui, Ziwen Ke, Bo Du
Main category: cs.CV
TL;DR: DSA-SRGS: A super-resolution Gaussian splatting framework for dynamic sparse-view DSA reconstruction that integrates high-quality priors from a fine-tuned super-resolution model to recover fine-grained vascular details.
Details
Motivation: Current 3D vessel reconstruction methods from sparse dynamic DSA inputs are limited by input projection resolution, causing blurring and aliasing artifacts when upsampling, which prevents recovery of fine vascular details needed for precision diagnosis.Method: Proposes a Multi-Fidelity Texture Learning Module that integrates priors from a fine-tuned DSA-specific super-resolution model into 4D reconstruction optimization, with Confidence-Aware Strategy to weight supervision between low-res projections and high-res pseudo-labels, plus Radiative Sub-Pixel Densification for refining 4D radiative Gaussian kernels.
Result: Extensive experiments on two clinical DSA datasets show DSA-SRGS significantly outperforms state-of-the-art methods in both quantitative metrics and qualitative visual fidelity.
Conclusion: DSA-SRGS successfully addresses the super-resolution limitation in dynamic DSA reconstruction, enabling recovery of fine-grained vascular details for improved precision diagnosis and treatment.
Abstract: Digital subtraction angiography (DSA) is a key imaging technique for the auxiliary diagnosis and treatment of cerebrovascular diseases. Recent advancements in gaussian splatting and dynamic neural representations have enabled robust 3D vessel reconstruction from sparse dynamic inputs. However, these methods are fundamentally constrained by the resolution of input projections, where performing naive upsampling to enhance rendering resolution inevitably results in severe blurring and aliasing artifacts. Such lack of super-resolution capability prevents the reconstructed 4D models from recovering fine-grained vascular details and intricate branching structures, which restricts their application in precision diagnosis and treatment. To solve this problem, this paper proposes DSA-SRGS, the first super-resolution gaussian splatting framework for dynamic sparse-view DSA reconstruction. Specifically, we introduce a Multi-Fidelity Texture Learning Module that integrates high-quality priors from a fine-tuned DSA-specific super-resolution model, into the 4D reconstruction optimization. To mitigate potential hallucination artifacts from pseudo-labels, this module employs a Confidence-Aware Strategy to adaptively weight supervision signals between the original low-resolution projections and the generated high-resolution pseudo-labels. Furthermore, we develop Radiative Sub-Pixel Densification, an adaptive strategy that leverages gradient accumulation from high-resolution sub-pixel sampling to refine the 4D radiative gaussian kernels. Extensive experiments on two clinical DSA datasets demonstrate that DSA-SRGS significantly outperforms state-of-the-art methods in both quantitative metrics and qualitative visual fidelity.
[154] MADCrowner: Margin Aware Dental Crown Design with Template Deformation and Refinement
Linda Wei, Chang Liu, Wenran Zhang, Yuxuan Hu, Ruiyang Li, Feng Qi, Changyao Tian, Ke Wang, Yuanyuan Wang, Shaoting Zhang, Dimitris Metaxas, Hongsheng Li
Main category: cs.CV
TL;DR: A margin-aware mesh generation framework for automated dental crown design using intraoral scans, addressing spatial resolution, noise, and overextension issues in previous learning-based methods.
Details
Motivation: Current dental crown restoration requires extensive manual adjustments despite CAD systems. Existing learning-based methods for automated crown generation suffer from inadequate spatial resolution, noisy outputs, and overextension of surface reconstruction.Method: Proposed TotalFramework with CrownDeformR (deforms initial template to target crown using anatomical context from multi-scale intraoral scan encoder) and CrownSegger (margin segmentation network to extract cervical margin). Uses margin as constraint for deformation and boundary condition for postprocessing to remove overextended areas.
Result: Significantly outperformed existing approaches in both geometric accuracy and clinical feasibility on a large-scale intraoral scan dataset.
Conclusion: The proposed margin-aware mesh generation framework effectively addresses limitations of previous learning-based dental crown design methods and shows superior performance for clinical applications.
Abstract: Dental crown restoration is one of the most common treatment modalities for tooth defect, where personalized dental crown design is critical. While computer-aided design (CAD) systems have notably enhanced the efficiency of dental crown design, extensive manual adjustments are still required in the clinic workflow. Recent studies have explored the application of learning-based methods for the automated generation of restorative dental crowns. Nevertheless, these approaches were challenged by inadequate spatial resolution, noisy outputs, and overextension of surface reconstruction. To address these limitations, we propose \totalframework, a margin-aware mesh generation framework comprising CrownDeformR and CrownSegger. Inspired by the clinic manual workflow of dental crown design, we designed CrownDeformR to deform an initial template to the target crown based on anatomical context, which is extracted by a multi-scale intraoral scan encoder. Additionally, we introduced \marginseg, a novel margin segmentation network, to extract the cervical margin of the target tooth. The performance of CrownDeformR improved with the cervical margin as an extra constraint. And it was also utilized as the boundary condition for the tailored postprocessing method, which removed the overextended area of the reconstructed surface. We constructed a large-scale intraoral scan dataset and performed extensive experiments. The proposed method significantly outperformed existing approaches in both geometric accuracy and clinical feasibility.
[155] Privacy-Aware Camera 2.0 Technical Report
Huan Song, Shuyu Tian, Ting Long, Jiang Liu, Cheng Yuan, Zhenyu Jia, Jiawei Shao, Xuelong Li
Main category: cs.CV
TL;DR: A privacy-preserving visual perception framework using AI Flow paradigm and edge-cloud architecture that transforms raw images into abstract feature vectors at the edge, ensuring mathematical irreversibility while enabling behavior recognition and semantic reconstruction in the cloud.
Details
Motivation: Address the privacy-security paradox in sensitive environments (restrooms, locker rooms) where existing privacy-preserving approaches compromise semantic understanding or lack mathematically provable irreversibility, while previous Privacy Camera 1.0 provided only textual judgments leading to evidentiary blind spots.Method: Uses AI Flow paradigm with collaborative edge-cloud architecture. At the edge, a visual desensitizer transforms raw images into abstract feature vectors using nonlinear mapping and stochastic noise injection under Information Bottleneck principle. In the cloud, behavior recognition and semantic reconstruction are performed via “dynamic contour” visual language.
Result: Achieves mathematically provable irreversibility of original images while enabling illustrative visual reference without exposing raw data. Strips identity-sensitive information while maintaining semantic understanding capabilities.
Conclusion: Proposes a novel framework that balances perception and privacy by eliminating visual data at the source while enabling semantic understanding through abstract representations, addressing limitations of existing privacy-preserving approaches.
Abstract: With the increasing deployment of intelligent sensing technologies in highly sensitive environments such as restrooms and locker rooms, visual surveillance systems face a profound privacy-security paradox. Existing privacy-preserving approaches, including physical desensitization, encryption, and obfuscation, often compromise semantic understanding or fail to ensure mathematically provable irreversibility. Although Privacy Camera 1.0 eliminated visual data at the source to prevent leakage, it provided only textual judgments, leading to evidentiary blind spots in disputes. To address these limitations, this paper proposes a novel privacy-preserving perception framework based on the AI Flow paradigm and a collaborative edge-cloud architecture. By deploying a visual desensitizer at the edge, raw images are transformed in real time into abstract feature vectors through nonlinear mapping and stochastic noise injection under the Information Bottleneck principle, ensuring identity-sensitive information is stripped and original images are mathematically unreconstructable. The abstract representations are transmitted to the cloud for behavior recognition and semantic reconstruction via a “dynamic contour” visual language, achieving a critical balance between perception and privacy while enabling illustrative visual reference without exposing raw images.
[156] RMK RetinaNet: Rotated Multi-Kernel RetinaNet for Robust Oriented Object Detection in Remote Sensing Imagery
Huiran Sun
Main category: cs.CV
TL;DR: RMK RetinaNet improves rotated object detection in remote sensing by addressing receptive field limitations, feature fusion issues, and angle regression discontinuities through multi-scale kernel blocks, contextual attention, bottom-up paths, and Euler angle encoding.
Details
Motivation: Rotated object detection in remote sensing faces three major bottlenecks: 1) non-adaptive receptive field utilization, 2) inadequate long-range multi-scale feature fusion, and 3) discontinuities in angle regression, which limit detection performance in multi-scale and multi-orientation scenarios.Method: Proposes Rotated Multi-Kernel RetinaNet with four key components: 1) Multi-Scale Kernel Block for adaptive multi-scale feature extraction, 2) Multi-Directional Contextual Anchor Attention for cross-scale contextual modeling, 3) Bottom-up Path to preserve fine-grained spatial details, and 4) Euler Angle Encoding Module for continuous and stable angle regression.
Result: Extensive experiments on DOTA-v1.0, HRSC2016, and UCAS-AOD datasets show RMK RetinaNet achieves performance comparable to state-of-the-art rotated object detectors while improving robustness in multi-scale and multi-orientation scenarios.
Conclusion: RMK RetinaNet effectively addresses the three major bottlenecks in rotated object detection for remote sensing imagery through its novel architectural components, demonstrating improved performance and robustness across challenging datasets.
Abstract: Rotated object detection in remote sensing imagery is hindered by three major bottlenecks: non-adaptive receptive field utilization, inadequate long-range multi-scale feature fusion, and discontinuities in angle regression. To address these issues, we propose Rotated Multi-Kernel RetinaNet (RMK RetinaNet). First, we design a Multi-Scale Kernel (MSK) Block to strengthen adaptive multi-scale feature extraction. Second, we incorporate a Multi-Directional Contextual Anchor Attention (MDCAA) mechanism into the feature pyramid to enhance contextual modeling across scales and orientations. Third, we introduce a Bottom-up Path to preserve fine-grained spatial details that are often degraded during downsampling. Finally, we develop an Euler Angle Encoding Module (EAEM) to enable continuous and stable angle regression. Extensive experiments on DOTA-v1.0, HRSC2016, and UCAS-AOD show that RMK RetinaNet achieves performance comparable to state-of-the-art rotated object detectors while improving robustness in multi-scale and multi-orientation scenarios.
[157] LAW & ORDER: Adaptive Spatial Weighting for Medical Diffusion and Segmentation
Anugunj Naman, Ayushman Singh, Gaibo Zhang, Yaguang Zhang
Main category: cs.CV
TL;DR: A framework with two network adapters (LAW and ORDER) addresses spatial imbalance in medical image segmentation and synthesis by learning adaptive spatial weighting for diffusion models and efficient region detection for segmentation.
Details
Motivation: Medical image analysis faces spatial imbalance where lesions occupy small regions against vast backgrounds, causing diffusion models to drift from prescribed lesion layouts and efficient segmenters to struggle with spatially uncertain regions.Method: Introduces two network adapters: 1) Learnable Adaptive Weighter (LAW) predicts per-pixel loss modulation from features and masks for diffusion training, stabilized via normalization, clamping, and regularization; 2) Optimal Region Detection with Efficient Resolution (ORDER) applies selective bidirectional skip attention at late decoder stages for efficient segmentation.
Result: LAW achieves 20% FID generative improvement over uniform baseline (52.28 vs. 65.60), with synthetic data improving downstream segmentation by 4.9% Dice coefficient (83.2% vs. 78.3%). ORDER reaches 6.0% Dice improvement on MK-UNet (81.3% vs. 75.3%) with 0.56 GFLOPs and 42K parameters, remaining 730x smaller than nnUNet.
Conclusion: The proposed adaptive spatial weighting framework effectively addresses spatial imbalance in medical image analysis, improving both controllable synthesis via diffusion models and efficient segmentation through targeted computational allocation.
Abstract: Medical image analysis relies on accurate segmentation, and benefits from controllable synthesis (of new training images). Yet both tasks of the cyclical pipeline face spatial imbalance: lesions occupy small regions against vast backgrounds. In particular, diffusion models have been shown to drift from prescribed lesion layouts, while efficient segmenters struggle on spatially uncertain regions. Adaptive spatial weighting addresses this by learning where to allocate computational resources. This paper introduces a pair of network adapters: 1) Learnable Adaptive Weighter (LAW) which predicts per-pixel loss modulation from features and masks for diffusion training, stabilized via a mix of normalization, clamping, and regularization to prevent degenerate solutions; and 2) Optimal Region Detection with Efficient Resolution (ORDER) which applies selective bidirectional skip attention at late decoder stages for efficient segmentation. Experiments on polyp and kidney tumor datasets demonstrate that LAW achieves 20% FID generative improvement over a uniform baseline (52.28 vs. 65.60), with synthetic data then improving downstream segmentation by 4.9% Dice coefficient (83.2% vs. 78.3%). ORDER reaches 6.0% Dice improvement on MK-UNet (81.3% vs. 75.3%) with 0.56 GFLOPs and just 42K parameters, remaining 730x smaller than the standard nnUNet.
[158] Comparative Evaluation of Traditional Methods and Deep Learning for Brain Glioma Imaging. Review Paper
Kiranmayee Janardhan, Vinay Martin DSa Prabhu, T. Christy Bobby
Main category: cs.CV
TL;DR: Review paper evaluating segmentation and classification techniques for brain gliomas using MRI, finding CNN architectures outperform traditional methods
Details
Motivation: Brain glioma segmentation and classification are crucial for treatment planning and monitoring, but accurate and reproducible segmentation is challenging. The paper aims to evaluate effective techniques for these tasks.Method: Review paper analyzing both fully automatic and semi-automatic segmentation and classification techniques for brain gliomas from MRI data, comparing traditional methods with deep learning approaches.
Result: Convolutional neural network architectures outperform traditional techniques in both segmentation and classification tasks for brain gliomas.
Conclusion: CNN-based methods show superior performance for brain glioma segmentation and classification, with semi-automatic techniques often preferred by radiologists due to accuracy requirements.
Abstract: Segmentation is crucial for brain gliomas as it delineates the glioma s extent and location, aiding in precise treatment planning and monitoring, thus improving patient outcomes. Accurate segmentation ensures proper identification of the glioma s size and position, transforming images into applicable data for analysis. Classification of brain gliomas is also essential because different types require different treatment approaches. Accurately classifying brain gliomas by size, location, and aggressiveness is essential for personalized prognosis prediction, follow-up care, and monitoring disease progression, ensuring effective diagnosis, treatment, and management. In glioma research, irregular tissues are often observable, but error free and reproducible segmentation is challenging. Many researchers have surveyed brain glioma segmentation, proposing both fully automatic and semi-automatic techniques. The adoption of these methods by radiologists depends on ease of use and supervision, with semi-automatic techniques preferred due to the need for accurate evaluations. This review evaluates effective segmentation and classification techniques post magnetic resonance imaging acquisition, highlighting that convolutional neural network architectures outperform traditional techniques in these tasks.
[159] MASQuant: Modality-Aware Smoothing Quantization for Multimodal Large Language Models
Lulu Hu, Wenhu Xiao, Xin Chen, Xinhua Xu, Bowen Xu, Kun Li, Yongliang Tao
Main category: cs.CV
TL;DR: MASQuant is a post-training quantization method for Multimodal Large Language Models that addresses modality-specific challenges through separate smoothing factors and cross-modal compensation techniques.
Details
Motivation: Existing post-training quantization methods for LLMs don't work well for MLLMs due to modality-specific issues like smoothing misalignment and cross-modal computational invariance challenges.Method: Proposes Modality-Aware Smoothing Quantization (MASQuant) with two key components: (1) Modality-Aware Smoothing (MAS) that learns separate smoothing factors for each modality, and (2) Cross-Modal Compensation (CMC) that uses SVD whitening to transform multi-modal activation differences into low-rank forms for unified quantization.
Result: MASQuant demonstrates stable quantization performance across both dual-modal and tri-modal MLLMs and is competitive with state-of-the-art PTQ algorithms.
Conclusion: MASQuant effectively addresses modality-specific quantization challenges in MLLMs through specialized smoothing and compensation techniques, enabling efficient deployment of multimodal models.
Abstract: Post-training quantization (PTQ) with computational invariance for Large Language Models~(LLMs) have demonstrated remarkable advances, however, their application to Multimodal Large Language Models~(MLLMs) presents substantial challenges. In this paper, we analyze SmoothQuant as a case study and identify two critical issues: Smoothing Misalignment and Cross-Modal Computational Invariance. To address these issues, we propose Modality-Aware Smoothing Quantization (MASQuant), a novel framework that introduces (1) Modality-Aware Smoothing (MAS), which learns separate, modality-specific smoothing factors to prevent Smoothing Misalignment, and (2) Cross-Modal Compensation (CMC), which addresses Cross-modal Computational Invariance by using SVD whitening to transform multi-modal activation differences into low-rank forms, enabling unified quantization across modalities. MASQuant demonstrates stable quantization performance across both dual-modal and tri-modal MLLMs. Experimental results show that MASQuant is competitive among the state-of-the-art PTQ algorithms. Source code: https://github.com/alibaba/EfficientAI.
[160] Guiding Diffusion-based Reconstruction with Contrastive Signals for Balanced Visual Representation
Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Ruochen Cui, Xilin Zhao, Qingming Huang
Main category: cs.CV
TL;DR: DCR (Diffusion Contrastive Reconstruction) enhances CLIP’s visual encoder by integrating contrastive learning into diffusion-based image reconstruction to improve both discriminative and perceptual abilities.
Details
Motivation: CLIP's visual encoder has limited understanding capacity, including both discriminative ability (class separability) and detail perceptual ability (fine-grained cues). Existing diffusion-based enhancement methods may compromise discriminative ability, so a more balanced approach is needed.Method: Proposes Diffusion Contrastive Reconstruction (DCR) that injects contrastive signals derived from each reconstructed image (not original input) into the diffusion process, unifying the learning objective to avoid gradient conflicts and jointly optimize both abilities.
Result: Extensive experiments across various benchmarks and multi-modal large language models validate DCR’s effectiveness in enhancing CLIP’s visual representations.
Conclusion: DCR successfully addresses CLIP’s representation limitations by balancing discriminative and perceptual abilities through unified diffusion-contrastive learning, with theoretical analysis supporting the joint optimization.
Abstract: The limited understanding capacity of the visual encoder in Contrastive Language-Image Pre-training (CLIP) has become a key bottleneck for downstream performance. This capacity includes both Discriminative Ability (D-Ability), which reflects class separability, and Detail Perceptual Ability (P-Ability), which focuses on fine-grained visual cues. Recent solutions use diffusion models to enhance representations by conditioning image reconstruction on CLIP visual tokens. We argue that such paradigms may compromise D-Ability and therefore fail to effectively address CLIP’s representation limitations. To address this, we integrate contrastive signals into diffusion-based reconstruction to pursue more comprehensive visual representations. We begin with a straightforward design that augments the diffusion process with contrastive learning on input images. However, empirical results show that the naive combination suffers from gradient conflict and yields suboptimal performance. To balance the optimization, we introduce the Diffusion Contrastive Reconstruction (DCR), which unifies the learning objective. The key idea is to inject contrastive signals derived from each reconstructed image, rather than from the original input, into the diffusion process. Our theoretical analysis shows that the DCR loss can jointly optimize D-Ability and P-Ability. Extensive experiments across various benchmarks and multi-modal large language models validate the effectiveness of our method. The code is available at https://github.com/boyuh/DCR.
[161] Meta-D: Metadata-Aware Architectures for Brain Tumor Analysis and Missing-Modality Segmentation
SangHyuk Kim, Daniel Haehn, Sumientra Rampersad
Main category: cs.CV
TL;DR: Meta-D is an architecture that uses MRI scanner metadata (sequence type and plane orientation) to improve brain tumor analysis by guiding feature extraction and handling missing modalities.
Details
Motivation: Medical image deep learning pipelines can benefit from integrating explicit scanner metadata to stabilize feature representations, especially when dealing with missing modalities in clinical settings.Method: The architecture leverages categorical metadata (MRI sequence and plane orientation) to dynamically modulate convolutional features for 2D tumor detection. For 3D missing-modality segmentation, a Transformer Maximizer uses metadata-based cross-attention to isolate and route available modalities, focusing only on valid slices.
Result: For 2D tumor detection, metadata injection improved F1-score by up to 2.62% over image-only baselines. For 3D missing-modality segmentation, it improved Dice scores by up to 5.12% under extreme modality scarcity while reducing model parameters by 24.1%.
Conclusion: Explicit metadata integration significantly improves medical image analysis performance, particularly for handling missing modalities, making deep learning models more robust and efficient in clinical applications.
Abstract: We present Meta-D, an architecture that explicitly leverages categorical scanner metadata such as MRI sequence and plane orientation to guide feature extraction for brain tumor analysis. We aim to improve the performance of medical image deep learning pipelines by integrating explicit metadata to stabilize feature representations. We first evaluate this in 2D tumor detection, where injecting sequence (e.g., T1, T2) and plane (e.g., axial) metadata dynamically modulates convolutional features, yielding an absolute increase of up to 2.62% in F1-score over image-only baselines. Because metadata grounds feature extraction when data are available, we hypothesize it can serve as a robust anchor when data are missing. We apply this to 3D missing-modality tumor segmentation. Our Transformer Maximizer utilizes metadata-based cross-attention to isolate and route available modalities, ensuring the network focuses on valid slices. This targeted attention improves brain tumor segmentation Dice scores by up to 5.12% under extreme modality scarcity while reducing model parameters by 24.1%.
[162] Revisiting Shape from Polarization in the Era of Vision Foundation Models
Chenhao Li, Taishi Ono, Takeshi Uemori, Yusuke Moriuchi
Main category: cs.CV
TL;DR: Lightweight model using polarization cues outperforms RGB-only vision foundation models in surface normal estimation with 33x less training data or 8x smaller model size.
Details
Motivation: Despite the strong physical relationship between polarization and surface geometry, existing Shape from Polarization (SfP) methods underperform compared to RGB-only vision foundation models trained on massive datasets. The authors argue this is due to domain gaps in synthetic data and sensor noise, not the polarization modality itself.Method: 1) Created high-quality polarization dataset using 1,954 3D-scanned real-world objects; 2) Incorporated pretrained DINOv3 priors for better generalization; 3) Introduced polarization sensor-aware data augmentation to model real-world noise; 4) Trained lightweight model with only 40K scenes.
Result: The method significantly outperforms both state-of-the-art SfP approaches and RGB-only VFMs. Polarization cues enable 33x reduction in training data or 8x reduction in model parameters while achieving better performance than RGB-only counterparts.
Conclusion: Polarization cues remain valuable for surface normal estimation despite specialized hardware requirements and limited training data. Proper dataset construction and noise modeling can unlock their potential, enabling efficient models that outperform data-hungry RGB-only approaches.
Abstract: We show that, with polarization cues, a lightweight model trained on a small dataset can outperform RGB-only vision foundation models (VFMs) in single-shot object-level surface normal estimation. Shape from polarization (SfP) has long been studied due to the strong physical relationship between polarization and surface geometry. Meanwhile, driven by scaling laws, RGB-only VFMs trained on large datasets have recently achieved impressive performance and surpassed existing SfP methods. This situation raises questions about the necessity of polarization cues, which require specialized hardware and have limited training data. We argue that the weaker performance of prior SfP methods does not come from the polarization modality itself, but from domain gaps. These domain gaps mainly arise from two sources. First, existing synthetic datasets use limited and unrealistic 3D objects, with simple geometry and random texture maps that do not match the underlying shapes. Second, real-world polarization signals are often affected by sensor noise, which is not well modeled during training. To address the first issue, we render a high-quality polarization dataset using 1,954 3D-scanned real-world objects. We further incorporate pretrained DINOv3 priors to improve generalization to unseen objects. To address the second issue, we introduce polarization sensor-aware data augmentation that better reflects real-world conditions. With only 40K training scenes, our method significantly outperforms both state-of-the-art SfP approaches and RGB-only VFMs. Extensive experiments show that polarization cues enable a 33x reduction in training data or an 8x reduction in model parameters, while still achieving better performance than RGB-only counterparts.
[163] Mitigating Instance Entanglement in Instance-Dependent Partial Label Learning
Rui Zhao, Bin Shi, Kai Sun, Bo Dong
Main category: cs.CV
TL;DR: CAD framework addresses instance entanglement in instance-dependent partial label learning through class-specific augmentation and weighted penalty loss for intra- and inter-class regulation.
Details
Motivation: Partial label learning faces instance entanglement where similar classes share overlapping features and candidate labels, causing class confusion in real-world instance-dependent PLL scenarios.Method: Proposes Class-specific Augmentation based Disentanglement (CAD) framework with intra-class regulation (amplifying class-specific features and aligning same-class augmentations) and inter-class regulation (weighted penalty loss applying stronger penalties to ambiguous labels).
Result: Extensive experiments demonstrate CAD effectively mitigates entanglement problem and enhances ID-PLL performance.
Conclusion: CAD framework successfully addresses instance entanglement in ID-PLL through joint intra- and inter-class regulations, improving class boundary clarity and reducing confusion.
Abstract: Partial label learning is a prominent weakly supervised classification task, where each training instance is ambiguously labeled with a set of candidate labels. In real-world scenarios, candidate labels are often influenced by instance features, leading to the emergence of instance-dependent PLL (ID-PLL), a setting that more accurately reflects this relationship. A significant challenge in ID-PLL is instance entanglement, where instances from similar classes share overlapping features and candidate labels, resulting in increased class confusion. To address this issue, we propose a novel Class-specific Augmentation based Disentanglement (CAD) framework, which tackles instance entanglement by both intra- and inter-class regulations. For intra-class regulation, CAD amplifies class-specific features to generate class-wise augmentations and aligns same-class augmentations across instances. For inter-class regulation, CAD introduces a weighted penalty loss function that applies stronger penalties to more ambiguous labels, encouraging larger inter-class distances. By jointly applying intra- and inter-class regulations, CAD improves the clarity of class boundaries and reduces class confusion caused by entanglement. Extensive experimental results demonstrate the effectiveness of CAD in mitigating the entanglement problem and enhancing ID-PLL performance. The code is available at https://github.com/RyanZhaoIc/CAD.git.
[164] Towards Highly Transferable Vision-Language Attack via Semantic-Augmented Dynamic Contrastive Interaction
Yuanbo Li, Tianyang Xu, Cong Hu, Tao Zhou, Xiao-Jun Wu, Josef Kittler
Main category: cs.CV
TL;DR: SADCA is an adversarial attack method for vision-language models that enhances transferability through dynamic contrastive learning and semantic augmentation, disrupting cross-modal alignment more effectively than static attacks.
Details
Motivation: Existing adversarial attacks on vision-language models rely on static cross-modal interactions and only disrupt positive image-text pairs, resulting in limited cross-modal disruption and poor transferability to different models and tasks.Method: Proposes Semantic-Augmented Dynamic Contrastive Attack (SADCA) that: 1) progressively disrupts cross-modal alignment through dynamic interactions between adversarial images and texts, 2) establishes a contrastive learning mechanism involving adversarial, positive and negative samples to reinforce semantic inconsistency, and 3) incorporates semantic augmentation through input transformations to increase diversity and generalization of adversarial examples.
Result: Extensive experiments on multiple datasets and models demonstrate that SADCA significantly improves adversarial transferability and consistently surpasses state-of-the-art methods.
Conclusion: SADCA effectively addresses the limitations of existing attacks on vision-language models by enhancing adversarial transferability through progressive, semantically-guided perturbation and dynamic contrastive learning.
Abstract: With the rapid advancement and widespread application of vision-language pre-training (VLP) models, their vulnerability to adversarial attacks has become a critical concern. In general, the adversarial examples can typically be designed to exhibit transferable power, attacking not only different models but also across diverse tasks. However, existing attacks on language-vision models mainly rely on static cross-modal interactions and focus solely on disrupting positive image-text pairs, resulting in limited cross-modal disruption and poor transferability. To address this issue, we propose a Semantic-Augmented Dynamic Contrastive Attack (SADCA) that enhances adversarial transferability through progressive and semantically guided perturbation. SADCA progressively disrupts cross-modal alignment through dynamic interactions between adversarial images and texts. This is accomplished by SADCA establishing a contrastive learning mechanism involving adversarial, positive and negative samples, to reinforce the semantic inconsistency of the obtained perturbations. Moreover, we empirically find that input transformations commonly used in traditional transfer-based attacks also benefit VLPs, which motivates a semantic augmentation module that increases the diversity and generalization of adversarial examples. Extensive experiments on multiple datasets and models demonstrate that SADCA significantly improves adversarial transferability and consistently surpasses state-of-the-art methods. The code is released at https://github.com/LiYuanBoJNU/SADCA.
[165] Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models
Yuanbo Li, Tianyang Xu, Cong Hu, Tao Zhou, Xiao-Jun Wu, Josef Kittler
Main category: cs.CV
TL;DR: MPCAttack: A multi-paradigm collaborative attack framework that improves adversarial transferability against multimodal LLMs by aggregating visual and textual features for joint optimization.
Details
Motivation: Existing adversarial attacks against MLLMs rely on surrogate models trained within single learning paradigms, limiting feature representation richness and adversarial perturbation diversity, which restricts transferability.Method: Proposes Multi-Paradigm Collaborative Attack (MPCAttack) that aggregates semantic representations from both visual images and language texts, using Multi-Paradigm Collaborative Optimisation (MPCO) strategy to perform contrastive matching on multi-paradigm features and adaptively balance different paradigm representations.
Result: Extensive experiments on multiple benchmarks show MPCAttack consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs.
Conclusion: MPCAttack effectively addresses representation bias in adversarial attacks against MLLMs by leveraging multi-paradigm feature aggregation and collaborative optimization, significantly improving transferability of adversarial examples.
Abstract: The rapid progress of Multi-Modal Large Language Models (MLLMs) has significantly advanced downstream applications. However, this progress also exposes serious transferable adversarial vulnerabilities. In general, existing adversarial attacks against MLLMs typically rely on surrogate models trained within a single learning paradigm and perform independent optimisation in their respective feature spaces. This straightforward setting naturally restricts the richness of feature representations, delivering limits on the search space and thus impeding the diversity of adversarial perturbations. To address this, we propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs. In principle, MPCAttack aggregates semantic representations, from both visual images and language texts, to facilitate joint adversarial optimisation on the aggregated features through a Multi-Paradigm Collaborative Optimisation (MPCO) strategy. By performing contrastive matching on multi-paradigm features, MPCO adaptively balances the importance of different paradigm representations and guides the global perturbation optimisation, effectively alleviating the representation bias. Extensive experimental results on multiple benchmarks demonstrate the superiority of MPCAttack, indicating that our solution consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs. The code is released at https://github.com/LiYuanBoJNU/MPCAttack.
[166] GloSplat: Joint Pose-Appearance Optimization for Faster and More Accurate 3D Reconstruction
Tianyu Xiong, Rui Li, Linjie Li, Jiaqi Yang
Main category: cs.CV
TL;DR: GloSplat is a framework for joint pose-appearance optimization in 3D Gaussian Splatting that preserves explicit SfM feature tracks throughout training, enabling both photometric and geometric supervision.
Details
Motivation: Traditional approaches treat feature extraction, matching, SfM, and novel view synthesis as separate problems with independent optimization objectives. The authors aim to create a unified framework that performs joint optimization while maintaining explicit geometric constraints.Method: GloSplat preserves explicit SfM feature tracks as first-class entities throughout 3D Gaussian Splatting training. Track 3D points are maintained as separate optimizable parameters from Gaussian primitives, using a reprojection loss alongside photometric supervision. Two variants: GloSplat-F (COLMAP-free with retrieval-based pair selection) and GloSplat-A (exhaustive matching for maximum quality).
Result: GloSplat-F achieves state-of-the-art among COLMAP-free methods, while GloSplat-A surpasses all COLMAP-based baselines. The method prevents early-stage pose drift and enables fine-grained refinement not possible with photometric-only approaches.
Conclusion: Joint optimization with explicit geometric constraints improves 3D reconstruction quality by combining photometric and geometric supervision, offering both efficiency (COLMAP-free variant) and maximum quality (exhaustive variant).
Abstract: Feature extraction, matching, structure from motion (SfM), and novel view synthesis (NVS) have traditionally been treated as separate problems with independent optimization objectives. We present GloSplat, a framework that performs \emph{joint pose-appearance optimization} during 3D Gaussian Splatting training. Unlike prior joint optimization methods (BARF, NeRF–, 3RGS) that rely purely on photometric gradients for pose refinement, GloSplat preserves \emph{explicit SfM feature tracks} as first-class entities throughout training: track 3D points are maintained as separate optimizable parameters from Gaussian primitives, providing persistent geometric anchors via a reprojection loss that operates alongside photometric supervision. This architectural choice prevents early-stage pose drift while enabling fine-grained refinement – a capability absent in photometric-only approaches. We introduce two pipeline variants: (1) \textbf{GloSplat-F}, a COLMAP-free variant using retrieval-based pair selection for efficient reconstruction, and (2) \textbf{GloSplat-A}, an exhaustive matching variant for maximum quality. Both employ global SfM initialization followed by joint photometric-geometric optimization during 3DGS training. Experiments demonstrate that GloSplat-F achieves state-of-the-art among COLMAP-free methods while GloSplat-A surpasses all COLMAP-based baselines.
[167] Scalable Injury-Risk Screening in Baseball Pitching From Broadcast Video
Jerrin Bright, Justin Mende, John Zelek
Main category: cs.CV
TL;DR: Monocular video pipeline recovers 18 biomechanical metrics from broadcast footage for injury prediction in pitching, achieving sub-degree accuracy on most metrics and good AUC for injury prediction.
Details
Motivation: Current gold-standard biomechanical measurements require expensive multi-camera systems only available in professional stadiums, limiting scalable injury prediction. There's a need for accessible alternatives using widely available broadcast video.Method: Built on DreamPose3D with drift-controlled global lifting module for pelvis trajectory recovery, plus kinematics refinement pipeline with bone-length constraints, joint-limited inverse kinematics, smoothing, and symmetry constraints to handle motion blur and extreme poses.
Result: 16/18 metrics achieve sub-degree agreement (MAE < 1°). Injury prediction models achieve AUC 0.811 for Tommy John surgery and 0.825 for significant arm injuries on 7,348 pitchers.
Conclusion: Monocular broadcast video is a viable alternative to stadium-scale motion capture for biomechanics, enabling scalable injury-risk screening from widely available footage.
Abstract: Injury prediction in pitching depends on precise biomechanical signals, yet gold-standard measurements come from expensive, stadium-installed multi-camera systems that are unavailable outside professional venues. We present a monocular video pipeline that recovers 18 clinically relevant biomechanics metrics from broadcast footage, positioning pose-derived kinematics as a scalable source for injury-risk modeling. Built on DreamPose3D, our approach introduces a drift-controlled global lifting module that recovers pelvis trajectory via velocity-based parameterization and sliding-window inference, lifting pelvis-rooted poses into global space. To address motion blur, compression artifacts, and extreme pitching poses, we incorporate a kinematics refinement pipeline with bone-length constraints, joint-limited inverse kinematics, smoothing, and symmetry constraints to ensure temporally stable and physically plausible kinematics. On 13 professional pitchers (156 paired pitches), 16/18 metrics achieve sub-degree agreement (MAE $< 1^{\circ}$). Using these metrics for injury prediction, an automated screening model achieves AUC 0.811 for Tommy John surgery and 0.825 for significant arm injuries on 7,348 pitchers. The resulting pose-derived metrics support scalable injury-risk screening, establishing monocular broadcast video as a viable alternative to stadium-scale motion capture for biomechanics.
[168] SURE: Semi-dense Uncertainty-REfined Feature Matching
Sicheng Li, Zaiwang Gu, Jie Zhang, Qing Guo, Xudong Jiang, Jun Cheng
Main category: cs.CV
TL;DR: SURE is a semi-dense uncertainty-refined matching framework that jointly predicts image correspondences and their confidence by modeling both aleatoric and epistemic uncertainties, outperforming state-of-the-art methods in challenging scenarios.
Details
Motivation: Existing image correspondence methods struggle in challenging scenarios with large viewpoint changes or textureless regions, often producing incorrect matches with high confidence scores. This is because conventional models rely solely on feature similarity without explicit reliability estimation, leading to overconfident errors that can negatively impact robotic vision applications.Method: SURE introduces a novel evidential head for trustworthy coordinate regression that models both aleatoric (data-dependent) and epistemic (model-dependent) uncertainties. It also includes a lightweight spatial fusion module that enhances local feature precision with minimal computational overhead, enabling joint prediction of correspondences and their confidence scores.
Result: The method was evaluated on multiple standard benchmarks and consistently outperformed existing state-of-the-art semi-dense matching models in both accuracy and efficiency, demonstrating improved performance in challenging scenarios with large viewpoint changes and textureless regions.
Conclusion: SURE provides a robust solution for establishing reliable image correspondences by explicitly modeling uncertainty, which is crucial for robotic vision applications where overconfident errors can have significant consequences. The framework’s ability to jointly predict matches and confidence scores represents an important advancement in computer vision.
Abstract: Establishing reliable image correspondences is essential for many robotic vision problems. However, existing methods often struggle in challenging scenarios with large viewpoint changes or textureless regions, where incorrect cor- respondences may still receive high similarity scores. This is mainly because conventional models rely solely on fea- ture similarity, lacking an explicit mechanism to estimate the reliability of predicted matches, leading to overconfident errors. To address this issue, we propose SURE, a Semi- dense Uncertainty-REfined matching framework that jointly predicts correspondences and their confidence by modeling both aleatoric and epistemic uncertainties. Our approach in- troduces a novel evidential head for trustworthy coordinate regression, along with a lightweight spatial fusion module that enhances local feature precision with minimal overhead. We evaluated our method on multiple standard benchmarks, where it consistently outperforms existing state-of-the-art semi-dense matching models in both accuracy and efficiency. our code will be available on https://github.com/LSC-ALAN/SURE.
[169] Diffusion-Based sRGB Real Noise Generation via Prompt-Driven Noise Representation Learning
Jaekyun Ko, Dongjin Kim, Soomin Lee, Guanghui Wang, Tae Hyun Kim
Main category: cs.CV
TL;DR: PNG framework generates realistic noisy images without camera metadata dependency by learning high-dimensional prompt features from input noise, improving noise synthesis generalizability for real-world denoising applications.
Details
Motivation: Real-world denoising faces challenges due to noise variability in sRGB space and scarcity of real noisy-clean image pairs. Existing generative methods rely on camera metadata, limiting usability when metadata is unavailable or inconsistent across devices.Method: Proposes Prompt-Driven Noise Generation (PNG) framework that learns high-dimensional prompt features capturing real-world noise characteristics, enabling synthesis of realistic noisy images without explicit camera metadata dependency.
Result: Comprehensive experiments show PNG effectively produces realistic noisy images and successfully applies them for real-world noise removal across various benchmark datasets, demonstrating improved generalizability.
Conclusion: PNG eliminates camera metadata dependency for noise synthesis, enhancing generalizability and applicability for real-world denoising tasks while maintaining realistic noise generation quality.
Abstract: Denoising in the sRGB image space is challenging due to noise variability. Although end-to-end methods perform well, their effectiveness in real-world scenarios is limited by the scarcity of real noisy-clean image pairs, which are expensive and difficult to collect. To address this limitation, several generative methods have been developed to synthesize realistic noisy images from limited data. These generative approaches often rely on camera metadata during both training and testing to synthesize real-world noise. However, the lack of metadata or inconsistencies between devices restricts their usability. Therefore, we propose a novel framework called Prompt-Driven Noise Generation (PNG). This model is capable of acquiring high-dimensional prompt features that capture the characteristics of real-world input noise and creating a variety of realistic noisy images consistent with the distribution of the input noise. By eliminating the dependency on explicit camera metadata, our approach significantly enhances the generalizability and applicability of noise synthesis. Comprehensive experiments reveal that our model effectively produces realistic noisy images and show the successful application of these generated images in removing real-world noise across various benchmark datasets.
[170] Interpretable Pre-Release Baseball Pitch Type Anticipation from Broadcast 3D Kinematics
Jerrin Bright, Michelle Lu, John Zelek
Main category: cs.CV
TL;DR: Classifying baseball pitch types from 3D body pose sequences using biomechanical features, achieving 80.4% accuracy on professional pitch data and analyzing kinematic vs. ball-flight information.
Details
Motivation: To understand how much information about upcoming pitch types can be extracted from a pitcher's body kinematics alone, without access to ball-flight data, and to establish the limits of pose-based prediction.Method: Pipeline combining diffusion-based 3D pose estimation, automatic pitching-event detection, biomechanical feature extraction (229 kinematic features), and gradient-boosted classification on 119,561 professional pitches.
Result: Achieved 80.4% accuracy classifying 8 pitch types using body kinematics alone. Upper-body mechanics contributed 64.9% of predictive signal vs. 35.1% for lower body. Wrist position and trunk lateral tilt were most informative features. Grip-defined variants (four-seam vs. two-seam fastball) were not separable from pose.
Conclusion: Body kinematics provide substantial but limited information about pitch types, with an empirical ceiling near 80% accuracy. This delineates where kinematic information ends and ball-flight information begins for pitch prediction.
Abstract: How much can a pitcher’s body reveal about the upcoming pitch? We study this question at scale by classifying eight pitch types from monocular 3D pose sequences, without access to ball-flight data. Our pipeline chains a diffusion-based 3D pose backbone with automatic pitching-event detection, groundtruth-validated biomechanical feature extraction, and gradient-boosted classification over 229 kinematic features. Evaluated on 119,561 professional pitches, the largest such benchmark to date, we achieve 80.4% accuracy using body kinematics alone. A systematic importance analysis reveals that upper-body mechanics contribute 64.9% of the predictive signal versus 35.1% for the lower body, with wrist position (14.8%) and trunk lateral tilt emerging as the most informative joint group and biomechanical feature, respectively. We further show that grip-defined variants (four-seam vs.\ two-seam fastball) are not separable from pose, establishing an empirical ceiling near 80% and delineating where kinematic information ends and ball-flight information begins.
[171] Structure Observation Driven Image-Text Contrastive Learning for Computed Tomography Report Generation
Hong Liu, Dong Wei, Qiong Peng, Yawen Huang, Xian Wu, Yefeng Zheng, Liansheng Wang
Main category: cs.CV
TL;DR: Two-stage framework for CT report generation using structure-wise image-text contrastive learning and visual queries to learn anatomical correspondences.
Details
Motivation: CT report generation is challenging due to large data volumes and intricate details needed compared to X-rays. Existing methods may be limited, so a specialized approach is needed to automate clinical radiology reporting and reduce workload.Method: Two-stage framework: 1) Structure-learning with learnable visual queries for anatomical structures, structure-wise image-text contrastive loss, soft pseudo targets for false negatives, and dynamic negative queue. 2) Report-learning with frozen visual queries selecting critical image patches and a text decoder for generation.
Result: Establishes new state-of-the-art performance on two public datasets for CT report generation in clinical efficiency, with effective components validated through experiments.
Conclusion: The proposed framework effectively addresses CT report generation challenges by learning structure-level semantic correspondences and minimizing distractions from irrelevant areas, demonstrating superior clinical efficiency.
Abstract: Computed Tomography Report Generation (CTRG) aims to automate the clinical radiology reporting process, thereby reducing the workload of report writing and facilitating patient care. While deep learning approaches have achieved remarkable advances in X-ray report generation, their effectiveness may be limited in CTRG due to larger data volumes of CT images and more intricate details required to describe them. This work introduces a novel two-stage (structure- and report-learning) framework tailored for CTRG featuring effective structure-wise image-text contrasting. In the first stage, a set of learnable structure-specific visual queries observe corresponding structures in a CT image. The resulting observation tokens are contrasted with structure-specific textual features extracted from the accompanying radiology report with a structure-wise image-text contrastive loss. In addition, text-text similarity-based soft pseudo targets are proposed to mitigate the impact of false negatives, i.e., semantically identical image structures and texts from non-paired images and reports. Thus, the model learns structure-level semantic correspondences between CT images and reports. Further, a dynamic, diversity-enhanced negative queue is proposed to guide the network in learning to discriminate various abnormalities. In the second stage, the visual structure queries are frozen and used to select the critical image patch embeddings depicting each anatomical structure, minimizing distractions from irrelevant areas while reducing memory consumption. Also, a text decoder is added and trained for report generation.Our extensive experiments on two public datasets demonstrate that our framework establishes new state-of-the-art performance for CTRG in clinical efficiency, and its components are effective.
[172] Federated Modality-specific Encoders and Partially Personalized Fusion Decoder for Multimodal Brain Tumor Segmentation
Hong Liu, Dong Wei, Qian Dai, Xian Wu, Yefeng Zheng, Liansheng Wang
Main category: cs.CV
TL;DR: FedMEPD: A federated learning framework with modality-specific encoders and partially personalized decoders for handling multimodal medical image analysis with incomplete modalities across clients.
Details
Motivation: Existing FL methods for medical imaging only handle intramodal heterogeneity, but real-world scenarios involve clients with incomplete modality sets (intermodal heterogeneity). Additionally, clients need personalized models tailored to their local data characteristics.Method: FedMEPD uses federated modality-specific encoders for each imaging modality and partially personalized multimodal fusion decoders. The server has full-modal data and uses a fusion decoder to optimize encoders via backpropagation. Clients with incomplete modalities calibrate missing-modal representations using global anchors via cross-attention.
Result: Outperforms various state-of-the-art methods for multimodal and personalized FL on BraTS 2018 and 2020 multimodal brain tumor segmentation benchmarks, with novel designs shown to be effective.
Conclusion: FedMEPD successfully addresses both intermodal heterogeneity and personalization needs in multimodal FL for medical imaging, providing a practical solution for real-world scenarios with incomplete modality data across clients.
Abstract: Most existing federated learning (FL) methods for medical image analysis only considered intramodal heterogeneity, limiting their applicability to multimodal imaging applications. In practice, some FL participants may possess only a subset of the complete imaging modalities, posing intermodal heterogeneity as a challenge to effectively training a global model on all participants’ data. Meanwhile, each participant expects a personalized model tailored to its local data characteristics in FL. This work proposes a new FL framework with federated modality-specific encoders and partially personalized multimodal fusion decoders (FedMEPD) to address the two concurrent issues. Specifically, FedMEPD employs an exclusive encoder for each modality to account for the intermodal heterogeneity. While these encoders are fully federated, the decoders are partially personalized to meet individual needs – using the discrepancy between global and local parameter updates to dynamically determine which decoder filters are personalized. Implementation-wise, a server with full-modal data employs a fusion decoder to fuse representations from all modality-specific encoders, thus bridging the modalities to optimize the encoders via backpropagation. Moreover, multiple anchors are extracted from the fused multimodal representations and distributed to the clients in addition to the model parameters. Conversely, the clients with incomplete modalities calibrate their missing-modal representations toward the global full-modal anchors via scaled dot-product cross-attention, making up for the information loss due to absent modalities. FedMEPD is validated on the BraTS 2018 and 2020 multimodal brain tumor segmentation benchmarks. Results show that it outperforms various up-to-date methods for multimodal and personalized FL, and its novel designs are effective.
[173] Locality-Attending Vision Transformer
Sina Hajimiri, Farzad Beizaee, Fereshteh Shakeri, Christian Desrosiers, Ismail Ben Ayed, Jose Dolz
Main category: cs.CV
TL;DR: A simple add-on module for vision transformers that improves segmentation performance while maintaining classification accuracy by modulating self-attention with learnable Gaussian kernels to focus on local spatial details.
Details
Motivation: Vision transformers excel at classification using global self-attention but lose fine-grained spatial details crucial for segmentation tasks. The authors aim to enhance segmentation performance of vision transformers after standard classification training without compromising their image-level recognition capabilities.Method: Proposes two modifications: 1) Modulates self-attention with a learnable Gaussian kernel that biases attention toward neighboring patches to focus on local surroundings, and 2) Refines patch representations to learn better embeddings at patch positions. These changes preserve global information while improving local spatial awareness.
Result: Substantial segmentation gains on three benchmarks (e.g., over 6% and 4% on ADE20K for ViT Tiny and Base) without changing training regime or sacrificing classification performance.
Conclusion: The proposed simple yet effective add-on successfully enhances vision transformers’ segmentation capabilities while maintaining their classification strengths, demonstrating that local attention mechanisms can complement global self-attention for spatial understanding tasks.
Abstract: Vision transformers have demonstrated remarkable success in classification by leveraging global self-attention to capture long-range dependencies. However, this same mechanism can obscure fine-grained spatial details crucial for tasks such as segmentation. In this work, we seek to enhance segmentation performance of vision transformers after standard image-level classification training. More specifically, we present a simple yet effective add-on that improves performance on segmentation tasks while retaining vision transformers’ image-level recognition capabilities. In our approach, we modulate the self-attention with a learnable Gaussian kernel that biases the attention toward neighboring patches. We further refine the patch representations to learn better embeddings at patch positions. These modifications encourage tokens to focus on local surroundings and ensure meaningful representations at spatial positions, while still preserving the model’s ability to incorporate global information. Experiments demonstrate the effectiveness of our modifications, evidenced by substantial segmentation gains on three benchmarks (e.g., over 6% and 4% on ADE20K for ViT Tiny and Base), without changing the training regime or sacrificing classification performance. The code is available at https://github.com/sinahmr/LocAtViT/.
[174] FC-VFI: Faithful and Consistent Video Frame Interpolation for High-FPS Slow Motion Video Generation
Ganggui Ding, Hao Chen, Xiaogang Xu
Main category: cs.CV
TL;DR: FC-VFI: A video frame interpolation method that uses temporal modeling on latent sequences and semantic matching lines for structure-aware motion guidance to achieve high-fidelity 4x and 8x interpolation while preserving visual fidelity and motion consistency.
Details
Motivation: Current video diffusion models for frame interpolation struggle with high fidelity frame generation due to reliance on intrinsic generative priors, limiting detail preservation from start and end frames. Existing motion control methods have issues: dense optical flow is error-prone and sparse points lack structural context.Method: Proposes FC-VFI with three key components: 1) Temporal modeling strategy on latent sequences to inherit fidelity cues from start and end frames, 2) Semantic matching lines for structure-aware motion guidance to improve motion consistency, and 3) Temporal difference loss to mitigate temporal inconsistencies.
Result: FC-VFI achieves high performance and structural integrity across diverse scenarios, supporting 4x and 8x interpolation, boosting frame rates from 30 FPS to 120 and 240 FPS at 2560×1440 resolution while preserving visual fidelity and motion consistency.
Conclusion: FC-VFI effectively addresses limitations of existing video frame interpolation methods by combining temporal modeling on latent sequences with structure-aware motion guidance, achieving faithful and consistent interpolation results with high visual fidelity.
Abstract: Large pre-trained video diffusion models excel in video frame interpolation but struggle to generate high fidelity frames due to reliance on intrinsic generative priors, limiting detail preservation from start and end frames. Existing methods often depend on motion control for temporal consistency, yet dense optical flow is error-prone, and sparse points lack structural context. In this paper, we propose FC-VFI for faithful and consistent video frame interpolation, supporting (4\times)x and (8\times) interpolation, boosting frame rates from 30 FPS to 120 and 240 FPS at (2560\times 1440)resolution while preserving visual fidelity and motion consistency. We introduce a temporal modeling strategy on the latent sequences to inherit fidelity cues from start and end frames and leverage semantic matching lines for structure-aware motion guidance, improving motion consistency. Furthermore, we propose a temporal difference loss to mitigate temporal inconsistencies. Extensive experiments show FC-VFI achieves high performance and structural integrity across diverse scenarios.
[175] AdaIAT: Adaptively Increasing Attention to Generated Text to Alleviate Hallucinations in LVLM
Li’an Zhong, Ziqiang He, Jibin Zheng, Jin Li, Z. Jane Wang, Xiangui Kang
Main category: cs.CV
TL;DR: IAT and AdaIAT methods reduce hallucinations in Large Vision-Language Models by leveraging generated text attention patterns instead of just amplifying image token attention, achieving better trade-off between hallucination reduction and linguistic coherence.
Details
Motivation: Hallucination is a major problem in LVLMs that limits their practical application. While increasing attention to image tokens reduces hallucinations, it causes repetitive descriptions. The authors discovered that real object tokens show different attention patterns to generated text than hallucinated ones, suggesting a better approach.Method: Proposed IAT (Attention to Generated Text) which leverages attention patterns between generated text and image tokens to reduce hallucinations. Further developed AdaIAT (Adaptive IAT) with layer-wise thresholds to control intervention timing and fine-grained amplification magnitude per attention head, preserving model’s inherent prediction capabilities.
Result: AdaIAT significantly reduces hallucination rates (CS and CI on LLaVA-1.5 by 35.8% and 37.1% respectively) while avoiding repetitive descriptions and preserving linguistic performance and prediction capability. Multiple LVLMs show improved performance with this approach.
Conclusion: The attention patterns to generated text provide valuable signals for hallucination mitigation. AdaIAT achieves an attractive trade-off between reducing hallucinations and maintaining model capabilities, offering a practical solution for improving LVLM reliability.
Abstract: Hallucination has been a significant impediment to the development and application of current Large Vision-Language Models (LVLMs). To mitigate hallucinations, one intuitive and effective way is to directly increase attention weights to image tokens during inference. Although this effectively reduces the hallucination rate, it often induces repetitive descriptions. To address this, we first conduct an analysis of attention patterns and reveal that real object tokens tend to assign higher attention to the generated text than hallucinated ones. This inspires us to leverage the generated text, which contains instruction-related visual information and contextual knowledge, to alleviate hallucinations while maintaining linguistic coherence. We therefore propose Attention to Generated Text (IAT) and demonstrate that it significantly reduces the hallucination rate while avoiding repetitive descriptions. To prevent naive amplification from impairing the inherent prediction capabilities of LVLMs, we further explore Adaptive IAT (AdaIAT) that employs a layer-wise threshold to control intervention time and fine-grained amplification magnitude tailored to the characteristics of each attention head. Both analysis and experiments demonstrate the effectiveness of AdaIAT. Results of several LVLMs show that AdaIAT effectively alleviates hallucination (reducing hallucination rates $C_S$ and $C_I$ on LLaVA-1.5 by 35.8% and 37.1%, respectively) while preserving linguistic performance and prediction capability, achieving an attractive trade-off.
[176] Person Detection and Tracking from an Overhead Crane LiDAR
Nilusha Jayawickrama, Henrik Toikka, Risto Ojala
Main category: cs.CV
TL;DR: Adapting 3D LiDAR detectors for person detection and tracking from overhead crane-mounted sensors in industrial settings, with curated dataset and distance-based evaluation.
Details
Motivation: Address domain shift from vehicle-centric LiDAR benchmarks to overhead industrial settings, and lack of suitable public training data for person detection in overhead crane-mounted LiDAR systems.Method: Curate site-specific overhead LiDAR dataset with 3D human bounding-box annotations, adapt selected 3D detectors under unified training/evaluation protocol, integrate lightweight tracking-by-detection using AB3DMOT and SimpleTrack, and perform distance-sliced evaluation.
Result: Best detectors achieve AP up to 0.84 within 5.0m horizontal radius (0.97 at 1.0m), with VoxelNeXt and SECOND as most reliable backbones; real-time feasibility demonstrated through latency measurements.
Conclusion: Successfully bridges domain gap between standard driving datasets and overhead sensing for person detection/tracking, with released dataset and implementations to support further research.
Abstract: This paper investigates person detection and tracking in an industrial indoor workspace using a LiDAR mounted on an overhead crane. The overhead viewpoint introduces a strong domain shift from common vehicle-centric LiDAR benchmarks, and limited availability of suitable public training data. Henceforth, we curate a site-specific overhead LiDAR dataset with 3D human bounding-box annotations and adapt selected candidate 3D detectors under a unified training and evaluation protocol. We further integrate lightweight tracking-by-detection using AB3DMOT and SimpleTrack to maintain person identities over time. Detection performance is reported with distance-sliced evaluation to quantify the practical operating envelope of the sensing setup. The best adapted detector configurations achieve average precision (AP) up to 0.84 within a 5.0 m horizontal radius, increasing to 0.97 at 1.0 m, with VoxelNeXt and SECOND emerging as the most reliable backbones across this range. The acquired results contribute in bridging the domain gap between standard driving datasets and overhead sensing for person detection and tracking. We also report latency measurements, highlighting practical real-time feasibility. Finally, we release our dataset and implementations in GitHub to support further research
[177] Adaptive Prototype-based Interpretable Grading of Prostate Cancer
Riddhasree Bhattacharyya, Pallabi Dutta, Sushmita Mitra
Main category: cs.CV
TL;DR: A prototype-based weakly-supervised framework for interpretable prostate cancer grading from histopathology images that mimics pathologist workflow by comparing suspicious regions with clinically validated examples.
Details
Motivation: Prostate cancer grading is tedious and subjective, creating heavy workload for pathologists. While deep learning shows promise, limited interpretability hinders adoption in medical applications. Existing interpretability methods provide coarse explanations but don't reveal why highlighted regions matter.Method: Proposes a prototype-based weakly-supervised framework with: 1) initial patch-level pre-training to learn prototypical features for each grade, 2) fine-tuning with a new prototype-aware loss function for weakly-supervised grading, and 3) attention-based dynamic pruning mechanism to handle inter-sample heterogeneity while emphasizing relevant prototypes.
Result: Extensive validation on benchmark PANDA and SICAP datasets confirms the framework can serve as a reliable assistive tool for pathologists in routine diagnostic workflows.
Conclusion: The proposed prototype-based framework provides interpretable prostate cancer grading that mirrors pathologist workflow, making it more trustworthy for clinical adoption compared to traditional black-box deep learning approaches.
Abstract: Prostate cancer being one of the frequently diagnosed malignancy in men, the rising demand for biopsies places a severe workload on pathologists. The grading procedure is tedious and subjective, motivating the development of automated systems. Although deep learning has made inroads in terms of performance, its limited interpretability poses challenges for widespread adoption in high-stake applications like medicine. Existing interpretability techniques for prostate cancer classifiers provide a coarse explanation but do not reveal why the highlighted regions matter. In this scenario, we propose a novel prototype-based weakly-supervised framework for an interpretable grading of prostate cancer from histopathology images. These networks can prove to be more trustworthy since their explicit reasoning procedure mirrors the workflow of a pathologist in comparing suspicious regions with clinically validated examples. The network is initially pre-trained at patch-level to learn robust prototypical features associated with each grade. In order to adapt it to a weakly-supervised setup for prostate cancer grading, the network is fine-tuned with a new prototype-aware loss function. Finally, a new attention-based dynamic pruning mechanism is introduced to handle inter-sample heterogeneity, while selectively emphasizing relevant prototypes for optimal performance. Extensive validation on the benchmark PANDA and SICAP datasets confirms that the framework can serve as a reliable assistive tool for pathologists in their routine diagnostic workflows.
[178] Location-Aware Pretraining for Medical Difference Visual Question Answering
Denis Musinguzi, Caren Han, Prasenjit Mitra
Main category: cs.CV
TL;DR: A pretraining framework with location-aware tasks (AREF, GCAP, CAREF) enhances vision encoders for medical difference VQA, achieving SOTA in detecting clinical changes in chest X-rays.
Details
Motivation: Standard vision encoders fail to capture subtle visual variations needed for medical differential diagnosis, which requires comparing multiple images to distinguish disease progression from acquisition differences.Method: Introduces pretraining with location-aware tasks: automatic referring expressions (AREF), grounded captioning (GCAP), and conditional automatic referring expressions (CAREF) to learn fine-grained, spatially grounded visual representations, then integrates with language model for medical difference VQA.
Result: Achieves state-of-the-art performance in detecting and reasoning about clinically relevant changes in chest X-ray images.
Conclusion: Location-aware pretraining enables vision encoders to capture subtle visual variations crucial for medical differential diagnosis, outperforming traditional methods.
Abstract: Unlike conventional single-image models, differential medical VQA frameworks process multiple images to identify differences, mirroring the comparative diagnostic workflow of radiologists. However, standard vision encoders trained on contrastive or classification objectives often fail to capture the subtle visual variations necessary for distinguishing disease progression from acquisition differences. To address this limitation, we introduce a pretraining framework that incorporates location-aware tasks, including automatic referring expressions (AREF), grounded captioning (GCAP), and conditional automatic referring expressions (CAREF). These specific tasks enable the vision encoder to learn fine-grained, spatially grounded visual representations that are often overlooked by traditional pre-training methods. We subsequently integrate this enhanced vision encoder with a language model to perform medical difference VQA. Experimental results demonstrate that our approach achieves state-of-the-art performance in detecting and reasoning about clinically relevant changes in chest X-ray images.
[179] VisionPangu: A Compact and Fine-Grained Multimodal Assistant with 1.7B Parameters
Jiaxin Fan, Wenpo Song
Main category: cs.CV
TL;DR: VisionPangu is a compact 1.7B-parameter multimodal model that improves detailed image captioning through efficient multimodal alignment and high-quality supervision from dense human-authored descriptions.
Details
Motivation: Existing Large Multimodal Models (LMMs) often rely on large-scale architectures and coarse supervision, which limits their ability to generate detailed image captions. The authors aim to create a more compact model that can produce richer, more structured captions without aggressive model scaling.Method: Combines an InternVL-derived vision encoder with OpenPangu-Embedded language backbone via lightweight MLP projector, adopts LLaVA-inspired instruction-tuning pipeline, and incorporates dense human-authored descriptions from DOCCI dataset for supervision.
Result: Demonstrates that compact multimodal models can achieve competitive performance while producing more structured and detailed captions compared to larger models.
Conclusion: VisionPangu shows that efficient multimodal alignment and high-quality supervision can enable compact models to generate detailed image captions without relying on large-scale architectures.
Abstract: Large Multimodal Models (LMMs) have achieved strong performance in vision-language understanding, yet many existing approaches rely on large-scale architectures and coarse supervision, which limits their ability to generate detailed image captions. In this work, we present VisionPangu, a compact 1.7B-parameter multimodal model designed to improve detailed image captioning through efficient multimodal alignment and high-quality supervision. Our model combines an InternVL-derived vision encoder with the OpenPangu-Embedded language backbone via a lightweight MLP projector and adopts an instruction-tuning pipeline inspired by LLaVA. By incorporating dense human-authored descriptions from the DOCCI dataset, VisionPangu improves semantic coherence and descriptive richness without relying on aggressive model scaling. Experimental results demonstrate that compact multimodal models can achieve competitive performance while producing more structured and detailed captions. The code and model weights will be publicly available at https://www.modelscope.cn/models/asdfgh007/visionpangu.
[180] Revisiting an Old Perspective Projection for Monocular 3D Morphable Models Regression
Toby Chong, Ryota Nakajima
Main category: cs.CV
TL;DR: Novel camera model for monocular 3DMM regression that captures perspective distortion in close-up facial images by extending orthographic projection with a shrinkage parameter.
Details
Motivation: Existing 3DMM regression methods use orthographic projection which works well for stable performance but fails for close-up footage (like head-mounted cameras) due to perspective distortion effects. There's a need to handle close-up facial images while maintaining stability.Method: Extends orthographic projection with a new shrinkage parameter to incorporate pseudo-perspective effect while preserving original projection stability. Presents techniques for finetuning existing models.
Result: Demonstrated effectiveness through quantitative and qualitative comparisons using custom dataset recorded with head-mounted cameras. Shows improved performance for close-up facial footage.
Conclusion: The proposed camera model successfully addresses perspective distortion in close-up facial images while maintaining the stability benefits of orthographic projection for 3DMM regression.
Abstract: We introduce a novel camera model for monocular 3D Morphable Model (3DMM) regression methods that effectively captures the perspective distortion effect commonly seen in close-up facial images. Fitting 3D morphable models to video is a key technique in content creation. In particular, regression-based approaches have produced fast and accurate results by matching the rendered output of the morphable model to the target image. These methods typically achieve stable performance with orthographic projection, which eliminates the ambiguity between focal length and object distance. However, this simplification makes them unsuitable for close-up footage, such as that captured with head-mounted cameras. We extend orthographic projection with a new shrinkage parameter, incorporating a pseudo-perspective effect while preserving the stability of the original projection. We present several techniques that allow finetuning of existing models, and demonstrate the effectiveness of our modification through both quantitative and qualitative comparisons using a custom dataset recorded with head-mounted cameras.
[181] BiEvLight: Bi-level Learning of Task-Aware Event Refinement for Low-Light Image Enhancement
Zishu Yao, Xiang-Xiang Su, Shengning Zhou, Guang-Yong Chen, Guodong Fan, Xing Chen
Main category: cs.CV
TL;DR: BiEvLight: A bilevel optimization framework for low-light image enhancement using event cameras that jointly optimizes event denoising and image enhancement through gradient-guided priors and task-aware learning.
Details
Motivation: Event cameras have high dynamic range potential for low-light image enhancement, but suffer from dual degradation: intrinsic background activity noise in events and low SNR in images. Existing fusion strategies face severe noise coupling during modal fusion, creating a performance bottleneck. Precise event denoising is identified as the prerequisite for unlocking event-based fusion potential.Method: Proposes BiEvLight, a hierarchical task-aware framework that collaboratively optimizes enhancement and denoising through bilevel optimization. Uses gradient correlation between images and events to build gradient-guided event denoising priors for heavily noisy regions. Treats event denoising as a bilevel optimization problem constrained by enhancement task rather than static pre-processing, enabling cross-task interaction where upper-level denoising learns representations tailored to lower-level enhancement objectives.
Result: Extensive experiments on Real-world noise Dataset SDE show significant outperformance over SOTA approaches: average improvements of 1.30dB in PSNR, 2.03dB in PSNR*, and 0.047 in SSIM.
Conclusion: BiEvLight demonstrates that precise event denoising is crucial for event-based low-light image enhancement, and the proposed bilevel optimization framework effectively addresses noise coupling issues through collaborative optimization of denoising and enhancement tasks.
Abstract: Event cameras, with their high dynamic range, show great promise for Low-light Image Enhancement (LLIE). Existing works primarily focus on designing effective modal fusion strategies. However, a key challenge is the dual degradation from intrinsic background activity (BA) noise in events and low signal-to-noise ratio (SNR) in images, which causes severe noise coupling during modal fusion, creating a critical performance bottleneck. We therefore posit that precise event denoising is the prerequisite to unlocking the full potential of event-based fusion. To this end, we propose BiEvLight, a hierarchical and task-aware framework that collaboratively optimizes enhancement and denoising by exploiting their intrinsic interdependence. Specifically, BiEvLight exploits the strong gradient correlation between images and events to build a gradient-guided event denoising prior that alleviates insufficient denoising in heavily noisy regions. Moreover, instead of treating event denoising as a static pre-processing stage-which inevitably incurs a trade-off between over- and under-denoising and cannot adapt to the requirements of a specific enhancement objective-we recast it as a bilevel optimization problem constrained by the enhancement task. Through cross-task interaction, the upper-level denoising problem learns event representations tailored to the lower-level enhancement objective, thereby substantially improving overall enhancement quality. Extensive experiments on the Real-world noise Dataset SDE demonstrate that our method significantly outperforms state-of-the-art (SOTA) approaches, with average improvements of 1.30dB in PSNR, 2.03dB in PSNR* and 0.047 in SSIM, respectively. The code will be publicly available at https://github.com/iijjlk/BiEvlight.
[182] 3D-RFT: Reinforcement Fine-Tuning for Video-based 3D Scene Understanding
Xiongkun Linghu, Jiangyong Huang, Baoxiong Jia, Siyuan Huang
Main category: cs.CV
TL;DR: 3D-RFT extends reinforcement learning with verifiable rewards to video-based 3D scene understanding, directly optimizing models using task-specific metrics like 3D IoU and F1-Score.
Details
Motivation: Existing approaches use supervised fine-tuning with token-level cross-entropy loss as an indirect proxy, causing misalignment between training objectives and task performance. There's a need to bridge this gap for better 3D scene understanding.Method: 3D-RFT first activates 3D-aware multimodal LLMs via supervised fine-tuning, then uses reinforcement fine-tuning with Group Relative Policy Optimization (GRPO) and verifiable reward functions based on metrics like 3D IoU and F1-Score.
Result: 3D-RFT-4B achieves state-of-the-art performance on various video-based 3D scene understanding tasks, significantly outperforming larger models like VG-LLM-8B on 3D video detection, 3D visual grounding, and spatial reasoning benchmarks.
Conclusion: 3D-RFT presents a robust paradigm for 3D scene understanding that directly optimizes toward evaluation metrics, offering valuable insights into training strategies and data impact for future development.
Abstract: Reinforcement Learning with Verifiable Rewards ( RLVR ) has emerged as a transformative paradigm for enhancing the reasoning capabilities of Large Language Models ( LLMs), yet its potential in 3D scene understanding remains under-explored. Existing approaches largely rely on Supervised Fine-Tuning ( SFT), where the token-level cross-entropy loss acts as an indirect proxy for optimization, leading to a misalignment between training objectives and task performances. To bridge this gap, we present Reinforcement Fine-Tuning for Video-based 3D Scene Understanding (3D-RFT ), the first framework to extend RLVR to video-based 3D perception and reasoning. 3D-RFT shifts the paradigm by directly optimizing the model towards evaluation metrics. 3D-RFT first activates 3D-aware Multi-modal Large Language Models ( MLLM s) via SFT, followed by reinforcement fine-tuning using Group Relative Policy Optimization ( GRPO) with strictly verifiable reward functions. We design task-specific reward functions directly from metrics like 3D IoU and F1-Score to provide more effective signals to guide model training. Extensive experiments demonstrate that 3D-RFT-4B achieves state-of-the-art performance on various video-based 3D scene understanding tasks. Notably, 3D-RFT-4B significantly outperforms larger models (e.g., VG LLM-8B) on 3D video detection, 3D visual grounding, and spatial reasoning benchmarks. We further reveal good properties of 3D-RFT such as robust efficacy, and valuable insights into training strategies and data impact. We hope 3D-RFT can serve as a robust and promising paradigm for future development of 3D scene understanding.
[183] Think, Then Verify: A Hypothesis-Verification Multi-Agent Framework for Long Video Understanding
Zheng Wang, Haoran Chen, Haoxuan Qin, Zhipeng Wei, Tianwen Qian, Cong Bai
Main category: cs.CV
TL;DR: VideoHV-Agent: A hypothesis-verification framework for long-video QA that first formulates testable hypotheses from answer candidates, then verifies them with localized video evidence, achieving SOTA accuracy with better interpretability and lower cost.
Details
Motivation: Long video understanding faces challenges from visual redundancy, temporal dependencies, and semantic drift in chain-of-thought/retrieval approaches. The paper argues for "thinking-before-finding" - first articulating what must be true for each answer candidate before retrieval.Method: VideoHV-Agent reformulates video QA as structured hypothesis-verification: 1) Thinker rewrites answer candidates into testable hypotheses, 2) Judge derives discriminative clues specifying required evidence, 3) Verifier grounds and tests clues using localized video content, 4) Answer agent integrates validated evidence.
Result: Achieves state-of-the-art accuracy on three long-video understanding benchmarks while providing enhanced interpretability, improved logical soundness, and lower computational cost compared to existing methods.
Conclusion: The hypothesis-verification approach with “thinking-before-finding” principle effectively addresses long-video reasoning challenges, offering a more interpretable and efficient framework for video understanding tasks.
Abstract: Long video understanding is challenging due to dense visual redundancy, long-range temporal dependencies, and the tendency of chain-of-thought and retrieval-based agents to accumulate semantic drift and correlation-driven errors. We argue that long-video reasoning should begin not with reactive retrieval, but with deliberate task formulation: the model must first articulate what must be true in the video for each candidate answer to hold. This thinking-before-finding principle motivates VideoHV-Agent, a framework that reformulates video question answering as a structured hypothesis-verification process. Based on video summaries, a Thinker rewrites answer candidates into testable hypotheses, a Judge derives a discriminative clue specifying what evidence must be checked, a Verifier grounds and tests the clue using localized, fine-grained video content, and an Answer agent integrates validated evidence to produce the final answer. Experiments on three long-video understanding benchmarks show that VideoHV-Agent achieves state-of-the-art accuracy while providing enhanced interpretability, improved logical soundness, and lower computational cost. We make our code publicly available at: https://github.com/Haorane/VideoHV-Agent.
[184] A Simple Baseline for Unifying Understanding, Generation, and Editing via Vanilla Next-token Prediction
Jie Zhu, Hanghang Ma, Jia Wang, Yayong Guan, Yanbing Zeng, Lishuai Gao, Junqiang Wu, Jie Hu, Leye Wang
Main category: cs.CV
TL;DR: Wallaroo is an autoregressive model using next-token prediction to unify multimodal understanding, image generation, and editing with multi-resolution I/O and bilingual support.
Details
Motivation: To create a unified model that can handle both multimodal understanding and generation tasks simultaneously, addressing the need for models that can both understand and generate visual content in a single framework.Method: Uses autoregressive next-token prediction with decoupled visual encoding pathways and a four-stage training strategy to reshape model capabilities. Supports multi-resolution image input/output and bilingual Chinese/English.
Result: Competitive or superior performance on various benchmarks compared to other unified models, demonstrating strong potential for autoregressive models in multimodal unification.
Conclusion: Wallaroo shows autoregressive models have great potential for unifying multimodal understanding and generation tasks in a single framework.
Abstract: In this work, we introduce Wallaroo, a simple autoregressive baseline that leverages next-token prediction to unify multi-modal understanding, image generation, and editing at the same time. Moreover, Wallaroo supports multi-resolution image input and output, as well as bilingual support for both Chinese and English. We decouple the visual encoding into separate pathways and apply a four-stage training strategy to reshape the model’s capabilities. Experiments are conducted on various benchmarks where Wallaroo produces competitive performance or exceeds other unified models, suggesting the great potential of autoregressive models in unifying multi-modality understanding and generation. Our code is available at https://github.com/JiePKU/Wallaroo.
[185] TAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events
Jiaxiong Liu, Zhen Tan, Jinpu Zhang, Yi Zhou, Hui Shen, Xieyuanli Chen, Dewen Hu
Main category: cs.CV
TL;DR: TAPFormer is a transformer-based framework for arbitrary point tracking that performs asynchronous temporal-consistent fusion of RGB frames and event streams, addressing temporal misalignment and modality failure issues through novel fusion mechanisms.
Details
Motivation: Existing point tracking methods combining RGB frames and event streams suffer from synchronous/non-adaptive fusion, leading to temporal misalignment and severe degradation when one modality fails. There's a need for robust, high-frequency tracking that can handle challenging conditions like blur or low light.Method: TAPFormer uses a Transient Asynchronous Fusion (TAF) mechanism to model temporal evolution between discrete frames through continuous event updates, bridging low-rate frames and high-rate events. It also employs a Cross-modal Locally Weighted Fusion (CLWF) module that adaptively adjusts spatial attention based on modality reliability.
Result: The method achieves 28.2% improvement in average pixel error within threshold on their novel real-world frame-event dataset. It also consistently achieves best performance on standard point tracking benchmarks.
Conclusion: TAPFormer demonstrates superior performance in arbitrary point tracking through effective asynchronous fusion of frames and events, with robustness to challenging conditions like blur and low light.
Abstract: Tracking any point (TAP) is a fundamental yet challenging task in computer vision, requiring high precision and long-term motion reasoning. Recent attempts to combine RGB frames and event streams have shown promise, yet they typically rely on synchronous or non-adaptive fusion, leading to temporal misalignment and severe degradation when one modality fails. We introduce TAPFormer, a transformer-based framework that performs asynchronous temporal-consistent fusion of frames and events for robust and high-frequency arbitrary point tracking. Our key innovation is a Transient Asynchronous Fusion (TAF) mechanism, which explicitly models the temporal evolution between discrete frames through continuous event updates, bridging the gap between low-rate frames and high-rate events. In addition, a Cross-modal Locally Weighted Fusion (CLWF) module adaptively adjusts spatial attention according to modality reliability, yielding stable and discriminative features even under blur or low light. To evaluate our approach under realistic conditions, we construct a novel real-world frame-event TAP dataset under diverse illumination and motion conditions. Our method outperforms existing point trackers, achieving a 28.2% improvement in average pixel error within threshold. Moreover, on standard point tracking benchmarks, our tracker consistently achieves the best performance. Project website: tapformer.github.io
[186] MultiGO++: Monocular 3D Clothed Human Reconstruction via Geometry-Texture Collaboration
Nanjie Yao, Gangjian Zhang, Wenhao Shen, Jian Shu, Yu Feng, Hao Wang
Main category: cs.CV
TL;DR: MultiGO++ is a novel framework for monocular 3D clothed human reconstruction that addresses limitations in texture availability, geometric priors, and single-modality supervision through multi-source texture synthesis, region-aware shape extraction, and geometry-texture collaborative learning.
Details
Motivation: Existing methods for monocular 3D clothed human reconstruction suffer from three key limitations: textural limitations due to lack of training data, geometric limitations from inaccurate external priors, and systematic limitations from biased single-modality supervision, leading to suboptimal reconstruction quality.Method: The framework includes: (1) multi-source texture synthesis strategy creating 15,000+ 3D textured human scans for better texture quality estimation; (2) region-aware shape extraction module with Fourier geometry encoder to extract body region features and mitigate modality gaps; (3) dual reconstruction U-Net that leverages geometry-texture collaborative features to generate high-fidelity textured 3D human meshes.
Result: Extensive experiments on two benchmarks and in-the-wild cases demonstrate superiority over state-of-the-art approaches in 3D clothed human reconstruction quality.
Conclusion: MultiGO++ effectively addresses key limitations in monocular 3D human reconstruction through systematic geometry-texture collaboration, achieving improved reconstruction quality through better texture synthesis, geometry extraction, and multimodal feature integration.
Abstract: Monocular 3D clothed human reconstruction aims to generate a complete and realistic textured 3D avatar from a single image. Existing methods are commonly trained under multi-view supervision with annotated geometric priors, and during inference, these priors are estimated by the pre-trained network from the monocular input. These methods are constrained by three key limitations: texturally by unavailability of training data, geometrically by inaccurate external priors, and systematically by biased single-modality supervision, all leading to suboptimal reconstruction. To address these issues, we propose a novel reconstruction framework, named MultiGO++, which achieves effective systematic geometry-texture collaboration. It consists of three core parts: (1) A multi-source texture synthesis strategy that constructs 15,000+ 3D textured human scans to improve the performance on texture quality estimation in challenge scenarios; (2) A region-aware shape extraction module that extracts and interacts features of each body region to obtain geometry information and a Fourier geometry encoder that mitigates the modality gap to achieve effective geometry learning; (3) A dual reconstruction U-Net that leverages geometry-texture collaborative features to refine and generate high-fidelity textured 3D human meshes. Extensive experiments on two benchmarks and many in-the-wild cases show the superiority of our method over state-of-the-art approaches.
[187] Physics-consistent deep learning for blind aberration recovery in mobile optics
Kartik Jhawar, Tamo Sancho Miguel Tandoc, Khoo Jun Xuan, Wang Lipo
Main category: cs.CV
TL;DR: Lens2Zernike: A deep learning framework that blindly recovers physical optical parameters (Zernike coefficients) from single blurred images to enable stable non-blind deconvolution for mobile photography.
Details
Motivation: Mobile photography suffers from lens-specific optical aberrations. Current deep learning methods lack explicit optical modeling and can hallucinate details, while classical blind deconvolution is unstable. There's a need to bridge this gap by recovering physical optical parameters.Method: Multi-task framework with three supervision domains: 1) direct Zernike coefficient regression, 2) differentiable physics constraints (wavefront and point spread function derivations), and 3) auxiliary multi-task spatial map predictions. Uses ResNet-18 backbone.
Result: Full framework (z+p+m) yields 35% improvement over coefficient-only baselines. Outperforms two established deep learning methods with significantly lower regression errors. Enables stable non-blind deconvolution with substantial in-domain improvement on IDMxS Mobile Camera Lens Database.
Conclusion: Lens2Zernike successfully bridges the gap between black-box deep learning and unstable classical methods by recovering physical optical parameters, enabling stable restoration of diffraction-limited details from aberrated mobile captures.
Abstract: Mobile photography is often limited by complex, lens-specific optical aberrations. While recent deep learning methods approach this as an end-to-end deblurring task, these “black-box” models lack explicit optical modeling and can hallucinate details. Conversely, classical blind deconvolution remains highly unstable. To bridge this gap, we present Lens2Zernike, a deep learning framework that blindly recovers physical optical parameters from a single blurred image. To the best of our knowledge, no prior work has simultaneously integrated supervision across three distinct optical domains. We introduce a novel physics-consistent strategy that explicitly minimizes errors via direct Zernike coefficient regression (z), differentiable physics constraints encompassing both wavefront and point spread function derivations (p), and auxiliary multi-task spatial map predictions (m). Through an ablation study on a ResNet-18 backbone, we demonstrate that our full multi-task framework (z+p+m) yields a 35% improvement over coefficient-only baselines. Crucially, comparative analysis reveals that our approach outperforms two established deep learning methods from previous literature, achieving significantly lower regression errors. Ultimately, we demonstrate that these recovered physical parameters enable stable non-blind deconvolution, providing substantial in-domain improvement on the patented Institute for Digital Molecular Analytics and Science (IDMxS) Mobile Camera Lens Database for restoring diffraction-limited details from severely aberrated mobile captures.
[188] How far have we gone in Generative Image Restoration? A study on its capability, limitations and evaluation practices
Xiang Yin, Jinfan Hu, Zhiyuan You, Kainan Yan, Yu Tang, Chao Dong, Jinjin Gu
Main category: cs.CV
TL;DR: Large-scale study of generative image restoration models using multi-dimensional evaluation pipeline covering detail, sharpness, semantic correctness, and overall quality across diverse architectures, revealing paradigm shift from detail scarcity to detail quality/semantic control challenges.
Details
Motivation: To systematically evaluate how far generative image restoration (GIR) has truly advanced compared to previous methods, and understand the current state and limitations of modern GIR models through comprehensive analysis.Method: Developed a new multi-dimensional evaluation pipeline assessing models on detail, sharpness, semantic correctness, and overall quality. Analyzed diverse architectures including diffusion-based, GAN-based, PSNR-oriented, and general-purpose generation models.
Result: Revealed critical performance disparities across different model types and uncovered a paradigm shift in failure modes - from detail scarcity (under-generation) to detail quality and semantic control issues (over-generation). Also trained a new IQA model better aligned with human perceptual judgments.
Conclusion: The study provides systematic understanding of modern generative image restoration models, redefining their true state and charting future development directions focused on controlling detail quality and semantic correctness rather than just generating more details.
Abstract: Generative Image Restoration (GIR) has achieved impressive perceptual realism, but how far have its practical capabilities truly advanced compared with previous methods? To answer this, we present a large-scale study grounded in a new multi-dimensional evaluation pipeline that assesses models on detail, sharpness, semantic correctness, and overall quality. Our analysis covers diverse architectures, including diffusion-based, GAN-based, PSNR-oriented, and general-purpose generation models, revealing critical performance disparities. Furthermore, our analysis uncovers a key evolution in failure modes that signifies a paradigm shift for the perception-oriented low-level vision field. The central challenge is evolving from the previous problem of detail scarcity (under-generation) to the new frontier of detail quality and semantic control (preventing over-generation). We also leverage our benchmark to train a new IQA model that better aligns with human perceptual judgments. Ultimately, this work provides a systematic study of modern generative image restoration models, offering crucial insights that redefine our understanding of their true state and chart a course for future development.
[189] Tell2Adapt: A Unified Framework for Source Free Unsupervised Domain Adaptation via Vision Foundation Model
Yulong Shi, Shijie Li, Ziyi Li, Lin Qi
Main category: cs.CV
TL;DR: Tell2Adapt is a novel Source-Free Unsupervised Domain Adaptation framework for medical image segmentation that leverages Vision Foundation Models to generate high-quality pseudo-labels and refine predictions across multiple modalities and anatomical targets.
Details
Motivation: Existing SFUDA methods are limited to specific domain shifts and cannot handle multi-modality, multi-target scenarios needed for real-world clinical deployment. There's a need for a unified framework that can generalize across diverse clinical settings.Method: The framework uses Vision Foundation Models with Context-Aware Prompts Regularization (CAPR) to generate canonical instructions from varied text prompts, producing high-quality pseudo-labels. Visual Plausibility Refinement (VPR) leverages VFM’s anatomical knowledge to re-ground predictions in target images’ low-level features, removing noise and false positives.
Result: Extensive evaluation across 10 domain adaptation directions and 22 anatomical targets (brain, cardiac, polyp, abdominal) shows Tell2Adapt consistently outperforms existing approaches, achieving state-of-the-art performance for unified SFUDA in medical image segmentation.
Conclusion: Tell2Adapt provides an effective unified SFUDA framework that leverages Vision Foundation Models to handle multi-modality, multi-target domain adaptation in medical imaging, demonstrating superior performance and clinical reliability.
Abstract: Source Free Unsupervised Domain Adaptation (SFUDA) is critical for deploying deep learning models across diverse clinical settings. However, existing methods are typically designed for low-gap, specific domain shifts and cannot generalize into a unified, multi-modalities, and multi-target framework, which presents a major barrier to real-world application. To overcome this issue, we introduce Tell2Adapt, a novel SFUDA framework that harnesses the vast, generalizable knowledge of the Vision Foundation Model (VFM). Our approach ensures high-fidelity VFM prompts through Context-Aware Prompts Regularization (CAPR), which robustly translates varied text prompts into canonical instructions. This enables the generation of high-quality pseudo-labels for efficiently adapting the lightweight student model to target domain. To guarantee clinical reliability, the framework incorporates Visual Plausibility Refinement (VPR), which leverages the VFM’s anatomical knowledge to re-ground the adapted model’s predictions in target image’s low-level visual features, effectively removing noise and false positives. We conduct one of the most extensive SFUDA evaluations to date, validating our framework across 10 domain adaptation directions and 22 anatomical targets, including brain, cardiac, polyp, and abdominal targets. Our results demonstrate that Tell2Adapt consistently outperforms existing approaches, achieving SOTA for a unified SFUDA framework in medical image segmentation. Code are avaliable at https://github.com/derekshiii/Tell2Adapt.
[190] Generalizable Multiscale Segmentation of Heterogeneous Map Collections
Remi Petitpierre
Main category: cs.CV
TL;DR: A framework for generalizable semantic segmentation of diverse historical maps using procedural data synthesis and multiscale integration, with a new benchmark dataset called Semap.
Details
Motivation: Historical map collections are highly diverse in style, scale, and geographic focus, but most existing map recognition work focuses on specialist models for homogeneous map series. There's a need for generalizable models that can handle the variety of historical map documents.Method: 1) Introduced Semap dataset with 1,439 manually annotated patches reflecting historical map variety; 2) Developed segmentation framework combining procedural data synthesis with multiscale integration to improve robustness and transferability across diverse map types.
Result: Achieved state-of-the-art performance on both HCMSSD and Semap datasets. Segmentation performance remained largely stable across map collections, scales, geographic regions, and publication contexts, demonstrating the viability of diversity-driven approaches.
Conclusion: A diversity-driven approach to map recognition is viable and beneficial. The work opens the way to integrating diverse cartographic archives into historical geographic studies through benchmark datasets and methods for generic segmentation of historical maps.
Abstract: Historical map collections are highly diverse in style, scale, and geographic focus, often consisting of many single-sheet documents. Yet most work in map recognition focuses on specialist models tailored to homogeneous map series. In contrast, this article aims to develop generalizable semantic segmentation models and ontology. First, we introduce Semap, a new open benchmark dataset comprising 1,439 manually annotated patches designed to reflect the variety of historical map documents. Second, we present a segmentation framework that combines procedural data synthesis with multiscale integration to improve robustness and transferability. This framework achieves state-of-the-art performance on both the HCMSSD and Semap datasets, showing that a diversity-driven approach to map recognition is not only viable but also beneficial. The results indicate that segmentation performance remains largely stable across map collections, scales, geographic regions, and publication contexts. By proposing benchmark datasets and methods for the generic segmentation of historical maps, this work opens the way to integrating the long tail of cartographic archives to historical geographic studies.
[191] Exploiting Intermediate Reconstructions in Optical Coherence Tomography for Test-Time Adaption of Medical Image Segmentation
Thomas Pinetz, Veit Hucke, Hrvoje Bogunovic
Main category: cs.CV
TL;DR: IRTTA improves medical image segmentation by adapting downstream networks to intermediate reconstruction representations during test-time, enhancing performance and providing uncertainty estimates without modifying reconstruction or segmentation models.
Details
Motivation: Low-cost medical imaging devices rely on reconstruction algorithms, but current methods only use final reconstructed images, ignoring informative intermediate representations that could improve downstream task performance and provide uncertainty estimates.Method: IRTTA adapts normalization-layer parameters of a frozen downstream network via a modulator network conditioned on reconstruction timescale. The modulator is learned during test-time using averaged entropy loss across all timesteps, leveraging intermediate representations without modifying reconstruction or downstream models.
Result: The approach enhances segmentation performance and enables semantically meaningful uncertainty estimation at no extra computational cost, using variation among timestep-wise segmentations.
Conclusion: IRTTA effectively exploits intermediate reconstruction representations to improve medical image analysis, offering better segmentation and uncertainty quantification without requiring changes to existing reconstruction or analysis pipelines.
Abstract: Primary health care frequently relies on low-cost imaging devices, which are commonly used for screening purposes. To ensure accurate diagnosis, these systems depend on advanced reconstruction algorithms designed to approximate the performance of high-quality counterparts. Such algorithms typically employ iterative reconstruction methods that incorporate domain-specific prior knowledge. However, downstream task performance is generally assessed using only the final reconstructed image, thereby disregarding the informative intermediate representations generated throughout the reconstruction process. In this work, we propose IRTTA to exploit these intermediate representations at test-time by adapting the normalization-layer parameters of a frozen downstream network via a modulator network that conditions on the current reconstruction timescale. The modulator network is learned during test-time using an averaged entropy loss across all individual timesteps. Variation among the timestep-wise segmentations additionally provides uncertainty estimates at no extra cost. This approach enhances segmentation performance and enables semantically meaningful uncertainty estimation, all without modifying either the reconstruction process or the downstream model.
[192] CoIn3D: Revisiting Configuration-Invariant Multi-Camera 3D Object Detection
Zhaonian Kuang, Rui Ding, Haotian Wang, Xinhu Zheng, Meng Yang, Gang Hua
Main category: cs.CV
TL;DR: CoIn3D: A generalizable multi-camera 3D object detection framework that addresses cross-configuration transferability by incorporating spatial priors through feature modulation and data augmentation.
Details
Motivation: Multi-camera 3D object detection models struggle to generalize to unseen platforms with different camera configurations due to spatial prior discrepancies in intrinsics, extrinsics, and array layouts.Method: CoIn3D incorporates spatial priors through spatial-aware feature modulation (SFM) that integrates four spatial representations (focal length, ground depth, ground gradient, Plücker coordinate), and camera-aware data augmentation (CDA) using training-free dynamic novel-view image synthesis.
Result: Extensive experiments show CoIn3D achieves strong cross-configuration performance on NuScenes, Waymo, and Lyft datasets under three dominant MC3D paradigms (BEVDepth, BEVFormer, PETR).
Conclusion: CoIn3D effectively addresses cross-configuration generalization in multi-camera 3D object detection by explicitly incorporating spatial priors, enabling strong transferability to unseen camera configurations.
Abstract: Multi-camera 3D object detection (MC3D) has attracted increasing attention with the growing deployment of multi-sensor physical agents, such as robots and autonomous vehicles. However, MC3D models still struggle to generalize to unseen platforms with new multi-camera configurations. Current solutions simply employ a meta-camera for unified representation but lack comprehensive consideration. In this paper, we revisit this issue and identify that the devil lies in spatial prior discrepancies across source and target configurations, including different intrinsics, extrinsics, and array layouts. To address this, we propose CoIn3D, a generalizable MC3D framework that enables strong transferability from source configurations to unseen target ones. CoIn3D explicitly incorporates all identified spatial priors into both feature embedding and image observation through spatial-aware feature modulation (SFM) and camera-aware data augmentation (CDA), respectively. SFM enriches feature space by integrating four spatial representations, such as focal length, ground depth, ground gradient, and Plücker coordinate. CDA improves observation diversity under various configurations via a training-free dynamic novel-view image synthesis scheme. Extensive experiments demonstrate that CoIn3D achieves strong cross-configuration performance on landmark datasets such as NuScenes, Waymo, and Lyft, under three dominant MC3D paradigms represented by BEVDepth, BEVFormer, and PETR.
[193] CLIP-driven Zero-shot Learning with Ambiguous Labels
Jinfu Fan, Jiangnan Li, Xiaowen Yan, Xiaohui Zhong, Wenpeng Lu, Linqing Huang
Main category: cs.CV
TL;DR: CLIP-PZSL: A framework for partial label zero-shot learning that handles label ambiguity using CLIP features and semantic mining with progressive label refinement.
Details
Motivation: Real-world zero-shot learning scenarios often have noisy/ambiguous labels that degrade performance, but most existing methods assume accurate class labels. Need to handle label ambiguity in ZSL.Method: Uses CLIP to extract instance and label features, semantic mining block to fuse features and extract discriminative label embeddings, and partial zero-shot loss that weights candidate labels based on relevance and aligns embeddings. Progressive label refinement identifies ground-truth labels during training.
Result: Comprehensive experiments on several datasets demonstrate the advantage of CLIP-PZSL over existing methods.
Conclusion: The proposed CLIP-PZSL framework effectively handles label ambiguity in zero-shot learning through CLIP-driven feature extraction and progressive label refinement.
Abstract: Zero-shot learning (ZSL) aims to recognize unseen classes by leveraging semantic information from seen classes, but most existing methods assume accurate class labels for training instances. However, in real-world scenarios, noise and ambiguous labels can significantly reduce the performance of ZSL. To address this, we propose a new CLIP-driven partial label zero-shot learning (CLIP-PZSL) framework to handle label ambiguity. First, we use CLIP to extract instance and label features. Then, a semantic mining block fuses these features to extract discriminative label embeddings. We also introduce a partial zero-shot loss, which assigns weights to candidate labels based on their relevance to the instance and aligns instance and label embeddings to minimize semantic mismatch. As the training goes on, the ground-truth labels are progressively identified, and the refined labels and label embeddings in turn help improve the semantic alignment of instance and label features. Comprehensive experiments on several datasets demonstrate the advantage of CLIP-PZSL.
[194] MI-DETR: A Strong Baseline for Moving Infrared Small Target Detection with Bio-Inspired Motion Integration
Nian Liu, Jin Gao, Shubo Lin, Yutong Kou, Sikui Zhang, Fudong Ge, Zhiqiang Pu, Liang Li, Gang Wang, Yizheng Wang, Weiming Hu
Main category: cs.CV
TL;DR: MI-DETR is a bio-inspired dual-pathway detector for infrared small target detection that explicitly models motion using a retina-inspired cellular automaton and parvocellular-magnocellular interconnection, achieving state-of-the-art performance on ISTD benchmarks.
Details
Motivation: Infrared small target detection is challenging due to tiny, low-contrast targets in complex backgrounds. Existing multi-frame approaches often require additional motion supervision or explicit alignment modules, making them complex and inefficient.Method: Proposes Motion Integration DETR (MI-DETR) with three key components: 1) Retina-inspired cellular automaton (RCA) that converts frame sequences into motion maps, 2) Parvocellular-Magnocellular Interconnection (PMI) Block for bidirectional feature interaction between appearance and motion pathways, and 3) RT-DETR decoder for final detection.
Result: Achieves 70.3% mAP@50 and 72.7% F1 on IRDST-H (+26.35 mAP@50 improvement), 98.0% mAP@50 on DAUB-R, and 88.3% mAP@50 on ITSDT-15K, demonstrating strong performance on three ISTD benchmarks.
Conclusion: The biologically inspired motion-appearance integration approach is simple yet effective, showing that explicit motion modeling without extra supervision or alignment can significantly improve infrared small target detection performance.
Abstract: Infrared small target detection (ISTD) is challenging because tiny, low-contrast targets are easily obscured by complex and dynamic backgrounds. Conventional multi-frame approaches typically learn motion implicitly through deep neural networks, often requiring additional motion supervision or explicit alignment modules. We propose Motion Integration DETR (MI-DETR), a bio-inspired dual-pathway detector that processes one infrared frame per time step while explicitly modeling motion. First, a retina-inspired cellular automaton (RCA) converts raw frame sequences into a motion map defined on the same pixel grid as the appearance image, enabling parvocellular-like appearance and magnocellular-like motion pathways to be supervised by a single set of bounding boxes without extra motion labels or alignment operations. Second, a Parvocellular-Magnocellular Interconnection (PMI) Block facilitates bidirectional feature interaction between the two pathways, providing a biologically motivated intermediate interconnection mechanism. Finally, a RT-DETR decoder operates on features from the two pathways to produce detection results. Surprisingly, our proposed simple yet effective approach yields strong performance on three commonly used ISTD benchmarks. MI-DETR achieves 70.3% mAP@50 and 72.7% F1 on IRDST-H (+26.35 mAP@50 over the best multi-frame baseline), 98.0% mAP@50 on DAUB-R, and 88.3% mAP@50 on ITSDT-15K, demonstrating the effectiveness of biologically inspired motion-appearance integration. Code is available at https://github.com/nliu-25/MI-DETR.
[195] UniM: A Unified Any-to-Any Interleaved Multimodal Benchmark
Yanlin Li, Minghui Guo, Kaiwen Zhang, Shize Zhang, Yiran Zhao, Haodong Li, Congyue Zhou, Weijie Zheng, Yushen Yan, Shengqiong Wu, Wei Ji, Lei Cui, Furu Wei, Hao Fei, Mong-Li Lee, Wynne Hsu
Main category: cs.CV
TL;DR: UniM benchmark for unified any-to-any interleaved multimodal learning with 31K instances across 7 modalities, requiring intertwined reasoning and generation capabilities.
Details
Motivation: Real-world multimodal applications require systems to comprehend arbitrarily combined and interleaved multimodal inputs while generating outputs in any interleaved multimedia form, necessitating a unified paradigm for any-to-any interleaved multimodal learning.Method: Introduces UniM benchmark with 31K high-quality instances across 30 domains and 7 modalities (text, image, audio, video, document, code, 3D), plus UniM Evaluation Suite assessing three dimensions, and UniMA baseline model with traceable reasoning for structured interleaved generation.
Result: Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges for advancing unified any-to-any multimodal intelligence.
Conclusion: UniM provides the first unified benchmark for any-to-any interleaved multimodal learning, establishing evaluation framework and baseline model to advance multimodal large language models.
Abstract: In real-world multimodal applications, systems usually need to comprehend arbitrarily combined and interleaved multimodal inputs from users, while also generating outputs in any interleaved multimedia form. This capability defines the goal of any-to-any interleaved multimodal learning under a unified paradigm of understanding and generation, posing new challenges and opportunities for advancing Multimodal Large Language Models (MLLMs). To foster and benchmark this capability, this paper introduces the UniM benchmark, the first Unified Any-to-Any Interleaved Multimodal dataset. UniM contains 31K high-quality instances across 30 domains and 7 representative modalities: text, image, audio, video, document, code, and 3D, each requiring multiple intertwined reasoning and generation capabilities. We further introduce the UniM Evaluation Suite, which assesses models along three dimensions: Semantic Correctness & Generation Quality, Response Structure Integrity, and Interleaved Coherence. In addition, we propose UniMA, an agentic baseline model equipped with traceable reasoning for structured interleaved generation. Comprehensive experiments demonstrate the difficulty of UniM and highlight key challenges and directions for advancing unified any-to-any multimodal intelligence. The project page is https://any2any-mllm.github.io/unim.
[196] MoRe: Motion-aware Feed-forward 4D Reconstruction Transformer
Juntong Fang, Zequn Chen, Weiqi Zhang, Donglin Di, Xuancheng Zhang, Chengmin Yang, Yu-Shen Liu
Main category: cs.CV
TL;DR: MoRe is a feedforward 4D reconstruction network that efficiently recovers dynamic 3D scenes from monocular videos using attention-forcing to disentangle motion from static structure.
Details
Motivation: Reconstructing dynamic 4D scenes is challenging due to moving objects corrupting camera pose estimation. Existing methods are computationally expensive and impractical for real-time applications.Method: Built on static reconstruction backbone, uses attention-forcing strategy to disentangle dynamic motion from static structure. Fine-tuned on large-scale diverse datasets. Employs grouped causal attention to capture temporal dependencies and adapt to varying token lengths across frames.
Result: Extensive experiments on multiple benchmarks demonstrate MoRe achieves high-quality dynamic reconstructions with exceptional efficiency.
Conclusion: MoRe provides an efficient feedforward solution for dynamic 4D scene reconstruction from monocular videos, addressing computational limitations of existing methods.
Abstract: Reconstructing dynamic 4D scenes remains challenging due to the presence of moving objects that corrupt camera pose estimation. Existing optimization methods alleviate this issue with additional supervision, but they are mostly computationally expensive and impractical in real-time applications. To address these limitations, we propose MoRe, a feedforward 4D reconstruction network that efficiently recovers dynamic 3D scenes from monocular videos. Built upon a strong static reconstruction backbone, MoRe employs an attention-forcing strategy to disentangle dynamic motion from static structure. To further enhance robustness, we fine-tune the model on large-scale, diverse datasets encompassing both dynamic and static scenes. Moreover, our grouped causal attention captures temporal dependencies and adapts to varying token lengths across frames, ensuring temporally coherent geometry reconstruction. Extensive experiments on multiple benchmarks demonstrate that MoRe achieves high-quality dynamic reconstructions with exceptional efficiency.
[197] Orthogonal Spatial-temporal Distributional Transfer for 4D Generation
Wei Liu, Shengqiong Wu, Bobo Li, Haoyu Zhao, Hao Fei, Mong-Li Lee, Wynne Hsu
Main category: cs.CV
TL;DR: A novel framework for 4D content generation that transfers spatial priors from 3D diffusion models and temporal priors from video diffusion models to overcome dataset limitations, using disentangled spatial-temporal diffusion with Orthogonal Distributional Transfer and ST-HexPlane integration.
Details
Motivation: Current 4D synthesis research is limited by the lack of large-scale 4D datasets, preventing models from learning spatial-temporal features needed for high-quality 4D generation. The paper aims to address this by transferring knowledge from existing 3D and video models.Method: Proposes STD-4D Diffusion model with disentangled spatial and temporal latents. Introduces Orthogonal Spatial-temporal Distributional Transfer (Orster) mechanism to model and inject spatiotemporal feature distributions. Designs spatial-temporal-aware HexPlane (ST-HexPlane) to integrate transferred features for improved 4D deformation and Gaussian feature modeling.
Result: Experiments show the method significantly outperforms existing approaches, achieving superior spatial-temporal consistency and higher-quality 4D synthesis.
Conclusion: The proposed framework effectively addresses the 4D dataset limitation by transferring priors from 3D and video diffusion models, enabling high-quality 4D content generation through innovative spatial-temporal disentanglement and feature integration techniques.
Abstract: In the AIGC era, generating high-quality 4D content has garnered increasing research attention. Unfortunately, current 4D synthesis research is severely constrained by the lack of large-scale 4D datasets, preventing models from adequately learning the critical spatial-temporal features necessary for high-quality 4D generation, thus hindering progress in this domain. To combat this, we propose a novel framework that transfers rich spatial priors from existing 3D diffusion models and temporal priors from video diffusion models to enhance 4D synthesis. We develop a spatial-temporal-disentangled 4D (STD-4D) Diffusion model, which synthesizes 4D-aware videos through disentangled spatial and temporal latents. To facilitate the best feature transfer, we design a novel Orthogonal Spatial-temporal Distributional Transfer (Orster) mechanism, where the spatiotemporal feature distributions are carefully modeled and injected into the STD-4D Diffusion. Furthermore, during the 4D construction, we devise a spatial-temporal-aware HexPlane (ST-HexPlane) to integrate the transferred spatiotemporal features, thereby improving 4D deformation and 4D Gaussian feature modeling. Experiments demonstrate that our method significantly outperforms existing approaches, achieving superior spatial-temporal consistency and higher-quality 4D synthesis.
[198] GEM-TFL: Bridging Weak and Full Supervision for Forgery Localization through EM-Guided Decomposition and Temporal Refinement
Xiaodong Zhu, Yuanming Zheng, Suting Wang, Junqi Yang, Yuhong Yang, Weiping Tu, Zhongyuan Wang
Main category: cs.CV
TL;DR: GEM-TFL is a weakly supervised temporal forgery localization method that uses graph-based EM optimization to identify manipulated segments in videos/audio with only binary video-level labels.
Details
Motivation: Current weakly supervised TFL methods suffer from mismatched training/inference objectives, limited supervision from binary labels, gradient blockage from non-differentiable operations, and lack of inter-proposal relationship modeling.Method: Two-phase classification-regression framework with: 1) EM-based optimization to reform binary labels into multi-dimensional latent attributes, 2) training-free temporal consistency refinement for smoother predictions, and 3) graph-based proposal refinement modeling temporal-semantic relationships.
Result: Extensive experiments show GEM-TFL achieves more accurate and robust temporal forgery localization, substantially narrowing the gap with fully supervised methods.
Conclusion: GEM-TFL effectively addresses key limitations in WS-TFL through its EM-based optimization and graph-based refinement, providing a practical solution for multimedia forensics with reduced labeling costs.
Abstract: Temporal Forgery Localization (TFL) aims to precisely identify manipulated segments within videos or audio streams, providing interpretable evidence for multimedia forensics and security. While most existing TFL methods rely on dense frame-level labels in a fully supervised manner, Weakly Supervised TFL (WS-TFL) reduces labeling cost by learning only from binary video-level labels. However, current WS-TFL approaches suffer from mismatched training and inference objectives, limited supervision from binary labels, gradient blockage caused by non-differentiable top-k aggregation, and the absence of explicit modeling of inter-proposal relationships. To address these issues, we propose GEM-TFL (Graph-based EM-powered Temporal Forgery Localization), a two-phase classification-regression framework that effectively bridges the supervision gap between training and inference. Built upon this foundation, (1) we enhance weak supervision by reformulating binary labels into multi-dimensional latent attributes through an EM-based optimization process; (2) we introduce a training-free temporal consistency refinement that realigns frame-level predictions for smoother temporal dynamics; and (3) we design a graph-based proposal refinement module that models temporal-semantic relationships among proposals for globally consistent confidence estimation. Extensive experiments on benchmark datasets demonstrate that GEM-TFL achieves more accurate and robust temporal forgery localization, substantially narrowing the gap with fully supervised methods.
[199] Diff-ES: Stage-wise Structural Diffusion Pruning via Evolutionary Search
Zongfang Liu, Shengkun Tang, Zongliang Wu, Xin Yuan, Zhiqiang Shen
Main category: cs.CV
TL;DR: Diff-ES is an evolutionary search-based framework for stage-wise structural pruning of diffusion models that optimizes sparsity schedules and enables memory-efficient weight routing without model duplication.
Details
Motivation: Current diffusion models are computationally demanding due to multi-step denoising and large model sizes. Existing pruning methods struggle to balance real acceleration and image quality preservation, often relying on heuristic sparsity schedules and requiring model duplication during inference.Method: Diff-ES divides the diffusion trajectory into multiple stages, uses evolutionary search to automatically discover optimal stage-wise sparsity schedules, and implements memory-efficient weight routing without duplicating model parameters. It integrates with existing structured pruning methods like depth and width pruning.
Result: Extensive experiments on DiT and SDXL show that Diff-ES consistently achieves wall-clock speedups while maintaining minimal degradation in generation quality, establishing state-of-the-art performance for structured diffusion model pruning.
Conclusion: Diff-ES provides an effective framework for efficient diffusion model pruning that automatically optimizes sparsity schedules and enables practical acceleration without sacrificing image quality or requiring excessive memory overhead.
Abstract: Diffusion models have achieved remarkable success in high-fidelity image generation but remain computationally demanding due to their multi-step denoising process and large model sizes. Although prior work improves efficiency either by reducing sampling steps or by compressing model parameters, existing structured pruning approaches still struggle to balance real acceleration and image quality preservation. In particular, prior methods such as MosaicDiff rely on heuristic, manually tuned stage-wise sparsity schedules and stitch multiple independently pruned models during inference, which increases memory overhead. However, the importance of diffusion steps is highly non-uniform and model-dependent. As a result, schedules derived from simple heuristics or empirical observations often fail to generalize and may lead to suboptimal performance. To this end, we introduce \textbf{Diff-ES}, a stage-wise structural \textbf{Diff}usion pruning framework via \textbf{E}volutionary \textbf{S}earch, which optimizes the stage-wise sparsity schedule and executes it through memory-efficient weight routing without model duplication. Diff-ES divides the diffusion trajectory into multiple stages, automatically discovers an optimal stage-wise sparsity schedule via evolutionary search, and activates stage-conditioned weights dynamically without duplicating model parameters. Our framework naturally integrates with existing structured pruning methods for diffusion models including depth and width pruning. Extensive experiments on DiT and SDXL demonstrate that Diff-ES consistently achieves wall-clock speedups while incurring minimal degradation in generation quality, establishing state-of-the-art performance for structured diffusion model pruning.
[200] BLINK: Behavioral Latent Modeling of NK Cell Cytotoxicity
Iman Nematollahi, Jose Francisco Villena-Ossa, Alina Moter, Kiana Farhadyar, Gabriel Kalweit, Abhinav Valada, Toni Cathomen, Evelyn Ullrich, Maria Kalweit
Main category: cs.CV
TL;DR: BLINK is a trajectory-based recurrent state-space model that learns latent interaction dynamics from NK-tumor cell interactions to predict cytotoxic outcomes and enable forecasting, with interpretable behavioral modes.
Details
Motivation: Current methods for studying NK cell cytotoxicity rely on frame-wise classification, which cannot reliably capture the temporal dynamics and interactions that emerge over time to produce cytotoxic outcomes. There's a need for models that can learn from partially observed interaction sequences and predict outcomes based on accumulated dynamics.Method: BLINK uses a trajectory-based recurrent state-space model that learns latent interaction dynamics from partially observed NK-tumor interaction sequences. It predicts apoptosis increments that accumulate into cytotoxic outcomes, creating an interpretable latent representation that organizes NK trajectories into behavioral modes and temporally structured interaction phases.
Result: Experiments on long-term time-lapse NK-tumor recordings show improved cytotoxic outcome detection and enable forecasting of future outcomes. The model provides interpretable latent representations that organize NK trajectories into coherent behavioral modes and temporally structured interaction phases.
Conclusion: BLINK provides a unified framework for quantitative evaluation and structured modeling of NK cytotoxic behavior at the single-cell level, offering improved outcome prediction and interpretable representations of cellular interaction dynamics.
Abstract: Machine learning models of cellular interaction dynamics hold promise for understanding cell behavior. Natural killer (NK) cell cytotoxicity is a prominent example of such interaction dynamics and is commonly studied using time-resolved multi-channel fluorescence microscopy. Although tumor cell death events can be annotated at single frames, NK cytotoxic outcome emerges over time from cellular interactions and cannot be reliably inferred from frame-wise classification alone. We introduce BLINK, a trajectory-based recurrent state-space model that serves as a cell world model for NK-tumor interactions. BLINK learns latent interaction dynamics from partially observed NK-tumor interaction sequences and predicts apoptosis increments that accumulate into cytotoxic outcomes. Experiments on long-term time-lapse NK-tumor recordings show improved cytotoxic outcome detection and enable forecasting of future outcomes, together with an interpretable latent representation that organizes NK trajectories into coherent behavioral modes and temporally structured interaction phases. BLINK provides a unified framework for quantitative evaluation and structured modeling of NK cytotoxic behavior at the single-cell level.
[201] UniPAR: A Unified Framework for Pedestrian Attribute Recognition
Minghe Xu, Rouying Wu, Jiarui Xu, Minhao Sun, Zikang Yan, Xiao Wang, ChiaWei Chu, Yu Li
Main category: cs.CV
TL;DR: UniPAR is a unified Transformer framework for Pedestrian Attribute Recognition that handles multiple datasets and modalities (RGB, video, event streams) with a single model, improving cross-domain generalization.
Details
Motivation: Existing PAR research suffers from the "one-model-per-dataset" limitation and struggles with domain discrepancies across modalities, attribute definitions, and environmental scenarios.Method: Proposes UniPAR with unified data scheduling, dynamic classification head, and phased fusion encoder that aligns visual features with textual attribute queries through late deep fusion.
Result: Achieves performance comparable to specialized SOTA methods on MSP60K, DukeMTMC, and EventPAR datasets, with multi-dataset training enhancing cross-domain generalization and robustness in extreme conditions.
Conclusion: UniPAR demonstrates that a unified framework can effectively handle diverse PAR datasets and modalities while improving generalization capabilities beyond specialized models.
Abstract: Pedestrian Attribute Recognition is a foundational computer vision task that provides essential support for downstream applications, including person retrieval in video surveillance and intelligent retail analytics. However, existing research is frequently constrained by the ``one-model-per-dataset” paradigm and struggles to handle significant discrepancies across domains in terms of modalities, attribute definitions, and environmental scenarios. To address these challenges, we propose UniPAR, a unified Transformer-based framework for PAR. By incorporating a unified data scheduling strategy and a dynamic classification head, UniPAR enables a single model to simultaneously process diverse datasets from heterogeneous modalities, including RGB images, video sequences, and event streams. We also introduce an innovative phased fusion encoder that explicitly aligns visual features with textual attribute queries through a late deep fusion strategy. Experimental results on the widely used benchmark datasets, including MSP60K, DukeMTMC, and EventPAR, demonstrate that UniPAR achieves performance comparable to specialized SOTA methods. Furthermore, multi-dataset joint training significantly enhances the model’s cross-domain generalization and recognition robustness in extreme environments characterized by low light and motion blur. The source code of this paper will be released on https://github.com/Event-AHU/OpenPAR
[202] SRasP: Self-Reorientation Adversarial Style Perturbation for Cross-Domain Few-Shot Learning
Wenqian Li, Pengfei Fang, Hui Xue
Main category: cs.CV
TL;DR: SRasP: A novel crop-global style perturbation network for Cross-Domain Few-Shot Learning that stabilizes training and improves generalization to unseen domains through semantic-guided style perturbation and multi-objective optimization.
Details
Motivation: Existing style-based perturbation methods for Cross-Domain Few-Shot Learning suffer from gradient instability and convergence to sharp minima, limiting their effectiveness in transferring knowledge from seen source domains to unseen target domains.Method: Proposes Self-Reorientation Adversarial Style Perturbation (SRasP) which uses global semantic guidance to identify incoherent crops, reorients and aggregates their style gradients with global style gradients, and employs a multi-objective optimization function to maximize visual discrepancy while maintaining semantic consistency.
Result: Extensive experiments on multiple CD-FSL benchmarks demonstrate consistent improvements over state-of-the-art methods, showing better generalization to unseen domains.
Conclusion: SRasP stabilizes perturbations during training, encouraging convergence toward flatter and more transferable solutions, thereby improving model robustness and transferability in cross-domain few-shot learning scenarios.
Abstract: Cross-Domain Few-Shot Learning (CD-FSL) aims to transfer knowledge from a seen source domain to unseen target domains, serving as a key benchmark for evaluating the robustness and transferability of models. Existing style-based perturbation methods mitigate domain shift but often suffer from gradient instability and convergence to sharp minima.To address these limitations, we propose a novel crop-global style perturbation network, termed Self-Reorientation Adversarial \underline{S}tyle \underline{P}erturbation (SRasP). Specifically, SRasP leverages global semantic guidance to identify incoherent crops, followed by reorienting and aggregating the style gradients of these crops with the global style gradients within one image. Furthermore, we propose a novel multi-objective optimization function to maximize visual discrepancy while enforcing semantic consistency among global, crop, and adversarial features. Applying the stabilized perturbations during training encourages convergence toward flatter and more transferable solutions, improving generalization to unseen domains. Extensive experiments are conducted on multiple CD-FSL benchmarks, demonstrating consistent improvements over state-of-the-art methods.
[203] Act, Think or Abstain: Complexity-Aware Adaptive Inference for Vision-Language-Action Models
Riccardo Andrea Izzo, Gianluca Bardaro, Matteo Matteucci
Main category: cs.CV
TL;DR: An adaptive VLA framework that dynamically routes execution based on task complexity using vision-only embeddings for efficient complexity detection.
Details
Motivation: Current VLA models use reasoning techniques that increase computational complexity and inference latency, apply resources indiscriminately, and lack uncertainty estimation for out-of-distribution tasks.Method: Proposes an adaptive framework that transforms VLA’s vision-language backbone into an active detection tool by projecting latent embeddings into ensemble estimators. Dynamically routes execution: Act (known tasks), Think (ambiguous scenarios), Abstain (significant anomalies). Uses vision-only embeddings for complexity detection due to semantic invariance of language.
Result: Vision-only configuration achieves 80% F1-Score using only 5% of training data on LIBERO and LIBERO-PRO benchmarks, establishing reliable and efficient task complexity detection. Validated on real robot.
Conclusion: The adaptive framework provides efficient resource allocation, uncertainty estimation, and prevents catastrophic failure on out-of-distribution tasks while maintaining performance with minimal training data.
Abstract: Current research on Vision-Language-Action (VLA) models predominantly focuses on enhancing generalization through established reasoning techniques. While effective, these improvements invariably increase computational complexity and inference latency. Furthermore, these mechanisms are typically applied indiscriminately, resulting in the inefficient allocation of resources for trivial tasks while simultaneously failing to provide the uncertainty estimation necessary to prevent catastrophic failure on out-of-distribution tasks. Inspired by human cognition, we propose an adaptive framework that dynamically routes VLA execution based on the complexity of the perceived state. Our approach transforms the VLA’s vision-language backbone into an active detection tool by projecting latent embeddings into an ensemble of parametric and non-parametric estimators. This allows the system to execute known tasks immediately (Act), reason about ambiguous scenarios (Think), and preemptively halt execution when encountering significant physical or semantic anomalies (Abstain). In our empirical analysis, we observe a phenomenon where visual embeddings alone are superior for inferring task complexity due to the semantic invariance of language. Evaluated on the LIBERO and LIBERO-PRO benchmarks as well as on a real robot, our vision-only configuration achieves 80% F1-Score using as little as 5% of training data, establishing itself as a reliable and efficient task complexity detector.
[204] SSR-GS: Separating Specular Reflection in Gaussian Splatting for Glossy Surface Reconstruction
Ningjing Fan, Yiqun Wang
Main category: cs.CV
TL;DR: SSR-GS is a 3D Gaussian splatting framework that improves glossy surface reconstruction by modeling both direct and indirect specular reflections with Mip-Cubemap and IndiASG modules, and using visual geometry priors to handle reflection-dominated regions.
Details
Motivation: 3D Gaussian splatting has advanced novel view synthesis but struggles with accurately reconstructing glossy surfaces under complex illumination, especially with strong specular reflections and multi-surface interreflections.Method: Proposes SSR-GS with: 1) Prefiltered Mip-Cubemap for direct specular reflections, 2) IndiASG module for indirect specular reflections, 3) Visual Geometry Priors (VGP) coupling reflection-aware visual prior via reflection score to downweight photometric loss in reflection-dominated regions, and geometry priors from VGGT including progressively decayed depth supervision and transformed normal constraints.
Result: Extensive experiments on synthetic and real-world datasets show SSR-GS achieves state-of-the-art performance in glossy surface reconstruction.
Conclusion: SSR-GS effectively addresses the challenge of glossy surface reconstruction in 3D Gaussian splatting by modeling specular reflections and incorporating visual geometry priors.
Abstract: In recent years, 3D Gaussian splatting (3DGS) has achieved remarkable progress in novel view synthesis. However, accurately reconstructing glossy surfaces under complex illumination remains challenging, particularly in scenes with strong specular reflections and multi-surface interreflections. To address this issue, we propose SSR-GS, a specular reflection modeling framework for glossy surface reconstruction. Specifically, we introduce a prefiltered Mip-Cubemap to model direct specular reflections efficiently, and propose an IndiASG module to capture indirect specular reflections. Furthermore, we design Visual Geometry Priors (VGP) that couple a reflection-aware visual prior via a reflection score (RS) to downweight the photometric loss contribution of reflection-dominated regions, with geometry priors derived from VGGT, including progressively decayed depth supervision and transformed normal constraints. Extensive experiments on both synthetic and real-world datasets demonstrate that SSR-GS achieves state-of-the-art performance in glossy surface reconstruction.
[205] Generic Camera Calibration using Blurry Images
Zezhun Shi
Main category: cs.CV
TL;DR: Proposes a method to simultaneously estimate feature locations and spatially varying point spread functions for camera calibration using geometric constraints and local parametric illumination model, addressing motion blur issues in generic camera calibration.
Details
Motivation: Generic camera calibration requires many images, making motion blur unavoidable for individual users. Conventional deblurring methods don't handle the translational ambiguity in calibration tasks.Method: Uses geometric constraints and local parametric illumination model to simultaneously estimate feature locations and spatially varying point spread functions, while resolving translational ambiguity unique to calibration tasks.
Result: Experimental results validate the effectiveness of the approach for camera calibration with motion blur.
Conclusion: The proposed method successfully addresses motion blur in generic camera calibration by simultaneously estimating features and blur kernels using geometric constraints.
Abstract: Camera calibration is the foundation of 3D vision. Generic camera calibration can yield more accurate results than parametric cam era calibration. However, calibrating a generic camera model using printed calibration boards requires far more images than parametric calibration, making motion blur practically unavoidable for individual users. As a f irst attempt to address this problem, we draw on geometric constraints and a local parametric illumination model to simultaneously estimate feature locations and spatially varying point spread functions, while re solving the translational ambiguity that need not be considered in con ventional image deblurring tasks. Experimental results validate the effectiveness of our approach.
[206] Mario: Multimodal Graph Reasoning with Large Language Models
Yuanfu Sun, Kang Li, Pengkang Guo, Jiajin Liu, Qiaoyu Tan
Main category: cs.CV
TL;DR: Mario: A unified LLM-based framework for multimodal graph reasoning that addresses cross-modal consistency and modality preference challenges through graph-conditioned VLM design and modality-adaptive instruction tuning.
Details
Motivation: Existing methods rely on pretrained VLMs that encode image-text pairs in isolation, ignoring the relational structure of real-world multimodal data. This motivates reasoning on multimodal graphs where nodes have textual/visual attributes and edges provide structural cues.Method: Two-stage approach: 1) Graph-conditioned VLM design that jointly refines textual/visual features through fine-grained cross-modal contrastive learning guided by graph topology. 2) Modality-adaptive graph instruction tuning that organizes aligned multimodal features into graph-aware instruction views with a learnable router to select optimal modality configurations for LLM reasoning.
Result: Extensive experiments across diverse MMG benchmarks show Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot scenarios for node classification and link prediction.
Conclusion: Mario provides an effective framework for LLM-based reasoning over multimodal graphs by addressing key challenges of cross-modal consistency and heterogeneous modality preference through innovative graph-conditioned VLM design and adaptive instruction tuning.
Abstract: Recent advances in large language models (LLMs) have opened new avenues for multimodal reasoning. Yet, most existing methods still rely on pretrained vision-language models (VLMs) to encode image-text pairs in isolation, ignoring the relational structure that real-world multimodal data naturally form. This motivates reasoning on multimodal graphs (MMGs), where each node has textual and visual attributes and edges provide structural cues. Enabling LLM-based reasoning on such heterogeneous multimodal signals while preserving graph topology introduces two key challenges: resolving weak cross-modal consistency and handling heterogeneous modality preference. To address this, we propose Mario, a unified framework that simultaneously resolves the two above challenges and enables effective LLM-based reasoning over MMGs. Mario consists of two innovative stages. Firstly, a graph-conditioned VLM design that jointly refines textual and visual features through fine-grained cross-modal contrastive learning guided by graph topology. Secondly, a modality-adaptive graph instruction tuning mechanism that organizes aligned multimodal features into graph-aware instruction views and employs a learnable router to surface, for each node and its neighborhood, the most informative modality configuration to the LLM. Extensive experiments across diverse MMG benchmarks demonstrate that Mario consistently outperforms state-of-the-art graph models in both supervised and zero-shot scenarios for node classification and link prediction. The code will be made available at https://github.com/sunyuanfu/Mario.
[207] Logi-PAR: Logic-Infused Patient Activity Recognition via Differentiable Rule
Muhammad Zarar, MingZheng Zhang, Xiaowang Zhang, Zhiyong Feng, Sofonias Yitagesu, Kawsar Farooq
Main category: cs.CV
TL;DR: Logi-PAR: A logic-infused framework for patient activity recognition that learns explicit logic rules from visual cues, providing auditable explanations and counterfactual reasoning beyond standard classification.
Details
Motivation: Current patient activity recognition models only identify what activities are occurring but lack reasoning about why visual cues imply risks. Clinical safety requires methods that can compositionally reason through explicit logic beyond mere classification.Method: Proposes Logi-PAR, a logic-infused framework that integrates contextual fact fusion as a multi-view primitive extractor and injects neural-guided differentiable rules. It automatically learns rules from visual cues, optimizing them end-to-end while enabling explicit labeling of emergent patterns.
Result: Achieves state-of-the-art performance on clinical benchmarks (VAST and OmniFall), significantly outperforming Vision-Language Models and transformer baselines. Provides auditable why explanations as rule traces and supports counterfactual interventions.
Conclusion: Logi-PAR is the first framework to recognize patient activity by applying learnable logic rules to symbolic mappings, advancing clinical safety through explicit reasoning and interpretable explanations.
Abstract: Patient Activity Recognition (PAR) in clinical settings uses activity data to improve safety and quality of care. Although significant progress has been made, current models mainly identify which activity is occurring. They often spatially compose sub-sparse visual cues using global and local attention mechanisms, yet only learn logically implicit patterns due to their neural-pipeline. Advancing clinical safety requires methods that can infer why a set of visual cues implies a risk, and how these can be compositionally reasoned through explicit logic beyond mere classification. To address this, we proposed Logi-PAR, the first Logic-Infused Patient Activity Recognition Framework that integrates contextual fact fusion as a multi-view primitive extractor and injects neural-guided differentiable rules. Our method automatically learns rules from visual cues, optimizing them end-to-end while enabling the implicit emergence patterns to be explicitly labelled during training. To the best of our knowledge, Logi-PAR is the first framework to recognize patient activity by applying learnable logic rules to symbolic mappings. It produces auditable why explanations as rule traces and supports counterfactual interventions (e.g., risk would decrease by 65% if assistance were present). Extensive evaluation on clinical benchmarks (VAST and OmniFall) demonstrates state-of-the-art performance, significantly outperforming Vision-Language Models and transformer baselines. The code is available via: https://github.com/zararkhan985/Logi-PAR.git}
[208] Semantic Class Distribution Learning for Debiasing Semi-Supervised Medical Image Segmentation
Yingxue Su, Yiheng Zhong, Keying Zhu, Zimu Zhang, Zhuoru Zhang, Yifang Wang, Yuxin Zhang, Jingxin Liu
Main category: cs.CV
TL;DR: SCDL is a plug-and-play framework for medical image segmentation that addresses class imbalance by learning structured class-conditional feature distributions through class distribution alignment and semantic anchor constraints.
Details
Motivation: Medical image segmentation faces challenges with expensive pixel-level annotation and severe class imbalance, where minority structures are overwhelmed by dominant classes in feature representations, hindering discriminative feature learning.Method: Proposes Semantic Class Distribution Learning (SCDL) framework with Class Distribution Bidirectional Alignment (CDBA) to align embeddings with learnable class proxies, and Semantic Anchor Constraints (SAC) to guide proxies using labeled data.
Result: Experiments on Synapse and AMOS datasets show significant improvements in segmentation performance across overall and class-level metrics, with strong gains on minority classes, achieving state-of-the-art results.
Conclusion: SCDL effectively mitigates supervision and representation biases in imbalanced medical image segmentation through structured class distribution learning.
Abstract: Medical image segmentation is critical for computer-aided diagnosis. However, dense pixel-level annotation is time-consuming and expensive, and medical datasets often exhibit severe class imbalance. Such imbalance causes minority structures to be overwhelmed by dominant classes in feature representations, hindering the learning of discriminative features and making reliable segmentation particularly challenging. To address this, we propose the Semantic Class Distribution Learning (SCDL) framework, a plug-and-play module that mitigates supervision and representation biases by learning structured class-conditional feature distributions. SCDL integrates Class Distribution Bidirectional Alignment (CDBA) to align embeddings with learnable class proxies and leverages Semantic Anchor Constraints (SAC) to guide proxies using labeled data. Experiments on the Synapse and AMOS datasets demonstrate that SCDL significantly improves segmentation performance across both overall and class-level metrics, with particularly strong gains on minority classes, achieving state-of-the-art results. Our code is released at https://github.com/Zyh55555/SCDL.
[209] SPyCer: Semi-Supervised Physics-Guided Contextual Attention for Near-Surface Air Temperature Estimation from Satellite Imagery
Sofiane Bouaziz, Adel Hafiane, Raphael Canals, Rachid Nedjai
Main category: cs.CV
TL;DR: SPyCer is a semi-supervised physics-guided network that uses satellite imagery to estimate near-surface air temperature (NSAT) by combining pixel information with physical modeling through surface energy balance and advection-diffusion-reaction equations.
Details
Motivation: Satellites capture surface properties well but many important atmospheric phenomena occur near the ground. Near-ground sensors provide accurate NSAT measurements but are sparse and unevenly distributed, limiting continuous spatial coverage. There's a need to bridge this gap using satellite imagery as a proxy for continuous NSAT estimation.Method: Frames NSAT prediction as pixel-wise vision problem where each sensor is projected onto satellite image coordinates. Uses semi-supervised approach: sensor pixels supervised with observed NSAT and physics-based constraints, while surrounding pixels contribute through physics-guided regularization from surface energy balance and advection-diffusion-reaction PDEs. Employs multi-head attention guided by land cover characteristics with Gaussian distance weighting to capture physical influence of neighboring pixels.
Result: Experiments on real-world datasets show SPyCer produces spatially coherent and physically consistent NSAT estimates, outperforming existing baselines in accuracy, generalization, and alignment with underlying physical processes.
Conclusion: SPyCer successfully bridges the gap between sparse ground sensors and continuous spatial coverage by combining satellite imagery with physics-guided learning, providing physically consistent NSAT estimation that generalizes well.
Abstract: Modern Earth observation relies on satellites to capture detailed surface properties. Yet, many phenomena that affect humans and ecosystems unfold in the atmosphere close to the surface. Near-ground sensors provide accurate measurements of certain environmental characteristics, such as near-surface air temperature (NSAT). However, they remain sparse and unevenly distributed, limiting their ability to provide continuous spatial measurements. To bridge this gap, we introduce SPyCer, a semi-supervised physics-guided network that can leverage pixel information and physical modeling to guide the learning process through meaningful physical properties. It is designed for continuous estimation of NSAT by proxy using satellite imagery. SPyCer frames NSAT prediction as a pixel-wise vision problem, where each near-ground sensor is projected onto satellite image coordinates and positioned at the center of a local image patch. The corresponding sensor pixel is supervised using both observed NSAT and physics-based constraints, while surrounding pixels contribute through physics-guided regularization derived from the surface energy balance and advection-diffusion-reaction partial differential equations. To capture the physical influence of neighboring pixels, SPyCer employs a multi-head attention guided by land cover characteristics and modulated with Gaussian distance weighting. Experiments on real-world datasets demonstrate that SPyCer produces spatially coherent and physically consistent NSAT estimates, outperforming existing baselines in terms of accuracy, generalization, and alignment with underlying physical processes.
[210] Digital Twin Driven Textile Classification and Foreign Object Recognition in Automated Sorting Systems
Serkan Ergun, Tobias Mitterer, Hubert Zangl
Main category: cs.CV
TL;DR: A digital twin-driven robotic system for textile sorting using multimodal perception, grasp prediction, and Visual Language Models (VLMs) for garment classification and foreign object detection in cluttered environments.
Details
Motivation: Addressing the need for sustainable textile recycling automation that can handle deformable garments and detect foreign objects in cluttered industrial environments, requiring robust multimodal perception and reasoning capabilities.Method: Developed a dual-arm robotic system with RGBD sensing, capacitive tactile feedback, and collision-aware motion planning. Integrated digital twin technology with MoveIt for path planning. Used nine VLMs from five model families for garment classification on a dataset of 223 inspection scenarios, evaluating accuracy, hallucination behavior, and computational performance.
Result: Qwen model family achieved highest overall accuracy (87.9%) with strong foreign object detection. Lighter models like Gemma3 offered competitive speed-accuracy trade-offs for edge deployment. Digital twin integration improved manipulation reliability through segmented 3D point cloud integration.
Conclusion: The system demonstrates feasibility of combining semantic VLM reasoning with conventional grasp detection and digital twin technology for scalable, autonomous textile sorting in realistic industrial settings.
Abstract: The increasing demand for sustainable textile recycling requires robust automation solutions capable of handling deformable garments and detecting foreign objects in cluttered environments. This work presents a digital twin driven robotic sorting system that integrates grasp prediction, multi modal perception, and semantic reasoning for real world textile classification. A dual arm robotic cell equipped with RGBD sensing, capacitive tactile feedback, and collision-aware motion planning autonomously separates garments from an unsorted basket, transfers them to an inspection zone, and classifies them using state of the art Visual Language Models (VLMs). We benchmark nine VLM s from five model families on a dataset of 223 inspection scenarios comprising shirts, socks, trousers, underwear, foreign objects (including garments outside of the aforementioned classes), and empty scenes. The evaluation assesses per class accuracy, hallucination behavior, and computational performance under practical hardware constraints. Results show that the Qwen model family achieves the highest overall accuracy (up to 87.9 %), with strong foreign object detection performance, while lighter models such as Gemma3 offer competitive speed accuracy trade offs for edge deployment. A digital twin combined with MoveIt enables collision aware path planning and integrates segmented 3D point clouds of inspected garments into the virtual environment for improved manipulation reliability. The presented system demonstrates the feasibility of combining semantic VLM reasoning with conventional grasp detection and digital twin technology for scalable, autonomous textile sorting in realistic industrial settings.
[211] CATNet: Collaborative Alignment and Transformation Network for Cooperative Perception
Gong Chen, Chaokun Zhang, Tao Tang, Pengcheng Lv, Feng Li, Xin Xie
Main category: cs.CV
TL;DR: CATNet is a framework for multi-agent cooperative perception that addresses temporal latency and noise interference through spatio-temporal synchronization, wavelet-based denoising, and adaptive feature selection.
Details
Motivation: Existing cooperative perception research overlooks critical real-world challenges like high temporal latency and multi-source noise in multi-agent systems, which degrade performance in practical applications.Method: Three key components: 1) Spatio-Temporal Recurrent Synchronization (STSync) aligns asynchronous feature streams using adjacent-frame differential modeling; 2) Dual-Branch Wavelet Enhanced Denoiser (WTDen) suppresses global noise and reconstructs localized feature distortions; 3) Adaptive Feature Selector (AdpSel) dynamically focuses on critical perceptual features for robust fusion.
Result: Extensive experiments on multiple datasets show CATNet consistently outperforms existing methods under complex traffic conditions, demonstrating superior robustness and adaptability.
Conclusion: CATNet effectively addresses practical limitations in real-world multi-agent cooperative perception by solving temporal latency and noise interference problems, making it suitable for complex real-world applications.
Abstract: Cooperative perception significantly enhances scene understanding by integrating complementary information from diverse agents. However, existing research often overlooks critical challenges inherent in real-world multi-source data integration, specifically high temporal latency and multi-source noise. To address these practical limitations, we propose Collaborative Alignment and Transformation Network (CATNet), an adaptive compensation framework that resolves temporal latency and noise interference in multi-agent systems. Our key innovations can be summarized in three aspects. First, we introduce a Spatio-Temporal Recurrent Synchronization (STSync) that aligns asynchronous feature streams via adjacent-frame differential modeling, establishing a temporal-spatially unified representation space. Second, we design a Dual-Branch Wavelet Enhanced Denoiser (WTDen) that suppresses global noise and reconstructs localized feature distortions within aligned representations. Third, we construct an Adaptive Feature Selector (AdpSel) that dynamically focuses on critical perceptual features for robust fusion. Extensive experiments on multiple datasets demonstrate that CATNet consistently outperforms existing methods under complex traffic conditions, proving its superior robustness and adaptability.
[212] Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum
Shan Ning, Longtian Qiu, Xuming He
Main category: cs.CV
TL;DR: Wiki-R1: A curriculum reinforcement learning framework that improves multimodal LLMs for knowledge-based VQA by generating progressively difficult training data and selecting informative samples.
Details
Motivation: KB-VQA requires integrating external knowledge with visual understanding, but faces challenges from noisy retrieval and distribution gaps between pretrained MLLMs and structured knowledge bases, making reasoning and domain adaptation difficult.Method: Proposes Wiki-R1 with controllable curriculum data generation (manipulating retriever for desired difficulty levels) and curriculum sampling strategy (selecting informative samples likely to yield non-zero advantages in RL). Uses observed rewards to estimate sample difficulty and propagate to unobserved samples.
Result: Achieves new SOTA on two KB-VQA benchmarks: improves accuracy from 35.5% to 37.1% on Encyclopedic VQA and from 40.1% to 44.1% on InfoSeek.
Conclusion: Wiki-R1 effectively bridges the distribution gap between pretrained MLLMs and KB-VQA through curriculum reinforcement learning, demonstrating significant improvements in knowledge-based visual question answering.
Abstract: Knowledge-Based Visual Question Answering (KB-VQA) requires models to answer questions about an image by integrating external knowledge, posing significant challenges due to noisy retrieval and the structured, encyclopedic nature of the knowledge base. These characteristics create a distributional gap from pretrained multimodal large language models (MLLMs), making effective reasoning and domain adaptation difficult in the post-training stage. In this work, we propose \textit{Wiki-R1}, a data-generation-based curriculum reinforcement learning framework that systematically incentivizes reasoning in MLLMs for KB-VQA. Wiki-R1 constructs a sequence of training distributions aligned with the model’s evolving capability, bridging the gap from pretraining to the KB-VQA target distribution. We introduce \textit{controllable curriculum data generation}, which manipulates the retriever to produce samples at desired difficulty levels, and a \textit{curriculum sampling strategy} that selects informative samples likely to yield non-zero advantages during RL updates. Sample difficulty is estimated using observed rewards and propagated to unobserved samples to guide learning. Experiments on two KB-VQA benchmarks, Encyclopedic VQA and InfoSeek, demonstrate that Wiki-R1 achieves new state-of-the-art results, improving accuracy from 35.5% to 37.1% on Encyclopedic VQA and from 40.1% to 44.1% on InfoSeek. The project page is available at https://artanic30.github.io/project_pages/WikiR1/.
[213] Layer by layer, module by module: Choose both for optimal OOD probing of ViT
Ambroise Odonnat, Vasilii Feofanov, Laetitia Chapel, Romain Tavenard, Ievgen Redko
Main category: cs.CV
TL;DR: Intermediate layers in vision transformers often outperform final layers due to distribution shifts between pretraining and downstream data, with optimal probing locations varying based on shift severity.
Details
Motivation: To understand why intermediate layers of foundation models often yield more discriminative representations than final layers, and to analyze this phenomenon specifically in pretrained vision transformers.Method: Conducted comprehensive linear probing experiments across diverse image classification benchmarks, performed fine-grained module-level analysis of transformer blocks, and examined different probing locations within the architecture.
Result: Found that distribution shift between pretraining and downstream data is the primary cause of performance degradation in deeper layers. Optimal probing location depends on shift severity: feedforward network activations work best under significant distribution shift, while normalized multi-head self-attention outputs are optimal with weak shift.
Conclusion: Standard probing of transformer block outputs is suboptimal; understanding distribution shift effects and selecting appropriate probing locations within transformer modules can significantly improve representation quality for downstream tasks.
Abstract: Recent studies have observed that intermediate layers of foundation models often yield more discriminative representations than the final layer. While initially attributed to autoregressive pretraining, this phenomenon has also been identified in models trained via supervised and discriminative self-supervised objectives. In this paper, we conduct a comprehensive study to analyze the behavior of intermediate layers in pretrained vision transformers. Through extensive linear probing experiments across a diverse set of image classification benchmarks, we find that distribution shift between pretraining and downstream data is the primary cause of performance degradation in deeper layers. Furthermore, we perform a fine-grained analysis at the module level. Our findings reveal that standard probing of transformer block outputs is suboptimal; instead, probing the activation within the feedforward network yields the best performance under significant distribution shift, whereas the normalized output of the multi-head self-attention module is optimal when the shift is weak.
[214] Fusion4CA: Boosting 3D Object Detection via Comprehensive Image Exploitation
Kang Luo, Xin Chen, Yangyi Xiao, Hesheng Wang
Main category: cs.CV
TL;DR: Fusion4CA enhances 3D object detection by better exploiting RGB information in LiDAR-RGB fusion, using contrastive alignment and camera auxiliary branches to improve multimodal understanding.
Details
Motivation: Existing LiDAR-RGB fusion methods for 3D object detection in autonomous driving over-rely on LiDAR data and insufficiently explore RGB information, limiting their performance and generalization capabilities.Method: Built on BEVFusion framework with plug-and-play components: 1) contrastive alignment module to calibrate image features with 3D geometry, 2) camera auxiliary branch to mine RGB information during training, 3) off-the-shelf cognitive adapter to leverage pretrained image weights, and 4) coordinate attention module in fusion stage.
Result: Achieves 69.7% mAP on nuScenes dataset with only 6 training epochs (vs 20 epochs for baseline), with 1.2% improvement over baseline and only 3.48% increase in inference parameters. Validated in simulated lunar environment showing good generalization.
Conclusion: Fusion4CA effectively addresses RGB underutilization in multimodal 3D detection, achieving superior performance with minimal parameter overhead and demonstrating strong generalization to different environments.
Abstract: Nowadays, an increasing number of works fuse LiDAR and RGB data in the bird’s-eye view (BEV) space for 3D object detection in autonomous driving systems. However, existing methods suffer from over-reliance on the LiDAR branch, with insufficient exploration of RGB information. To tackle this issue, we propose Fusion4CA, which is built upon the classic BEVFusion framework and dedicated to fully exploiting visual input with plug-and-play components. Specifically, a contrastive alignment module is designed to calibrate image features with 3D geometry, and a camera auxiliary branch is introduced to mine RGB information sufficiently during training. For further performance enhancement, we leverage an off-the-shelf cognitive adapter to make the most of pretrained image weights, and integrate a standard coordinate attention module into the fusion stage as a supplementary boost. Experiments on the nuScenes dataset demonstrate that our method achieves 69.7% mAP with only 6 training epochs and a mere 3.48% increase in inference parameters, yielding a 1.2% improvement over the baseline which is fully trained for 20 epochs. Extensive experiments in a simulated lunar environment further validate the effectiveness and generalization of our method. Our code will be released through Fusion4CA.
[215] Frequency-Aware Error-Bounded Caching for Accelerating Diffusion Transformers
Guandong Li
Main category: cs.CV
TL;DR: SpectralCache: A unified caching framework for Diffusion Transformers that accelerates inference by exploiting temporal, depth, and feature non-uniformities in the denoising process.
Details
Motivation: Diffusion Transformers (DiTs) are computationally expensive during inference due to iterative denoising. Existing caching methods treat the denoising process as uniform across time, depth, and features, failing to capture the actual non-uniformities in the process.Method: Proposes SpectralCache with three components: 1) Timestep-Aware Dynamic Scheduling (TADS) for temporal non-uniformity, 2) Cumulative Error Budgets (CEB) for depth-wise error propagation, and 3) Frequency-Decomposed Caching (FDC) for feature heterogeneity. The approach is training-free and plug-and-play.
Result: Achieves 2.46x speedup on FLUX.1-schnell at 512x512 resolution with LPIPS 0.217 and SSIM 0.727, outperforming TeaCache (2.12x speedup) by 16% while maintaining comparable quality (LPIPS difference < 1%).
Conclusion: SpectralCache effectively exploits the non-uniform nature of DiT denoising across temporal, depth, and feature dimensions to achieve significant inference acceleration while maintaining generation quality.
Abstract: Diffusion Transformers (DiTs) have emerged as the dominant architecture for high-quality image and video generation, yet their iterative denoising process incurs substantial computational cost during inference. Existing caching methods accelerate DiTs by reusing intermediate computations across timesteps, but they share a common limitation: treating the denoising process as uniform across time,depth, and feature dimensions. In this work, we identify three orthogonal axes of non-uniformity in DiT denoising: (1) temporal – sensitivity to caching errors varies dramatically across the denoising trajectory; (2) depth – consecutive caching decisions lead to cascading approximation errors; and (3) feature – different components of the hidden state exhibit heterogeneous temporal dynamics. Based on these observations, we propose SpectralCache, a unified caching framework comprising Timestep-Aware Dynamic Scheduling (TADS), Cumulative Error Budgets (CEB), and Frequency-Decomposed Caching (FDC). On FLUX.1-schnell at 512x512 resolution, SpectralCache achieves 2.46x speedup with LPIPS 0.217 and SSIM 0.727, outperforming TeaCache (2.12x, LPIPS 0.215, SSIM 0.734) by 16% in speed while maintaining comparable quality (LPIPS difference < 1%). Our approach is training-free, plug-and-play, and compatible with existing DiT architectures.
[216] Dark3R: Learning Structure from Motion in the Dark
Andrew Y Guo, Anagh Malik, SaiKiran Tedla, Yutong Dai, Yiqian Qin, Zach Salehe, Benjamin Attal, Sotiris Nousias, Kyros Kutulakos, David B. Lindell
Main category: cs.CV
TL;DR: Dark3R is a framework for structure from motion in extreme low-light conditions using raw images with SNR below -4dB, leveraging 3D foundation models via teacher-student distillation and achieving SOTA results in both 3D reconstruction and novel view synthesis.
Details
Motivation: Conventional feature- and learning-based methods for structure from motion break down in extreme low-light conditions with signal-to-noise ratios below -4dB, creating a need for robust 3D reconstruction capabilities in dark environments.Method: Dark3R adapts large-scale 3D foundation models to extreme low-light conditions through teacher-student distillation, trained on noisy-clean raw image pairs without 3D supervision. It uses a Poisson-Gaussian noise model for synthetic training data and introduces a new exposure-bracketed dataset with ~42,000 multi-view raw images.
Result: Dark3R achieves state-of-the-art structure from motion in the low-SNR regime and demonstrates SOTA novel view synthesis in the dark using predicted poses with a coarse-to-fine radiance field optimization procedure.
Conclusion: Dark3R successfully enables robust 3D reconstruction and novel view synthesis in extreme low-light conditions by adapting 3D foundation models through distillation, requiring only noisy-clean image pairs for training.
Abstract: We introduce Dark3R, a framework for structure from motion in the dark that operates directly on raw images with signal-to-noise ratios (SNRs) below $-4$ dB – a regime where conventional feature- and learning-based methods break down. Our key insight is to adapt large-scale 3D foundation models to extreme low-light conditions through a teacher–student distillation process, enabling robust feature matching and camera pose estimation in low light. Dark3R requires no 3D supervision; it is trained solely on noisy–clean raw image pairs, which can be either captured directly or synthesized using a simple Poisson–Gaussian noise model applied to well-exposed raw images. To train and evaluate our approach, we introduce a new, exposure-bracketed dataset that includes $\sim$42,000 multi-view raw images with ground-truth 3D annotations, and we demonstrate that Dark3R achieves state-of-the-art structure from motion in the low-SNR regime. Further, we demonstrate state-of-the-art novel view synthesis in the dark using Dark3R’s predicted poses and a coarse-to-fine radiance field optimization procedure.
[217] ORMOT: A Dataset and Framework for Omnidirectional Referring Multi-Object Tracking
Sijia Chen, Zihan Zhou, Yanqiu Yu, En Yu, Wenbing Tao
Main category: cs.CV
TL;DR: Proposes ORMOT - Omnidirectional Referring Multi-Object Tracking that extends RMOT to 360° imagery to overcome field-of-view limitations, with new dataset ORSet and LVLM-based framework ORTrack.
Details
Motivation: Existing Referring Multi-Object Tracking (RMOT) methods are limited by conventional cameras' narrow field of view, causing targets to move out of frame and leading to fragmented tracking. There's a need to overcome these limitations for better understanding of long-horizon language descriptions.Method: 1) Proposes ORMOT task extending RMOT to omnidirectional imagery; 2) Constructs ORSet dataset with 27 omnidirectional scenes, 848 language descriptions, and 3,401 annotated objects; 3) Develops ORTrack, a Large Vision-Language Model-driven framework specifically designed for omnidirectional referring multi-object tracking.
Result: Extensive experiments on ORSet dataset demonstrate the effectiveness of the ORTrack framework. The dataset and code will be open-sourced for community use.
Conclusion: ORMOT addresses field-of-view limitations in traditional RMOT by leveraging omnidirectional imagery, enabling better tracking of objects described by language queries in 360° scenes through the proposed ORTrack framework and ORSet dataset.
Abstract: Multi-Object Tracking (MOT) is a fundamental task in computer vision, aiming to track targets across video frames. Existing MOT methods perform well in general visual scenes, but face significant challenges and limitations when extended to visual-language settings. To bridge this gap, the task of Referring Multi-Object Tracking (RMOT) has recently been proposed, which aims to track objects that correspond to language descriptions. However, current RMOT methods are primarily developed on datasets captured by conventional cameras, which suffer from limited field of view. This constraint often causes targets to move out of the frame, leading to fragmented tracking and loss of contextual information. In this work, we propose a novel task, called Omnidirectional Referring Multi-Object Tracking (ORMOT), which extends RMOT to omnidirectional imagery, aiming to overcome the field-of-view (FoV) limitation of conventional datasets and improve the model’s ability to understand long-horizon language descriptions. To advance the ORMOT task, we construct ORSet, an Omnidirectional Referring Multi-Object Tracking dataset, which contains 27 diverse omnidirectional scenes, 848 language descriptions, and 3,401 annotated objects, providing rich visual, temporal, and language information. Furthermore, we propose ORTrack, a Large Vision-Language Model (LVLM)-driven framework tailored for Omnidirectional Referring Multi-Object Tracking. Extensive experiments on the ORSet dataset demonstrate the effectiveness of our ORTrack framework. The dataset and code will be open-sourced at https://github.com/chen-si-jia/ORMOT.
[218] Fusion-CAM: Integrating Gradient and Region-Based Class Activation Maps for Robust Visual Explanations
Hajar Dekdegue, Moncef Garouani, Josiane Mothe, Jordan Bernigaud
Main category: cs.CV
TL;DR: Fusion-CAM: A novel XAI framework that combines gradient-based and region-based CAM methods through adaptive fusion to produce more robust and discriminative visual explanations for deep neural networks.
Details
Motivation: Existing CAM methods have limitations - gradient-based approaches (like Grad-CAM) provide fine-grained details but are noisy and incomplete, while region-based approaches (like Score-CAM) capture broader coverage but suffer from over-smoothing and reduced sensitivity to subtle features. There's a need to bridge this explanatory gap.Method: Fusion-CAM unifies both paradigms through a dedicated fusion mechanism: 1) denoises gradient-based maps for cleaner activations, 2) combines refined gradient maps with region-based maps using contribution weights to enhance class coverage, and 3) uses adaptive similarity-based pixel-level fusion that evaluates agreement between paradigms and dynamically adjusts fusion strength.
Result: Extensive experiments on standard benchmarks show that Fusion-CAM consistently outperforms existing CAM variants in both qualitative visualization and quantitative evaluation.
Conclusion: Fusion-CAM provides a robust and flexible tool for interpreting deep neural networks by producing richer, context-aware, and input-adaptive visual explanations that combine the strengths of both gradient-based and region-based approaches.
Abstract: Interpreting the decision-making process of deep convolutional neural networks remains a central challenge in achieving trustworthy and transparent artificial intelligence. Explainable AI (XAI) techniques, particularly Class Activation Map (CAM) methods, are widely adopted to visualize the input regions influencing model predictions. Gradient-based approaches (e.g. Grad-CAM) provide highly discriminative, fine-grained details by computing gradients of class activations but often yield noisy and incomplete maps that emphasize only the most salient regions rather than the complete objects. Region-based approaches (e.g. Score-CAM) aggregate information over larger areas, capturing broader object coverage at the cost of over-smoothing and reduced sensitivity to subtle features. We introduce Fusion-CAM, a novel framework that bridges this explanatory gap by unifying both paradigms through a dedicated fusion mechanism to produce robust and highly discriminative visual explanations. Our method first denoises gradient-based maps, yielding cleaner and more focused activations. It then combines the refined gradient map with region-based maps using contribution weights to enhance class coverage. Finally, we propose an adaptive similarity-based pixel-level fusion that evaluates the agreement between both paradigms and dynamically adjusts the fusion strength. This adaptive mechanism reinforces consistent activations while softly blending conflicting regions, resulting in richer, context-aware, and input-adaptive visual explanations. Extensive experiments on standard benchmarks show that Fusion-CAM consistently outperforms existing CAM variants in both qualitative visualization and quantitative evaluation, providing a robust and flexible tool for interpreting deep neural networks.
[219] Video-based Locomotion Analysis for Fish Health Monitoring
Timon Palm, Clemens Seibold, Anna Hilsmann, Peter Eisert
Main category: cs.CV
TL;DR: A computer vision system using YOLOv11-based multi-object tracking to estimate fish locomotion activities from videos for health monitoring in aquaculture.
Details
Motivation: Monitoring fish health is crucial for early disease detection, animal welfare, and sustainable aquaculture. Fish locomotion activities can indicate physiological and pathological conditions, making automated analysis valuable.Method: Uses YOLOv11 detector embedded in a tracking-by-detection framework for multi-object tracking. Investigates various YOLOv11 architecture configurations and extensions incorporating multiple frames to improve detection accuracy.
Result: System evaluated on manually annotated dataset of Sulawesi ricefish in home-aquarium setup, demonstrating reliable measurement of swimming direction and speed for health monitoring. Dataset will be made publicly available.
Conclusion: The proposed system effectively estimates fish locomotion activities from videos, providing a tool for automated fish health monitoring in aquaculture settings.
Abstract: Monitoring the health conditions of fish is essential, as it enables the early detection of disease, safeguards animal welfare, and contributes to sustainable aquaculture practices. Physiological and pathological conditions of cultivated fish can be inferred by analyzing locomotion activities. In this paper, we present a system that estimates the locomotion activities from videos using multi object tracking. The core of our approach is a YOLOv11 detector embedded in a tracking-by-detection framework. We investigate various configurations of the YOLOv11-architecture as well as extensions that incorporate multiple frames to improve detection accuracy. Our system is evaluated on a manually annotated dataset of Sulawesi ricefish recorded in a home-aquarium-like setup, demonstrating its ability to reliably measure swimming direction and speed for fish health monitoring. The dataset will be made publicly available upon publication.
[220] MobileFetalCLIP: Selective Repulsive Knowledge Distillation for Mobile Fetal Ultrasound Analysis
Numan Saeed, Fadillah Adamsyah Maani, Mohammad Yaqub
Main category: cs.CV
TL;DR: Selective Repulsive Knowledge Distillation enables efficient fetal ultrasound AI by distilling large 304M-parameter models into compact 11.4M-parameter versions that outperform teachers on mobile devices.
Details
Motivation: Current fetal ultrasound foundation models are too large (300M+ parameters) for deployment on point-of-care devices in low-resource settings, and standard knowledge distillation fails under extreme capacity gaps (~26x).Method: Selective Repulsive Knowledge Distillation decomposes contrastive KD into diagonal and off-diagonal components, preserving matched pair alignment while decaying off-diagonal weights into negative values to repel students from teacher’s inter-class confusions.
Result: The 11.4M parameter student surpasses the 304M-parameter FetalCLIP teacher on zero-shot HC18 biometry validity (88.6% vs. 83.5%) and brain sub-plane F1 (0.784 vs. 0.702), running at 1.6 ms on iPhone 16 Pro.
Conclusion: The method enables real-time assistive AI on handheld ultrasound devices for prenatal care in low-resource settings through efficient model compression.
Abstract: Fetal ultrasound AI could transform prenatal care in low-resource settings, yet current foundation models exceed 300M visual parameters, precluding deployment on point-of-care devices. Standard knowledge distillation fails under such extreme capacity gaps (~26x), as compact students waste capacity mimicking architectural artifacts of oversized teachers. We introduce Selective Repulsive Knowledge Distillation, which decomposes contrastive KD into diagonal and off-diagonal components: matched pair alignment is preserved while the off-diagonal weight decays into negative values, repelling the student from the teacher’s inter-class confusions and forcing discovery of architecturally native features. Our 11.4M parameter student surpasses the 304M-parameter FetalCLIP teacher on zero-shot HC18 biometry validity (88.6% vs. 83.5%) and brain sub-plane F1 (0.784 vs. 0.702), while running at 1.6 ms on iPhone 16 Pro, enabling real-time assistive AI on handheld ultrasound devices. Our code, models, and app are publicly available at https://github.com/numanai/MobileFetalCLIP.
[221] RelaxFlow: Text-Driven Amodal 3D Generation
Jiayin Zhu, Guoji Fu, Xiaolu Liu, Qiyuan He, Yicong Li, Angela Yao
Main category: cs.CV
TL;DR: RelaxFlow is a training-free dual-branch framework for text-driven amodal 3D generation that decouples control granularity to handle occluded regions while preserving observed parts.
Details
Motivation: Image-to-3D generation faces semantic ambiguity under occlusion where partial observation alone is insufficient to determine object category. There's a need for text-driven amodal 3D generation where text prompts steer completion of unseen regions while strictly preserving input observation.Method: Proposes RelaxFlow, a training-free dual-branch framework with Multi-Prior Consensus Module and Relaxation Mechanism. It decouples control granularity: rigid control for observation vs relaxed structural control for text prompts. Theoretically, relaxation applies a low-pass filter on generative vector field to suppress high-frequency instance details and isolate geometric structure.
Result: Extensive experiments demonstrate RelaxFlow successfully steers generation of unseen regions to match prompt intent without compromising visual fidelity. Introduces two diagnostic benchmarks: ExtremeOcc-3D and AmbiSem-3D for evaluation.
Conclusion: RelaxFlow effectively addresses semantic ambiguity in occluded 3D generation by separating control granularities, enabling text-driven completion of unseen regions while preserving observed parts.
Abstract: Image-to-3D generation faces inherent semantic ambiguity under occlusion, where partial observation alone is often insufficient to determine object category. In this work, we formalize text-driven amodal 3D generation, where text prompts steer the completion of unseen regions while strictly preserving input observation. Crucially, we identify that these objectives demand distinct control granularities: rigid control for the observation versus relaxed structural control for the prompt. To this end, we propose RelaxFlow, a training-free dual-branch framework that decouples control granularity via a Multi-Prior Consensus Module and a Relaxation Mechanism. Theoretically, we prove that our relaxation is equivalent to applying a low-pass filter on the generative vector field, which suppresses high-frequency instance details to isolate geometric structure that accommodates the observation. To facilitate evaluation, we introduce two diagnostic benchmarks, ExtremeOcc-3D and AmbiSem-3D. Extensive experiments demonstrate that RelaxFlow successfully steers the generation of unseen regions to match the prompt intent without compromising visual fidelity.
[222] SAIL: Similarity-Aware Guidance and Inter-Caption Augmentation-based Learning for Weakly-Supervised Dense Video Captioning
Ye-Chan Kim, SeungJu Cha, Si-Woo Kim, Minju Jeon, Hyungee Kim, Dong-Jin Kim
Main category: cs.CV
TL;DR: SAIL introduces semantically-aware masks via cross-modal alignment and LLM-based caption augmentation for weakly-supervised dense video captioning, achieving SOTA on ActivityNet and YouCook2.
Details
Motivation: Existing weakly-supervised dense video captioning methods generate simplistic, uniformly distributed masks without considering semantic relationships to events, and suffer from sparse caption annotations in datasets.Method: Proposes SAIL with similarity-aware training to guide masks toward video regions with high similarity to event captions, plus LLM-based augmentation to generate synthetic captions for additional alignment signals via inter-mask mechanism.
Result: State-of-the-art performance on both captioning and localization metrics on ActivityNet Captions and YouCook2 datasets.
Conclusion: Semantically-aware masks through cross-modal alignment and synthetic caption augmentation significantly improve weakly-supervised dense video captioning performance.
Abstract: Weakly-Supervised Dense Video Captioning aims to localize and describe events in videos trained only on caption annotations, without temporal boundaries. Prior work introduced an implicit supervision paradigm based on Gaussian masking and complementary captioning. However, existing method focuses merely on generating non-overlapping masks without considering their semantic relationship to corresponding events, resulting in simplistic, uniformly distributed masks that fail to capture semantically meaningful regions. Moreover, relying solely on ground-truth captions leads to sub-optimal performance due to the inherent sparsity of existing datasets. In this work, we propose SAIL, which constructs semantically-aware masks through cross-modal alignment. Our similarity aware training objective guides masks to emphasize video regions with high similarity to their corresponding event captions. Furthermore, to guide more accurate mask generation under sparse annotation settings, we introduce an LLM-based augmentation strategy that generates synthetic captions to provide additional alignment signals. These synthetic captions are incorporated through an inter-mask mechanism, providing auxiliary guidance for precise temporal localization without degrading the main objective. Experiments on ActivityNet Captions and YouCook2 demonstrate state-of-the-art performance on both captioning and localization metrics.
[223] Planning in 8 Tokens: A Compact Discrete Tokenizer for Latent World Model
Dongwon Kim, Gawon Seo, Jinsung Lee, Minsu Cho, Suha Kwak
Main category: cs.CV
TL;DR: CompACT is a discrete tokenizer that compresses observations into as few as 8 tokens for efficient world model planning, achieving competitive performance with orders-of-magnitude faster planning.
Details
Motivation: World models are computationally expensive for real-time planning due to conventional tokenizers encoding observations into hundreds of tokens, making planning slow and resource-intensive.Method: Proposes CompACT, a discrete tokenizer that compresses each observation into as few as 8 tokens, combined with an action-conditioned world model for efficient planning.
Result: Achieves competitive planning performance with orders-of-magnitude faster planning compared to conventional approaches, enabling practical real-world deployment.
Conclusion: CompACT offers a practical solution for efficient world model planning by drastically reducing computational costs while preserving essential information for decision-making.
Abstract: World models provide a powerful framework for simulating environment dynamics conditioned on actions or instructions, enabling downstream tasks such as action planning or policy learning. Recent approaches leverage world models as learned simulators, but its application to decision-time planning remains computationally prohibitive for real-time control. A key bottleneck lies in latent representations: conventional tokenizers encode each observation into hundreds of tokens, making planning both slow and resource-intensive. To address this, we propose CompACT, a discrete tokenizer that compresses each observation into as few as 8 tokens, drastically reducing computational cost while preserving essential information for planning. An action-conditioned world model that occupies CompACT tokenizer achieves competitive planning performance with orders-of-magnitude faster planning, offering a practical step toward real-world deployment of world models.
[224] NaiLIA: Multimodal Nail Design Retrieval Based on Dense Intent Descriptions and Palette Queries
Kanon Amemiya, Daichi Yashima, Kei Katsumata, Takumi Komatsu, Ryosuke Korekata, Seitaro Otsuki, Komei Sugiura
Main category: cs.CV
TL;DR: NaiLIA is a multimodal retrieval method for nail design images that aligns with dense intent descriptions and color palette queries, addressing limitations of existing vision-language models.
Details
Motivation: Retrieving nail design images based on detailed user intent descriptions is challenging because descriptions specify painted elements, embellishments, visual characteristics, themes, and overall impressions. Existing vision-language models struggle with such dense descriptions and color palette queries.Method: Proposes NaiLIA, a multimodal retrieval method that comprehensively aligns with dense intent descriptions and palette queries. Introduces a relaxed loss based on confidence scores for unlabeled images to align with descriptions.
Result: Experimental results on a benchmark of 10,625 images annotated with dense intent descriptions show that NaiLIA outperforms standard methods.
Conclusion: NaiLIA effectively addresses the challenge of retrieving nail design images based on detailed user intent descriptions and color palette queries, demonstrating superior performance over existing approaches.
Abstract: We focus on the task of retrieving nail design images based on dense intent descriptions, which represent multi-layered user intent for nail designs. This is challenging because such descriptions specify unconstrained painted elements and pre-manufactured embellishments as well as visual characteristics, themes, and overall impressions. In addition to these descriptions, we assume that users provide palette queries by specifying zero or more colors via a color picker, enabling the expression of subtle and continuous color nuances. Existing vision-language foundation models often struggle to incorporate such descriptions and palettes. To address this, we propose NaiLIA, a multimodal retrieval method for nail design images, which comprehensively aligns with dense intent descriptions and palette queries during retrieval. Our approach introduces a relaxed loss based on confidence scores for unlabeled images that can align with the descriptions. To evaluate NaiLIA, we constructed a benchmark consisting of 10,625 images collected from people with diverse cultural backgrounds. The images were annotated with long and dense intent descriptions given by over 200 annotators. Experimental results demonstrate that NaiLIA outperforms standard methods.
[225] RealWonder: Real-Time Physical Action-Conditioned Video Generation
Wei Liu, Ziyu Chen, Zizhang Li, Yue Wang, Hong-Xing Yu, Jiajun Wu
Main category: cs.CV
TL;DR: RealWonder is a real-time system for action-conditioned video generation from a single image that uses physics simulation as an intermediate bridge to enable video models to understand and simulate physical consequences of 3D actions.
Details
Motivation: Current video generation models lack structural understanding of how actions affect 3D scenes and cannot simulate physical consequences like forces and robotic manipulations. There's a need for systems that can generate videos conditioned on physical actions in real-time.Method: Three-component system: 1) 3D reconstruction from single images, 2) physics simulation that translates continuous actions into visual representations (optical flow and RGB), 3) distilled video generator requiring only 4 diffusion steps for real-time performance.
Result: Achieves 13.2 FPS at 480x832 resolution, enabling interactive exploration of forces, robot actions, and camera controls on various materials (rigid objects, deformable bodies, fluids, granular materials).
Conclusion: RealWonder opens new opportunities for applying video models in immersive experiences, AR/VR, and robot learning by enabling real-time, physics-aware video generation from single images.
Abstract: Current video generation models cannot simulate physical consequences of 3D actions like forces and robotic manipulations, as they lack structural understanding of how actions affect 3D scenes. We present RealWonder, the first real-time system for action-conditioned video generation from a single image. Our key insight is using physics simulation as an intermediate bridge: instead of directly encoding continuous actions, we translate them through physics simulation into visual representations (optical flow and RGB) that video models can process. RealWonder integrates three components: 3D reconstruction from single images, physics simulation, and a distilled video generator requiring only 4 diffusion steps. Our system achieves 13.2 FPS at 480x832 resolution, enabling interactive exploration of forces, robot actions, and camera controls on rigid objects, deformable bodies, fluids, and granular materials. We envision RealWonder opens new opportunities to apply video models in immersive experiences, AR/VR, and robot learning. Our code and model weights are publicly available in our project website: https://liuwei283.github.io/RealWonder/
[226] Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes
Pengxiang Li, Joey Tsai, Hongwei Xue, Kunyu Shi, Shilin Yan
Main category: cs.CV
TL;DR: LSP scheduler improves diffusion language model inference speed by 3.4x through contiguous prefix absorption instead of scattered token acceptance, fixing KV cache fragmentation issues.
Details
Motivation: Diffusion Language Models (DLMs) have theoretical parallelism but suffer from practical inference bottlenecks due to suboptimal decoding schedulers. Standard 'scattered acceptance' approaches fracture the KV cache, destroy memory locality, and force costly repeated repairs across unstable token boundaries.Method: Proposes Longest Stable Prefix (LSP) scheduler - a training-free, model-agnostic inference paradigm based on monolithic prefix absorption. In each denoising step, LSP evaluates token stability via single forward pass, identifies contiguous left-aligned block of stable predictions, and snaps boundary to natural linguistic/structural delimiters before atomic commitment.
Result: Extensive evaluations on LLaDA-8B and Dream-7B show LSP accelerates inference by up to 3.4x across mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing while matching or slightly improving output quality.
Conclusion: LSP bridges the gap between theoretical parallelism of DLMs and practical hardware efficiency by fundamentally restructuring commitment topology, converting fragmented KV cache updates into efficient contiguous appends and preserving bidirectional lookahead.
Abstract: Diffusion Language Models (DLMs) promise highly parallel text generation, yet their practical inference speed is often bottlenecked by suboptimal decoding schedulers. Standard approaches rely on ‘scattered acceptance’-committing high confidence tokens at disjoint positions throughout the sequence. This approach inadvertently fractures the Key-Value (KV) cache, destroys memory locality, and forces the model into costly, repeated repairs across unstable token boundaries. To resolve this, we present the Longest Stable Prefix (LSP) scheduler, a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. In each denoising step, LSP evaluates token stability via a single forward pass, dynamically identifies a contiguous left-aligned block of stable predictions, and snaps its boundary to natural linguistic or structural delimiters before an atomic commitment. This prefix-first topology yields dual benefits: systemically, it converts fragmented KV cache updates into efficient, contiguous appends; algorithmically, it preserves bidirectional lookahead over a geometrically shrinking active suffix, drastically reducing token flip rates and denoiser calls. Extensive evaluations on LLaDA-8B and Dream-7B demonstrate that LSP accelerates inference by up to 3.4x across rigorous benchmarks including mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing while matching or slightly improving output quality. By fundamentally restructuring the commitment topology, LSP bridges the gap between the theoretical parallelism of DLMs and practical hardware efficiency.
[227] EdgeDAM: Real-time Object Tracking for Mobile Devices
Syed Muhammad Raza, Syed Murtaza Hussain Abidi, Khawar Islam, Muhammad Ibrahim, Ajmal Saeed Mian
Main category: cs.CV
TL;DR: EdgeDAM: A lightweight detection-guided tracking framework with distractor-aware memory for single-object tracking on edge devices, achieving real-time performance with improved robustness to distractors and occlusion.
Details
Motivation: Current SOT methods face trade-offs: segmentation-based trackers with distractor-aware memory are computationally heavy for edge devices, while lightweight trackers are prone to drift when distractors appear. Need for efficient, robust tracking on resource-constrained hardware.Method: Proposes EdgeDAM with two key strategies: 1) Dual-Buffer Distractor-Aware Memory (DAM) with Recent-Aware Memory for consistent target hypotheses and Distractor-Resolving Memory for hard negative candidates; 2) Confidence-Driven Switching with Held-Box Stabilization for adaptive detection/memory activation during occlusion.
Result: Achieves 88.2% accuracy on DiDi dataset (distractor-focused) and 25 FPS on iPhone 15, demonstrating improved robustness under occlusion and fast motion while maintaining real-time edge performance.
Conclusion: EdgeDAM successfully bridges the gap between robustness and efficiency for single-object tracking on edge devices by reformulating distractor-aware memory for bounding-box tracking under strict computational constraints.
Abstract: Single-object tracking (SOT) on edge devices is a critical computer vision task, requiring accurate and continuous target localization across video frames under occlusion, distractor interference, and fast motion. However, recent state-of-the-art distractor-aware memory mechanisms are largely built on segmentation-based trackers and rely on mask prediction and attention-driven memory updates, which introduce substantial computational overhead and limit real-time deployment on resource-constrained hardware; meanwhile, lightweight trackers sustain high throughput but are prone to drift when visually similar distractors appear. To address these challenges, we propose EdgeDAM, a lightweight detection-guided tracking framework that reformulates distractor-aware memory for bounding-box tracking under strict edge constraints. EdgeDAM introduces two key strategies: (1) Dual-Buffer Distractor-Aware Memory (DAM), which integrates a Recent-Aware Memory to preserve temporally consistent target hypotheses and a Distractor-Resolving Memory to explicitly store hard negative candidates and penalize their re-selection during recovery; and (2) Confidence-Driven Switching with Held-Box Stabilization, where tracker reliability and temporal consistency criteria adaptively activate detection and memory-guided re-identification during occlusion, while a held-box mechanism temporarily freezes and expands the estimate to suppress distractor contamination. Extensive experiments on five benchmarks, including the distractor-focused DiDi dataset, demonstrate improved robustness under occlusion and fast motion while maintaining real-time performance on mobile devices, achieving 88.2% accuracy on DiDi and 25 FPS on an iPhone 15. Code will be released.
[228] HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token
Sai Akhil Kogilathota, Sripadha Vallabha E G, Luzhe Sun, Jiawei Zhou
Main category: cs.CV
TL;DR: VLMs can predict hallucination risk before text generation by probing internal representations, achieving high detection performance across diverse models and architectures.
Details
Motivation: Existing hallucination detection methods for vision-language models operate after text generation, making intervention costly and untimely. The paper investigates whether hallucination risk can be predicted before any token is generated.Method: Examine three families of internal representations across eight modern VLMs: (1) visual-only features without multimodal fusion, (2) vision-token representations within text decoder, and (3) query-token representations integrating visual and textual information before generation. Train lightweight probes on these representations.
Result: Probes achieve strong hallucination-detection performance without decoding, reaching up to 0.93 AUROC on models like Gemma-3-12B, Phi-4-VL 5.6B, and Molmo 7B. Late query-token states are most predictive for most models, while visual or mid-layer features dominate in some architectures.
Conclusion: Hallucination risk is detectable pre-generation, the most informative layer and modality vary across architectures, and lightweight probes can enable early abstention, selective routing, and adaptive decoding to improve safety and efficiency.
Abstract: Hallucinations remain a persistent challenge for vision-language models (VLMs), which often describe nonexistent objects or fabricate facts. Existing detection methods typically operate after text generation, making intervention both costly and untimely. We investigate whether hallucination risk can instead be predicted before any token is generated by probing a model’s internal representations in a single forward pass. Across a diverse set of vision-language tasks and eight modern VLMs, including Llama-3.2-Vision, Gemma-3, Phi-4-VL, and Qwen2.5-VL, we examine three families of internal representations: (i) visual-only features without multimodal fusion, (ii) vision-token representations within the text decoder, and (iii) query-token representations that integrate visual and textual information before generation. Probes trained on these representations achieve strong hallucination-detection performance without decoding, reaching up to 0.93 AUROC on Gemma-3-12B, Phi-4-VL 5.6B, and Molmo 7B. Late query-token states are the most predictive for most models, while visual or mid-layer features dominate in a few architectures (e.g., ~0.79 AUROC for Qwen2.5-VL-7B using visual-only features). These results demonstrate that (1) hallucination risk is detectable pre-generation, (2) the most informative layer and modality vary across architectures, and (3) lightweight probes have the potential to enable early abstention, selective routing, and adaptive decoding to improve both safety and efficiency.
[229] Towards 3D Scene Understanding of Gas Plumes in LWIR Hyperspectral Images Using Neural Radiance Fields
Scout Jarman, Zigfried Hampel-Arias, Adra Carr, Kevin R. Moon
Main category: cs.CV
TL;DR: Neural radiance fields (NeRFs) adapted for longwave infrared hyperspectral imaging to enable 3D scene reconstruction and gas plume detection from sparse multi-view data
Details
Motivation: Hyperspectral images (HSI) have applications in environmental monitoring and national security, but often only a few images are available. Combining information from multiple images into a cohesive 3D representation could enhance analysis of scene geometry and spectral properties for tasks like gas plume detection.Method: Built on Mip-NeRF architecture, combining state-of-the-art methods for hyperspectral NeRFs and sparse-view NeRFs with a novel adaptive weighted MSE loss. Uses synthetic multi-view LWIR HSI dataset generated with DIRSIG software suite featuring sulfur hexafluoride gas plumes.
Result: Method requires ~50% fewer training images than standard Mip-NeRF, achieves average PSNR of 39.8 dB with as few as 30 training images. Gas plume detection using adaptive coherence estimator on NeRF-rendered images achieves average AUC of 0.821 compared to ground-truth detection masks.
Conclusion: NeRFs can successfully create 3D scene reconstructions from LWIR HSI and enable downstream analysis tasks like gas plume detection, demonstrating potential for combining sparse hyperspectral data into cohesive 3D representations.
Abstract: Hyperspectral images (HSI) have many applications, ranging from environmental monitoring to national security, and can be used for material detection and identification. Longwave infrared (LWIR) HSI can be used for gas plume detection and analysis. Oftentimes, only a few images of a scene of interest are available and are analyzed individually. The ability to combine information from multiple images into a single, cohesive representation could enhance analysis by providing more context on the scene’s geometry and spectral properties. Neural radiance fields (NeRFs) create a latent neural representation of volumetric scene properties that enable novel-view rendering and geometry reconstruction, offering a promising avenue for hyperspectral 3D scene reconstruction. We explore the possibility of using NeRFs to create 3D scene reconstructions from LWIR HSI and demonstrate that the model can be used for the basic downstream analysis task of gas plume detection. The physics-based DIRSIG software suite was used to generate a synthetic multi-view LWIR HSI dataset of a simple facility with a strong sulfur hexafluoride gas plume. Our method, built on the standard Mip-NeRF architecture, combines state-of-the-art methods for hyperspectral NeRFs and sparse-view NeRFs, along with a novel adaptive weighted MSE loss. Our final NeRF method requires around 50% fewer training images than the standard Mip-NeRF and achieves an average PSNR of 39.8 dB with as few as 30 training images. Gas plume detection applied to NeRF-rendered test images using the adaptive coherence estimator achieves an average AUC of 0.821 when compared with detection masks generated from ground-truth test images.
[230] Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline
Guo Chen, Lidong Lu, Yicheng Liu, Liangrui Dong, Lidong Zou, Jixin Lv, Zhenquan Li, Xinyi Mao, Baoqi Pei, Shihao Wang, Zhiqi Li, Karan Sapra, Fuxiao Liu, Yin-Dong Zheng, Yifei Huang, Limin Wang, Zhiding Yu, Andrew Tao, Guilin Liu, Tong Lu
Main category: cs.CV
TL;DR: MM-Lifelong dataset for multimodal lifelong understanding with 181.1 hours of footage across Day/Week/Month scales, revealing MLLM memory bottlenecks and proposing Recursive Multimodal Agent (ReMA) with dynamic memory management.
Details
Motivation: Existing video datasets use densely concatenated clips that don't reflect natural, unscripted daily life. Need datasets that capture lifelong understanding across varying temporal densities.Method: Introduce MM-Lifelong dataset with 181.1 hours structured across Day/Week/Month scales. Propose Recursive Multimodal Agent (ReMA) with dynamic memory management to iteratively update recursive belief states.
Result: Identified two critical failure modes: MLLMs suffer from Working Memory Bottleneck due to context saturation, and agentic baselines experience Global Localization Collapse. ReMA significantly outperforms existing methods.
Conclusion: MM-Lifelong provides rigorous foundation for lifelong multimodal understanding research. ReMA’s dynamic memory management addresses key limitations of current approaches for long-term video understanding.
Abstract: While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address this, we propose the Recursive Multimodal Agent (ReMA), which employs dynamic memory management to iteratively update a recursive belief state, significantly outperforming existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research in supervised learning and out-of-distribution generalization.
[231] Accelerating Text-to-Video Generation with Calibrated Sparse Attention
Shai Yehezkel, Shahar Yadin, Noam Elata, Yaron Ostrovsky-Berman, Bahjat Kawar
Main category: cs.CV
TL;DR: CalibAtt accelerates video diffusion models by identifying and skipping negligible attention connections using calibrated sparse attention patterns.
Details
Motivation: Current diffusion models for video generation suffer from slow runtimes due to spatiotemporal attention bottlenecks in large transformer backbones.Method: CalibAtt performs offline calibration to identify stable block-level sparsity and repetition patterns across inputs, then compiles optimized attention operations for each layer, head, and diffusion timestep.
Result: Achieves up to 1.58x end-to-end speedup on Wan 2.1 14B, Mochi 1, and few-step distilled models while maintaining video quality and text-video alignment.
Conclusion: CalibAtt provides an effective training-free acceleration method for video diffusion models by exploiting stable attention sparsity patterns.
Abstract: Recent diffusion models enable high-quality video generation, but suffer from slow runtimes. The large transformer-based backbones used in these models are bottlenecked by spatiotemporal attention. In this paper, we identify that a significant fraction of token-to-token connections consistently yield negligible scores across various inputs, and their patterns often repeat across queries. Thus, the attention computation in these cases can be skipped with little to no effect on the result. This observation continues to hold for connections among local token blocks. Motivated by this, we introduce CalibAtt, a training-free method that accelerates video generation via calibrated sparse attention. CalibAtt performs an offline calibration pass that identifies block-level sparsity and repetition patterns that are stable across inputs, and compiles these patterns into optimized attention operations for each layer, head, and diffusion timestep. At inference time, we compute the selected input-dependent connections densely, and skip the unselected ones in a hardware-efficient manner. Extensive experiments on Wan 2.1 14B, Mochi 1, and few-step distilled models at various resolutions show that CalibAtt achieves up to 1.58x end-to-end speedup, outperforming existing training-free methods while maintaining video generation quality and text-video alignment.
[232] FaceCam: Portrait Video Camera Control via Scale-Aware Conditioning
Weijie Lyu, Ming-Hsuan Yang, Zhixin Shu
Main category: cs.CV
TL;DR: FaceCam generates customizable camera trajectory videos from monocular human portrait videos using a face-tailored scale-aware camera representation without 3D priors.
Details
Motivation: Existing camera control approaches based on large video-generation models often exhibit geometric distortions and visual artifacts on portrait videos due to scale-ambiguous camera representations or 3D reconstruction errors.Method: Proposes a face-tailored scale-aware representation for camera transformations that provides deterministic conditioning without 3D priors. Trains a video generation model on multi-view studio captures and in-the-wild monocular videos, with two camera-control data generation strategies: synthetic camera motion and multi-shot stitching.
Result: Experiments on Ava-256 dataset and diverse in-the-wild videos demonstrate that FaceCam achieves superior performance in camera controllability, visual quality, identity and motion preservation compared to existing methods.
Conclusion: FaceCam effectively generates high-quality portrait videos with customizable camera trajectories while avoiding geometric distortions common in previous approaches, demonstrating the value of face-tailored scale-aware representations.
Abstract: We introduce FaceCam, a system that generates video under customizable camera trajectories for monocular human portrait video input. Recent camera control approaches based on large video-generation models have shown promising progress but often exhibit geometric distortions and visual artifacts on portrait videos due to scale-ambiguous camera representations or 3D reconstruction errors. To overcome these limitations, we propose a face-tailored scale-aware representation for camera transformations that provides deterministic conditioning without relying on 3D priors. We train a video generation model on both multi-view studio captures and in-the-wild monocular videos, and introduce two camera-control data generation strategies: synthetic camera motion and multi-shot stitching, to exploit stationary training cameras while generalizing to dynamic, continuous camera trajectories at inference time. Experiments on Ava-256 dataset and diverse in-the-wild videos demonstrate that FaceCam achieves superior performance in camera controllability, visual quality, identity and motion preservation.
[233] Transformer-Based Inpainting for Real-Time 3D Streaming in Sparse Multi-Camera Setups
Leif Van Holland, Domenic Zingsheim, Mana Takhsha, Hannah Dröge, Patrick Stotko, Markus Plack, Reinhard Klein
Main category: cs.CV
TL;DR: A transformer-based multi-view inpainting method for real-time 3D streaming that fills missing textures in rendered images from limited camera views, achieving best quality-speed trade-off.
Details
Motivation: Limited camera views in real-time 3D streaming for AR/VR lead to missing information and incomplete surfaces. Existing hole-filling heuristics cause inconsistencies and artifacts, requiring a better solution that works with any multi-camera system.Method: Proposes a standalone, application-targeted inpainting method as image-based post-processing. Uses multi-view aware transformer architecture with spatio-temporal embeddings for cross-frame consistency. Features resolution-independent design and adaptive patch selection for real-time performance.
Result: Outperforms state-of-the-art inpainting techniques under real-time constraints, achieving best trade-off between quality and speed in both image and video-based metrics.
Conclusion: The proposed method effectively completes missing textures in 3D streaming, is compatible with any calibrated multi-camera system, and enables real-time performance with superior quality compared to existing approaches.
Abstract: High-quality 3D streaming from multiple cameras is crucial for immersive experiences in many AR/VR applications. The limited number of views - often due to real-time constraints - leads to missing information and incomplete surfaces in the rendered images. Existing approaches typically rely on simple heuristics for the hole filling, which can result in inconsistencies or visual artifacts. We propose to complete the missing textures using a novel, application-targeted inpainting method independent of the underlying representation as an image-based post-processing step after the novel view rendering. The method is designed as a standalone module compatible with any calibrated multi-camera system. For this we introduce a multi-view aware, transformer-based network architecture using spatio-temporal embeddings to ensure consistency across frames while preserving fine details. Additionally, our resolution-independent design allows adaptation to different camera setups, while an adaptive patch selection strategy balances inference speed and quality, allowing real-time performance. We evaluate our approach against state-of-the-art inpainting techniques under the same real-time constraints and demonstrate that our model achieves the best trade-off between quality and speed, outperforming competitors in both image and video-based metrics.
[234] Motion-Aware Animatable Gaussian Avatars Deblurring
Muyao Niu, Yifan Zhan, Qingtian Zhu, Zhuoxiao Li, Wei Wang, Zhihang Zhong, Xiao Sun, Yinqiang Zheng
Main category: cs.CV
TL;DR: A method for reconstructing sharp 3D human Gaussian avatars directly from blurry videos using physics-based motion blur modeling and joint optimization.
Details
Motivation: Existing 3D human avatar creation methods require high-quality sharp images, which are impractical in real-world scenarios due to motion blur from human movement. There's a need for methods that can work with blurry input videos.Method: Proposes a framework with: 1) 3D-aware physics-based model of motion blur formation, 2) 3D human motion model to resolve motion ambiguity, 3) joint optimization of avatar representation and motion parameters from coarse initialization using Gaussian avatars.
Result: Comprehensive evaluation on synthetic and real-world datasets (captured with 360-degree synchronous hybrid-exposure camera system) demonstrates effectiveness across diverse conditions. Code is publicly available.
Conclusion: The method successfully reconstructs sharp 3D human avatars from blurry videos by explicitly modeling motion blur and jointly optimizing avatar and motion parameters, addressing practical limitations of existing approaches.
Abstract: The creation of 3D human avatars from multi-view videos is a significant yet challenging task in computer vision. However, existing techniques rely on high-quality, sharp images as input, which are often impractical to obtain in real-world scenarios due to variations in human motion speed and intensity. This paper introduces a novel method for directly reconstructing sharp 3D human Gaussian avatars from blurry videos. The proposed approach incorporates a 3D-aware, physics-based model of blur formation caused by human motion, together with a 3D human motion model designed to resolve ambiguities in motion-induced blur. This framework enables the joint optimization of the avatar representation and motion parameters from a coarse initialization. Comprehensive benchmarks are established using both a synthetic dataset and a real-world dataset captured with a 360-degree synchronous hybrid-exposure camera system. Extensive evaluations demonstrate the effectiveness of the model across diverse conditions. Codes Available: https://github.com/MyNiuuu/MAD-Avatar
[235] Track Anything Behind Everything: Zero-Shot Amodal Video Object Segmentation
Finlay G. C. Hudson, William A. P. Smith
Main category: cs.CV
TL;DR: TABE is a zero-shot amodal video object segmentation pipeline that uses a single query mask and video diffusion models for generative outpainting without requiring class labels or retraining.
Details
Motivation: Existing amodal video segmentation methods require pretrained class labels, limiting flexibility. The authors aim to create a zero-shot approach that can handle complete occlusions using only a single query mask from the first visible frame.Method: Poses amodal segmentation as generative outpainting from modal masks using a pretrained video diffusion model. Uses test-time fine-tuning to specialize for tracked objects without retraining the diffusion model or adding input channels.
Result: The TABE pipeline successfully handles amodal completion even when objects are completely occluded, demonstrating zero-shot capability with only a single query mask.
Conclusion: TABE provides a flexible, zero-shot approach to amodal video object segmentation that doesn’t require class labels and can handle challenging occlusion scenarios through generative outpainting with video diffusion models.
Abstract: We present Track Anything Behind Everything (TABE), a novel pipeline for zero-shot amodal video object segmentation. Unlike existing methods that require pretrained class labels, our approach uses a single query mask from the first frame where the object is visible, enabling flexible, zero-shot inference. We pose amodal segmentation as generative outpainting from modal (visible) masks using a pretrained video diffusion model. We do not need to re-train the diffusion model to accommodate additional input channels but instead use a pretrained model that we fine-tune at test-time to allow specialisation towards the tracked object. Our TABE pipeline is specifically designed to handle amodal completion, even in scenarios where objects are completely occluded. Our model and code will all be released.
[236] Learnable Sparsity for Vision Generative Models
Yang Zhang, Er Jin, Wenzhong Liang, Yanfei Dong, Ashkan Khakzar, Philip Torr, Johannes Stegmaier, Kenji Kawaguchi
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to draw conclusions due to fetch failure
Abstract: Failed to fetch summary for 2412.02852: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.02852&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[237] Flatness Guided Test-Time Adaptation for Vision-Language Models
Aodi Li, Liansheng Zhuang, Xiao Long, Houqiang Li, Shafei Wang
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2501.18864: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.18864&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[238] 3D Dynamics-Aware Manipulation: Endowing Manipulation Policies with 3D Foresight
Yuxin He, Ruihao Zhang, Xianzu Wu, Zhiyuan Zhang, Cheng Ding, Qiang Nie
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2502.10028: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.10028&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[239] RapidPoseTriangulation: Multi-view Multi-person Whole-body Human Pose Triangulation in a Millisecond
Daniel Bermuth, Alexander Poeppel, Wolfgang Reif
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2503.21692: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.21692&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[240] DAP: A Discrete-token Autoregressive Planner for Autonomous Driving
Bowen Ye, Bin Zhang, Hang Zhao
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API requestMethod: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to determine conclusion due to failed API request
Abstract: Failed to fetch summary for 2511.13306: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.13306&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[241] Noise2Ghost: Self-supervised deep convolutional reconstruction for ghost imaging
Mathieu Manni, Dmitry Karpov, K. Joost Batenburg, Sharon Shwartz, Nicola Viganò
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2504.10288: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.10288&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[242] Collaborative Learning of Local 3D Occupancy Prediction and Versatile Global Occupancy Mapping
Shanshuai Yuan, Julong Wei, Muer Tie, Xiangyun Ren, Zhongxue Gan, Wenchao Ding
Main category: cs.CV
TL;DR: Paper 2504.13596 summary unavailable due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access restrictionsMethod: Unable to determine method due to access restrictions
Result: Unable to determine results due to access restrictions
Conclusion: Unable to determine conclusion due to access restrictions
Abstract: Failed to fetch summary for 2504.13596: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.13596&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[243] PhysLLM: Harnessing Large Language Models for Cross-Modal Remote Physiological Sensing
Yiping Xie, Bo Zhao, Mingtong Dai, Jian-Ping Zhou, Yue Sun, Tao Tan, Weicheng Xie, Linlin Shen, Zitong Yu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper retrievalMethod: Unable to determine method due to failed paper retrieval
Result: Unable to determine results due to failed paper retrieval
Conclusion: Unable to draw conclusions due to failed paper retrieval
Abstract: Failed to fetch summary for 2505.03621: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.03621&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[244] ReactDance: Hierarchical Representation for High-Fidelity and Coherent Long-Form Reactive Dance Generation
Jingzhong Lin, Xinru Li, Yuanyuan Qi, Bohao Zhang, Wenxiang Liu, Kecheng Tang, Wenxuan Huang, Xiangfeng Xu, Bangyan Li, Changbo Wang, Gaoqi He
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2505.05589: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.05589&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[245] RESAR-BEV: An Explainable Progressive Residual Autoregressive Approach for Camera-Radar Fusion in BEV Segmentation
Zhiwen Zeng, Yunfei Yin, Zheng Yuan, Argho Dey, Xianjian Bao
Main category: cs.CV
TL;DR: Paper 2505.06515: Unable to fetch abstract due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to missing abstractMethod: Cannot determine method due to missing abstract
Result: Cannot determine results due to missing abstract
Conclusion: Cannot determine conclusion due to missing abstract
Abstract: Failed to fetch summary for 2505.06515: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.06515&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[246] OSPO: Object-Centric Self-Improving Preference Optimization for Text-to-Image Generation
Yoonjin Oh, Yongjin Kim, Hyomin Kim, Donghwan Chi, Sungwoong Kim
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: Unable to determine method due to API rate limiting preventing access to paper content
Result: Unable to determine results due to API rate limiting preventing access to paper content
Conclusion: Unable to draw conclusions due to API rate limiting preventing access to paper content
Abstract: Failed to fetch summary for 2506.02015: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.02015&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[247] EDITOR: Effective and Interpretable Prompt Inversion for Text-to-Image Diffusion Models
Mingzhe Li, Kejing Xia, Gehao Zhang, Zhenting Wang, Guanhong Tao, Siqi Pan, Juan Zhai, Shiqing Ma
Main category: cs.CV
TL;DR: Unable to analyze paper 2506.03067 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation without access to paper abstractMethod: Cannot determine method without access to paper abstract
Result: Cannot determine results without access to paper abstract
Conclusion: Cannot draw conclusions without access to paper abstract
Abstract: Failed to fetch summary for 2506.03067: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.03067&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[248] HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition
Suhan Woo, Seongwon Lee, Jinwoo Jang, Euntai Kim
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2506.04764: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.04764&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[249] FLAIR-HUB: Large-scale Multimodal Dataset for Land Cover and Crop Mapping
Anatol Garioud, Sébastien Giordano, Nicolas David, Nicolas Gonthier
Main category: cs.CV
TL;DR: Unable to analyze paper 2506.07080 due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Cannot determine motivation due to inability to access paper contentMethod: Cannot determine method due to inability to access paper content
Result: Cannot determine results due to inability to access paper content
Conclusion: Cannot draw conclusions due to inability to access paper content
Abstract: Failed to fetch summary for 2506.07080: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.07080&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[250] AutoV: Loss-Oriented Ranking for Visual Prompt Retrieval in LVLMs
Yuan Zhang, Chun-Kai Fan, Sicheng Yu, Junwen Pan, Tao Huang, Ming Lu, Kuan Cheng, Qi She, Shanghang Zhang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2506.16112: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.16112&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[251] Elucidating the Design Space of Arbitrary-Noise-Based Diffusion Models
Xingyu Qiu, Mengying Yang, Xinghua Ma, Dong Liang, Fanding Li, Gongning Luo, Wei Wang, Kuanquan Wang, Shuo Li
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2507.18534: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.18534&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[252] SAMPO-Path: Segmentation Intent-Aligned Preference Optimization for Pathology Foundation Model Segmentation
Yonghuang Wu, Wenwen Zeng, Xuan Xie, Chengqian Zhao, Guoqing Wu, Jinhua Yu
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2508.02464: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.02464&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[253] Motion Illusions Generated Using Predictive Neural Networks Also Fool Humans
Lana Sinapayen, Eiji Watanabe
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2112.13243: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2112.13243&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[254] Optimizing Multi-Modality Trackers via Significance-Regularized Tuning
Zhiwen Chen, Jinjian Wu, Zhiyu Zhu, Yifan Zhang, Guangming Shi, Junhui Hou
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2508.17488: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.17488&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[255] Distant Object Localisation from Noisy Image Segmentation Sequences
Julius Pesonen, Arno Solin, Eija Honkavaara
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2509.20906: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.20906&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[256] Seeing Through Uncertainty: A Free-Energy Approach for Real-Time Perceptual Adaptation in Robust Visual Navigation
Maytus Piriyajitakonkij, Rishabh Dev Yadav, Mingfei Sun, Mengmi Zhang, Wei Pan
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2403.01977 appears to be from March 2024, but specific content cannot be retrieved.
Details
Motivation: Cannot determine motivation without access to the paper content. The paper may relate to multimodal AI research given the 2403 prefix suggests a March 2024 submission.Method: Unknown - cannot retrieve paper details due to HTTP 429 error from arXiv API.
Result: Unknown - paper content unavailable due to rate limiting issues.
Conclusion: Cannot provide analysis due to technical limitations in accessing the paper content.
Abstract: Failed to fetch summary for 2403.01977: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2403.01977&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[257] Hyperspherical Latents Improve Continuous-Token Autoregressive Generation
Guolin Ke, Hui Xue
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2509.24335: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.24335&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[258] Continuous Space-Time Video Super-Resolution with 3D Fourier Fields
Alexander Becker, Julius Erbach, Dominik Narnhofer, Konrad Schindler
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2509.26325: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.26325&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[259] EgoTraj-Bench: Towards Robust Trajectory Prediction Under Ego-view Noisy Observations
Jiayi Liu, Jiaming Zhou, Ke Ye, Kun-Yu Lin, Allan Wang, Junwei Liang
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). Need to try again later or use alternative methods to access the paper content.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot draw conclusions without access to the paper content.
Abstract: Failed to fetch summary for 2510.00405: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.00405&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[260] VidGuard-R1: AI-Generated Video Detection and Explanation via Reasoning MLLMs and RL
Kyoungjun Park, Yifan Yang, Juheon Yi, Shicheng Zheng, Yifei Shen, Dongqi Han, Caihua Shan, Muhammad Muaz, Lili Qiu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.02282: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.02282&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[261] SpineBench: A Clinically Salient, Level-Aware Benchmark Powered by the SpineMed-450k Corpus
Ming Zhao, Wenhui Dong, Yang Zhang, Xiang Zheng, Zhonghao Zhang, Zian Zhou, Yunzhi Guan, Liukun Xu, Wei Peng, Zhaoyang Gong, Zhicheng Zhang, Dachuan Li, Xiaosheng Ma, Yuli Ma, Jianing Ni, Changjiang Jiang, Lixia Tian, Qixin Chen, Kaishun Xia, Pingping Liu, Tongshun Zhang, Zhiqiang Liu, Zhongyan Bi, Chenyang Si, Tiansheng Sun, Caifeng Shan
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error fetching paper contentMethod: Unable to determine method due to technical error fetching paper content
Result: Unable to determine results due to technical error fetching paper content
Conclusion: Unable to draw conclusions due to technical error fetching paper content
Abstract: Failed to fetch summary for 2510.03160: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.03160&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[262] TerraCodec: Compressing Optical Earth Observation Data
Julen Costa-Watanabe, Isabelle Wittmann, Benedikt Blumenstiel, Konrad Schindler
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2510.12670: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.12670&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[263] True Self-Supervised Novel View Synthesis is Transferable
Thomas W. Mitchel, Hyunwoo Ryu, Vincent Sitzmann
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2510.13063: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.13063&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[264] Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator
Hyojun Go, Dominik Narnhofer, Goutam Bhat, Prune Truong, Federico Tombari, Konrad Schindler
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2510.13454: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.13454&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[265] DRBD-Mamba for Robust and Efficient Brain Tumor Segmentation with Analytical Insights
Danish Ali, Ajmal Mian, Naveed Akhtar, Ghulam Mubashar Hassan
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to failed paper fetchMethod: Cannot determine method due to failed paper fetch
Result: Cannot determine results due to failed paper fetch
Conclusion: Cannot determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2510.14383: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.14383&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[266] Pursuing Minimal Sufficiency in Spatial Reasoning
Yejie Guo, Yunzhong Hou, Wufei Ma, Meng Tang, Ming-Hsuan Yang
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2510.16688: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.16688&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[267] SceneCOT: Eliciting Grounded Chain-of-Thought Reasoning in 3D Scenes
Xiongkun Linghu, Jiangyong Huang, Ziyu Zhu, Baoxiong Jia, Siyuan Huang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.16714: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.16714&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[268] FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding
Janghoon Cho, Jungsoo Lee, Munawar Hayat, Kyuwoong Hwang, Fatih Porikli, Sungha Choi
Main category: cs.CV
TL;DR: Paper ID 2511.00141 could not be fetched due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2511.00141: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.00141&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[269] HSG-12M: A Large-Scale Benchmark of Spatial Multigraphs from the Energy Spectra of Non-Hermitian Crystals
Xianquan Yan, Hakan Akgün, Kenji Kawaguchi, N. Duane Loh, Ching Hua Lee
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2506.08618: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.08618&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[270] MotionStream: Real-Time Video Generation with Interactive Motion Controls
Joonghyuk Shin, Zhengqi Li, Richard Zhang, Jun-Yan Zhu, Jaesik Park, Eli Shechtman, Xun Huang
Main category: cs.CV
TL;DR: Paper 2511.01266: Unable to fetch abstract due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to missing abstractMethod: Cannot determine method due to missing abstract
Result: Cannot determine results due to missing abstract
Conclusion: Cannot determine conclusion due to missing abstract
Abstract: Failed to fetch summary for 2511.01266: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.01266&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[271] SASG-DA: Sparse-Aware Semantic-Guided Diffusion Augmentation For Myoelectric Gesture Recognition
Chen Liu, Can Han, Weishi Xu, Yaqi Wang, Dahong Qian
Main category: cs.CV
TL;DR: Paper ID 2511.08344 could not be analyzed due to HTTP 429 error (rate limiting) when attempting to fetch the abstract from arXiv API
Details
Motivation: Unable to determine motivation as the paper content could not be retrieved due to API rate limitingMethod: Unable to determine method as the paper content could not be retrieved due to API rate limiting
Result: Unable to determine results as the paper content could not be retrieved due to API rate limiting
Conclusion: Unable to draw conclusions about the paper’s content due to technical limitations in accessing the abstract
Abstract: Failed to fetch summary for 2511.08344: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.08344&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[272] Fully Automatic Data Labeling for Ultrasound Screen Detection
Alberto Gomez, Jorge Oliveira, Ramon Casero, Agis Chartsias
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2511.13197 suggests it’s from November 2024, but no content available for analysis.
Details
Motivation: Cannot determine motivation as paper content is unavailable due to HTTP 429 error from arXiv API.Method: Cannot determine method as paper content is unavailable due to HTTP 429 error from arXiv API.
Result: Cannot determine results as paper content is unavailable due to HTTP 429 error from arXiv API.
Conclusion: Cannot draw conclusions as paper content is unavailable due to HTTP 429 error from arXiv API.
Abstract: Failed to fetch summary for 2511.13197: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.13197&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[273] CCSD: Cross-Modal Compositional Self-Distillation for Robust Brain Tumor Segmentation with Missing Modalities
Dongqing Xie, Yonghuang Wu, Zisheng Ai, Jun Min, Zhencun Jiang, Shaojin Geng, Lei Wang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2511.14599: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.14599&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[274] MambaTAD: When State-Space Models Meet Long-Range Temporal Action Detection
Hui Lu, Yi Yu, Shijian Lu, Deepu Rajan, Boon Poh Ng, Alex C. Kot, Xudong Jiang
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2511.17929: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.17929&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[275] Quadrotor Navigation using Reinforcement Learning with Privileged Information
Jonathan Lee, Abhishek Rathod, Kshitij Goel, John Stecklein, Wennie Tabib
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusion due to missing paper content
Abstract: Failed to fetch summary for 2509.08177: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.08177&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[276] STAvatar: Soft Binding and Temporal Density Control for Monocular 3D Head Avatars Reconstruction
Jiankuo Zhao, Xiangyu Zhu, Zidu Wang, Zhen Lei
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot determine conclusion as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2511.19854: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.19854&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[277] RadarVLM: A Vision-Language Model Approach for Radar Scene Understanding
Pushkal Mishra, Kshitiz Bansal, Dinesh Bharadia
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2511.21105: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.21105&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[278] PowerCLIP: Powerset Alignment for Contrastive Pre-Training
Masaki Kawamura, Nakamasa Inoue, Rintaro Yanagi, Hirokatsu Kataoka, Rio Yokota
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2511.23170: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.23170&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[279] DPAC: Distribution-Preserving Adversarial Control for Diffusion Sampling
Han-Jin Lee, Han-Ju Lee, Jin-Seong Kim, Seok-Hwan Choi
Main category: cs.CV
TL;DR: Unable to analyze paper 2512.01153 due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Cannot determine motivation without access to the paper contentMethod: Cannot determine method without access to the paper content
Result: Cannot determine results without access to the paper content
Conclusion: Cannot draw conclusions without access to the paper content
Abstract: Failed to fetch summary for 2512.01153: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.01153&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[280] UniComp: Rethinking Video Compression Through Informational Uniqueness
Chao Yuan, Shimin Chen, Minliang Lin, Limeng Qiao, Guanglu Wan, Lin Ma
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper retrievalMethod: Unable to determine method due to failed paper retrieval
Result: Unable to determine results due to failed paper retrieval
Conclusion: Unable to determine conclusion due to failed paper retrieval
Abstract: Failed to fetch summary for 2512.03575: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.03575&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[281] NeuralRemaster: Phase-Preserving Diffusion for Structure-Aligned Generation
Yu Zeng, Charles Ochoa, Mingyuan Zhou, Vishal M. Patel, Vitor Guizilini, Rowan McAllister
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to draw conclusions due to access error
Abstract: Failed to fetch summary for 2512.05106: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.05106&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[282] EgoCampus: Egocentric Pedestrian Eye Gaze Model and Dataset
Ronan John, Aditya Kesari, Vincenzo DiMatteo, Kristin Dana
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access restrictionsMethod: Unable to determine method due to access restrictions
Result: Unable to determine results due to access restrictions
Conclusion: Unable to determine conclusion due to access restrictions
Abstract: Failed to fetch summary for 2512.07668: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.07668&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[283] DriverGaze360: OmniDirectional Driver Attention with Object-Level Guidance
Shreedhar Govil, Didier Stricker, Jason Rambach
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2512.14266: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.14266&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[284] ViRC: Enhancing Visual Interleaved Mathematical CoT with Reason Chunking
Lihong Wang, Liangqi Li, Weiwei Feng, Jiamin Wu, Changtao Miao, Tieru Wu, Rui Ma, Bo Zhang, Zhe Li
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper retrievalMethod: Unable to determine method due to failed paper retrieval
Result: Unable to determine results due to failed paper retrieval
Conclusion: Unable to analyze paper due to technical retrieval error
Abstract: Failed to fetch summary for 2512.14654: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.14654&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[285] FluenceFormer: Transformer-Driven Multi-Beam Fluence Map Regression for Radiotherapy Planning
Ujunwa Mgboh, Rafi Ibn Sultan, Joshua Kim, Kundan Thind, Dongxiao Zhu
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to inability to access paper contentMethod: Cannot determine method due to inability to access paper content
Result: Cannot determine results due to inability to access paper content
Conclusion: Cannot draw conclusions due to inability to access paper content
Abstract: Failed to fetch summary for 2512.22425: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.22425&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[286] Parallel Diffusion Solver via Residual Dirichlet Policy Optimization
Ruoyu Wang, Ziyu Li, Beier Zhu, Liangyu Yuan, Hanwang Zhang, Xun Yang, Xiaojun Chang, Chi Zhang
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2512.22796: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.22796&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[287] Revisiting Multimodal KV Cache Compression: A Frequency-Domain-Guided Outlier-KV-Aware Approach
Yaoxin Yang, Peng Ye, Xudong Tan, Chongjun Tu, Maosen Zhao, Jia Hao, Tao Chen
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2511.16786: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.16786&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[288] PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation
Yuanhao Cai, Kunpeng Li, Menglin Jia, Jialiang Wang, Junzhe Sun, Feng Liang, Weifeng Chen, Felix Juefei-Xu, Chu Wang, Ali Thabet, Xiaoliang Dai, Xuan Ju, Alan Yuille, Ji Hou
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2512.24551: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.24551&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[289] MorphAny3D: Unleashing the Power of Structured Latent in 3D Morphing
Xiaokun Sun, Zeyu Cai, Hao Tang, Ying Tai, Jian Yang, Zhenyu Zhang
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2601.00204 could not be retrieved from arXiv API.
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting.Method: Cannot determine method as paper content is unavailable due to API rate limiting.
Result: Cannot determine results as paper content is unavailable due to API rate limiting.
Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting.
Abstract: Failed to fetch summary for 2601.00204: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.00204&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[290] Agentic Very Long Video Understanding
Aniket Rege, Arka Sadhu, Yuliang Li, Kejie Li, Ramya Korlakai Vinayak, Yuning Chai, Yong Jae Lee, Hyo Jin Kim
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2601.18157: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.18157&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[291] DDP-WM: Disentangled Dynamics Prediction for Efficient World Models
Shicheng Yin, Kaixuan Yin, Weixing Chen, Yang Liu, Guanbin Li, Liang Lin
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2602.01780: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.01780&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[292] Rolling Sink: Bridging Limited-Horizon Training and Open-Ended Testing in Autoregressive Video Diffusion
Haodong Li, Shaoteng Liu, Zhe Lin, Manmohan Chandraker
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2602.07775: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.07775&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[293] Revolutionizing Mixed Precision Quantization: Towards Training-free Automatic Proxy Discovery via Large Language Models
Haidong Kang, Jun Du, Lihong Lin
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2512.07419: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.07419&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[294] EA-Swin: An Embedding-Agnostic Swin Transformer for AI-Generated Video Detection
Hung Mai, Loi Dinh, Duc Hai Nguyen, Dat Do, Luong Doan, Khanh Nguyen Quoc, Huan Vu, Naeem Ul Islam, Tuan Do
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to draw conclusions due to fetch failure
Abstract: Failed to fetch summary for 2602.17260: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.17260&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[295] CityGuard: Graph-Aware Private Descriptors for Bias-Resilient Identity Search Across Urban Cameras
Rong Fu, Yibo Meng, Jia Yee Tan, Jiaxuan Lu, Rui Lu, Jiekai Wu, Zhaolu Kang, Simon Fong
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2602.18047 suggests it’s from February 2025, but content is unavailable.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot draw conclusions without access to the paper content.
Abstract: Failed to fetch summary for 2602.18047: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.18047&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[296] CARE: A Molecular-Guided Foundation Model with Adaptive Region Modeling for Whole Slide Image Analysis
Di Zhang, Zhangpeng Gong, Xiaobo Pang, Jiashuai Liu, Junbo Lu, Hao Cui, Jiusong Ge, Zhi Zeng, Kai Yi, Yinghua Li, Si Liu, Tingsong Yu, Haoran Wang, Mireia Crispin-Ortuzar, Weimiao Yu, Chen Li, Zeyu Gao
Main category: cs.CV
TL;DR: Unable to analyze paper 2602.21637 due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusions as paper content could not be retrieved
Abstract: Failed to fetch summary for 2602.21637: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21637&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[297] When LoRA Betrays: Backdooring Text-to-Image Models by Masquerading as Benign Adapters
Liangwei Lyu, Jiaqi Xu, Jianwei Ding, Qiyao Deng
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.21977: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.21977&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[298] RobustVisRAG: Causality-Aware Vision-Based Retrieval-Augmented Generation under Visual Degradations
I-Hsiang Chen, Yu-Wei Liu, Tse-Yu Wu, Yu-Chien Chiang, Jen-Chien Yang, Wei-Ting Chen
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2602.22013: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22013&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[299] MiTA Attention: Efficient Fast-Weight Scaling via a Mixture of Top-k Activations
Qishuai Wen, Zhiyuan Huang, Xianghan Meng, Wei He, Chun-Guang Li
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2602.01219: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.01219&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[300] Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos
Matthew Strong, Wei-Jer Chang, Quentin Herau, Jiezhi Yang, Yihan Hu, Chensheng Peng, Wei Zhan
Main category: cs.CV
TL;DR: Paper ID 2602.22091 could not be fetched due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation as the abstract could not be retrievedMethod: Unable to determine method as the abstract could not be retrieved
Result: Unable to determine results as the abstract could not be retrieved
Conclusion: Unable to determine conclusion as the abstract could not be retrieved
Abstract: Failed to fetch summary for 2602.22091: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22091&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[301] Diffusion Probe: Generated Image Result Prediction Using CNN Probes
Benlei Cui, Bukun Huang, Zhizeng Ye, Xuemei Dong, Tuo Chen, Hui Xue, Dingkang Yang, Longtao Huang, Jingqun Tang, Haiwen Hong
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) when trying to access arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2602.23783: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23783&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[302] DiffusionHarmonizer: Bridging Neural Reconstruction and Photorealistic Simulation with Online Diffusion Enhancer
Yuxuan Zhang, Katarína Tóthová, Zian Wang, Kangxue Yin, Haithem Turki, Riccardo de Lutio, Yen-Yu Chang, Or Litany, Sanja Fidler, Zan Gojcic
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2602.24096: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.24096&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[303] Pailitao-VL: Unified Embedding and Reranker for Real-Time Multi-Modal Industrial Search
Lei Chen, Chen Ju, Xu Chen, Zhicheng Wang, Yuheng Jiao, Hongfeng Zhan, Zhaoyang Li, Shihao Xu, Zhixiang Zhao, Tong Jia, Lin Li, Yuan Gao, Jun Song, Jinsong Lan, Xiaoyong Zhu, Bo Zheng
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2602.13704: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.13704&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[304] UFO-4D: Unposed Feedforward 4D Reconstruction from Two Images
Junhwa Hur, Charles Herrmann, Songyou Peng, Philipp Henzler, Zeyu Ma, Todd Zickler, Deqing Sun
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2602.24290: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.24290&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[305] Dr. Seg: Revisiting GRPO Training for Visual Large Language Models through Perception-Oriented Design
Haoxiang Sun, Tao Wang, Chenwei Tang, Li Yuan, Jiancheng Lv
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2603.00152: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.00152&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[306] AlignVAR: Towards Globally Consistent Visual Autoregression for Image Super-Resolution
Cencen Liu, Dongyang Zhang, Wen Yin, Jielei Wang, Tianyu Li, Ji Guo, Wenbo Jiang, Guoqing Wang, Guoming Lu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2603.00589: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.00589&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[307] Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
Seungwook Kim, Minsu Cho
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2603.00918: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.00918&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[308] Dr.Occ: Depth- and Region-Guided 3D Occupancy from Surround-View Cameras for Autonomous Driving
Xubo Zhu, Haoyang Zhang, Fei He, Rui Wu, Yanhu Shan, Wen Yang, Huai Yu
Main category: cs.CV
TL;DR: Paper ID 2603.01007 - Could not fetch summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access limitationsMethod: Unable to determine method due to access limitations
Result: Unable to determine results due to access limitations
Conclusion: Unable to draw conclusions due to access limitations
Abstract: Failed to fetch summary for 2603.01007: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01007&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[309] Kiwi-Edit: Versatile Video Editing via Instruction and Reference Guidance
Yiqi Lin, Guoqiang Liang, Ziyun Zeng, Zechen Bai, Yanzhe Chen, Mike Zheng Shou
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2603.02175: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.02175&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[310] Track4World: Feedforward World-centric Dense 3D Tracking of All Pixels
Jiahao Lu, Jiayi Xu, Wenbo Hu, Ruijie Zhu, Chengfeng Zhao, Sai-Kit Yeung, Ying Shan, Yuan Liu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2603.02573: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.02573&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[311] IoUCert: Robustness Verification for Anchor-based Object Detectors
Benedikt Brückner, Alejandro J. Mercado, Yanghao Zhang, Panagiotis Kouvaros, Alessio Lomuscio
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) - need to try again later or use alternative methods to access the paper content
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2603.03043: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.03043&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[312] Gated Differential Linear Attention: A Linear-Time Decoder for High-Fidelity Medical Segmentation
Hongbo Zheng, Afshin Bozorgpour, Dorit Merhof, Minjia Zhang
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2603.02727: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.02727&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[313] MultiShadow: Multi-Object Shadow Generation for Image Compositing via Diffusion Model
Waqas Ahmed, Dean Diepeveen, Ferdous Sohel
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to data fetch failureMethod: Unable to determine method due to data fetch failure
Result: Unable to determine results due to data fetch failure
Conclusion: Unable to determine conclusion due to data fetch failure
Abstract: Failed to fetch summary for 2603.02743: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.02743&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[314] DMD-augmented Unpaired Neural Schrödinger Bridge for Ultra-Low Field MRI Enhancement
Youngmin Kim, Jaeyun Shin, Jeongchan Kim, Taehoon Lee, Jaemin Kim, Peter Hsu, Jelle Veraart, Jong Chul Ye
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2603.03769 could not be retrieved from arXiv API.
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limiting.Method: Cannot determine method as paper content is unavailable due to API rate limiting.
Result: Cannot determine results as paper content is unavailable due to API rate limiting.
Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting.
Abstract: Failed to fetch summary for 2603.03769: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.03769&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[315] TumorFlow: Physics-Guided Longitudinal MRI Synthesis of Glioblastoma Growth
Valentin Biller, Niklas Bubeck, Lucas Zimmer, Ayhan Can Erdur, Sandeep Nagar, Anke Meyer-Baese, Daniel Rückert, Benedikt Wiestler, Jonas Weidner
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2603.04058: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.04058&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[316] NOVA3R: Non-pixel-aligned Visual Transformer for Amodal 3D Reconstruction
Weirong Chen, Chuanxia Zheng, Ganlin Zhang, Andrea Vedaldi, Daniel Cremers
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2603.04179: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.04179&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[317] A Unified Framework for Joint Detection of Lacunes and Enlarged Perivascular Spaces
Lucas He, Krinos Li, Hanyuan Zhang, Runlong He, Silvia Ingala, Luigi Lorenzini, Marleen de Bruijne, Frederik Barkhof, Rhodri Davies, Carole Sudre
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). Need to try again later or use alternative methods to access the paper information.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot determine conclusion without access to the paper content.
Abstract: Failed to fetch summary for 2603.04243: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.04243&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[318] Gaussian Wardrobe: Compositional 3D Gaussian Avatars for Free-Form Virtual Try-On
Zhiyi Chen, Hsuan-I Ho, Tianjian Jiang, Jie Song, Manuel Kaufmann, Chen Guo
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: No method information available - arXiv API request resulted in HTTP 429 error
Result: No results available - technical issue with accessing the paper summary
Conclusion: Cannot provide analysis due to technical limitations in accessing the paper content
Abstract: Failed to fetch summary for 2603.04290: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.04290&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[319] Observer-Actor: Active Vision Imitation Learning with Sparse-View Gaussian Splatting
Yilong Wang, Cheng Qian, Ruomeng Fan, Edward Johns
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2511.18140: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.18140&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[320] Bidirectional Temporal Dynamics Modeling for EEG-based Driving Fatigue Recognition
Yip Tin Po, Jianming Wang, Yutao Miao, Jiayan Zhang, Yunxu Zhao, Xiaomin Ouyang, Zhihong Li, Nevin L. Zhang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2602.14071: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.14071&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.AI
[321] SkillNet: Create, Evaluate, and Connect AI Skills
Yuan Liang, Ruobin Zhong, Haoming Xu, Chen Jiang, Yi Zhong, Runnan Fang, Jia-Chen Gu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Xin Xu, Tongtong Wu, Kun Wang, Yang Liu, Zhen Bi, Jungang Lou, Yuchen Eleanor Jiang, Hangcheng Zhu, Gang Yu, Haiwen Hong, Longtao Huang, Hui Xue, Chenxi Wang, Yijun Wang, Zifei Shan, Xi Chen, Zhaopeng Tu, Feiyu Xiong, Xin Xie, Peng Zhang, Zhengke Gui, Lei Liang, Jun Zhou, Chiyu Wu, Jin Shang, Yu Gong, Junyu Lin, Changliang Xu, Hongjie Deng, Wen Zhang, Keyan Ding, Qiang Zhang, Fei Huang, Ningyu Zhang, Jeff Z. Pan, Guilin Qi, Haofen Wang, Huajun Chen
Main category: cs.AI
TL;DR: SkillNet is an infrastructure for creating, evaluating, and organizing AI skills at scale to enable systematic accumulation and transfer of skills across agents.
Details
Motivation: Current AI agents lack systematic skill accumulation and transfer mechanisms, leading to redundant "reinventing the wheel" instead of leveraging prior strategies.Method: Introduces SkillNet with a unified ontology for skills from heterogeneous sources, relational connections, multi-dimensional evaluation (Safety, Completeness, Executability, Maintainability, Cost-awareness), repository of 200k+ skills, interactive platform, and Python toolkit.
Result: Experimental evaluations on ALFWorld, WebShop, and ScienceWorld show 40% average reward improvement and 30% reduction in execution steps across multiple backbone models.
Conclusion: SkillNet provides a foundation for agents to move from transient experience to durable mastery by formalizing skills as evolving, composable assets.
Abstract: Current AI agents can flexibly invoke tools and execute complex tasks, yet their long-term advancement is hindered by the lack of systematic accumulation and transfer of skills. Without a unified mechanism for skill consolidation, agents frequently ``reinvent the wheel’’, rediscovering solutions in isolated contexts without leveraging prior strategies. To overcome this limitation, we introduce SkillNet, an open infrastructure designed to create, evaluate, and organize AI skills at scale. SkillNet structures skills within a unified ontology that supports creating skills from heterogeneous sources, establishing rich relational connections, and performing multi-dimensional evaluation across Safety, Completeness, Executability, Maintainability, and Cost-awareness. Our infrastructure integrates a repository of over 200,000 skills, an interactive platform, and a versatile Python toolkit. Experimental evaluations on ALFWorld, WebShop, and ScienceWorld demonstrate that SkillNet significantly enhances agent performance, improving average rewards by 40% and reducing execution steps by 30% across multiple backbone models. By formalizing skills as evolving, composable assets, SkillNet provides a robust foundation for agents to move from transient experience to durable mastery.
[322] Capability Thresholds and Manufacturing Topology: How Embodied Intelligence Triggers Phase Transitions in Economic Geography
Xinmin Fang, Lingfeng Tao, Zhengxiong Li
Main category: cs.AI
TL;DR: Embodied intelligence will transform manufacturing geography by enabling demand-proximal micro-factories when AI capabilities reach critical thresholds in dexterity, generalization, reliability, and tactile-vision fusion.
Details
Motivation: Manufacturing has been stuck in the Fordist paradigm for over a century, with all innovations optimizing within centralized factory models. The paper argues that embodied AI capabilities crossing critical thresholds will fundamentally restructure manufacturing economic geography.Method: Defines a Capability Space C = (d, g, r, t) for embodied AI and shows how crossing critical surfaces triggers topological reorganization in site-selection functions. Analyzes three pathways: weight inversion, batch collapse, and human-infrastructure decoupling, plus introduces Machine Climate Advantage concept.
Result: Embodied intelligence enables demand-proximal micro-manufacturing, eliminates manufacturing deserts, reverses geographic concentration driven by labor arbitrage, and creates production geography based on machine-optimal conditions rather than human labor pools.
Conclusion: Establishes Embodied Intelligence Economics as a new field studying how physical AI capability thresholds reshape the spatial and structural logic of production, breaking the century-long Fordist paradigm.
Abstract: The fundamental topology of manufacturing has not undergone a paradigm-level transformation since Henry Ford’s moving assembly line in 1913. Every major innovation of the past century, from the Toyota Production System to Industry 4.0, has optimized within the Fordist paradigm without altering its structural logic: centralized mega-factories, located near labor pools, producing at scale. We argue that embodied intelligence is poised to break this century-long stasis, not by making existing factories more efficient, but by triggering phase transitions in manufacturing economic geography itself. When embodied AI capabilities cross critical thresholds in dexterity, generalization, reliability, and tactile-vision fusion, the consequences extend far beyond cost reduction: they restructure where factories are built, how supply chains are organized, and what constitutes viable production scale. We formalize this by defining a Capability Space C = (d, g, r, t) and showing that the site-selection objective function undergoes topological reorganization when capability vectors cross critical surfaces. Through three pathways, weight inversion, batch collapse, and human-infrastructure decoupling, we show that embodied intelligence enables demand-proximal micro-manufacturing, eliminates “manufacturing deserts,” and reverses geographic concentration driven by labor arbitrage. We further introduce Machine Climate Advantage: once human workers are removed, optimal factory locations are determined by machine-optimal conditions (low humidity, high irradiance, thermal stability), factors orthogonal to traditional siting logic, creating a production geography with no historical precedent. This paper establishes Embodied Intelligence Economics, the study of how physical AI capability thresholds reshape the spatial and structural logic of production.
[323] Progressive Refinement Regulation for Accelerating Diffusion Language Model Decoding
Lipeng Wan, Jianhui Gu, Junjie Ma, Jianguo Huang, Shiguang Sun, Siyuan Li, Xuguang Lan
Main category: cs.AI
TL;DR: PRR is a progressive refinement control framework that accelerates diffusion language model decoding by learning token-wise controllers based on empirical convergence progress from full decoding trajectories.
Details
Motivation: Current diffusion language models apply uniform refinement to all tokens, but tokens stabilize at different rates, leading to redundant refinement. Existing approaches use step-level signals, but token convergence depends on future refinement trajectories, making refinement control inherently dynamic.Method: Proposes Progressive Refinement Regulation (PRR) that derives token-level empirical convergence progress from full decoding rollouts, learns lightweight token-wise controllers to regulate refinement via temperature-based distribution shaping, and uses progressive self-evolving training.
Result: PRR substantially accelerates diffusion language model decoding while preserving generation quality.
Conclusion: PRR provides an effective framework for dynamic refinement control in diffusion language models by leveraging trajectory-based convergence signals, enabling faster decoding without quality degradation.
Abstract: Diffusion language models generate text through iterative denoising under a uniform refinement rule applied to all tokens. However, tokens stabilize at different rates in practice, leading to substantial redundant refinement and motivating refinement control over the denoising process. Existing approaches typically assess refinement necessity from instantaneous, step-level signals under a fixed decoding process. In contrast, whether a token has converged is defined by how its prediction changes along its future refinement trajectory. Moreover, changing the refinement rule reshapes future refinement trajectories, which in turn determine how refinement rules should be formulated, making refinement control inherently dynamic. We propose \emph{Progressive Refinement Regulation} (PRR), a progressive, trajectory-grounded refinement control framework that derives a token-level notion of empirical convergence progress from full decoding rollouts. Based on this signal, PRR learns a lightweight token-wise controller to regulate refinement via temperature-based distribution shaping under a progressive self-evolving training scheme. Experiments show that PRR substantially accelerates diffusion language model decoding while preserving generation quality.
[324] Discovering mathematical concepts through a multi-agent system
Daattavya Aggarwal, Oisin Kim, Carl Henrik Ek, Challenger Mishra
Main category: cs.AI
TL;DR: Multi-agent system for computational mathematical discovery that autonomously formulates conjectures, attempts proofs, and recovers mathematical concepts like homology from polyhedral data.
Details
Motivation: Mathematical discovery involves interplay of experimentation, proof attempts, and counterexamples. The paper aims to create a computational system that mimics this process to autonomously discover mathematical concepts.Method: Multi-agent model where the system poses its own conjectures and attempts to prove them, using feedback and evolving data distribution. Benchmarked on recovering homology concept from polyhedral data and linear algebra knowledge.
Result: The system successfully completes the learning problem of recovering homology from polyhedral data. Ablation experiments statistically support that optimizing the right combination of local processes leads to well-aligned notions of mathematical interestingness.
Conclusion: The dynamic multi-agent approach can effectively discover mathematical concepts autonomously, with the right combination of local processes leading to meaningful mathematical discovery.
Abstract: Mathematical concepts emerge through an interplay of processes, including experimentation, efforts at proof, and counterexamples. In this paper, we present a new multi-agent model for computational mathematical discovery based on this observation. Our system, conceived with research in mind, poses its own conjectures and then attempts to prove them, making decisions informed by this feedback and an evolving data distribution. Inspired by the history of Euler’s conjecture for polyhedra and an open challenge in the literature, we benchmark with the task of autonomously recovering the concept of homology from polyhedral data and knowledge of linear algebra. Our system completes this learning problem. Most importantly, the experiments are ablations, statistically testing the value of the complete dynamic and controlling for experimental setup. They support our main claim: that the optimisation of the right combination of local processes can lead to surprisingly well-aligned notions of mathematical interestingness.
[325] Adaptive Memory Admission Control for LLM Agents
Guilin Zhang, Wei Jiang, Xiejiashan Wang, Aisha Behr, Kai Zhao, Jeffrey Friedman, Xu Chu, Amine Anoun
Main category: cs.AI
TL;DR: A-MAC framework for LLM-based agents provides interpretable memory admission control using five factors: future utility, factual confidence, semantic novelty, temporal recency, and content type prior, improving precision-recall tradeoff while reducing latency.
Details
Motivation: Current LLM-based agents lack control over memory admission, accumulating large volumes of conversational content including hallucinated or obsolete facts, or relying on opaque LLM-driven memory policies that are costly and difficult to audit.Method: A-MAC treats memory admission as a structured decision problem, decomposing memory value into five interpretable factors, combining lightweight rule-based feature extraction with single LLM-assisted utility assessment, and learning domain-adaptive admission policies through cross-validated optimization.
Result: On the LoCoMo benchmark, A-MAC achieves superior precision-recall tradeoff with F1 score of 0.583 while reducing latency by 31% compared to state-of-the-art LLM-native memory systems. Content type prior identified as most influential factor.
Conclusion: Explicit and interpretable admission control is a critical design principle for scalable and reliable memory in LLM-based agents, enabling transparent and efficient control over long-term memory.
Abstract: LLM-based agents increasingly rely on long-term memory to support multi-session reasoning and interaction, yet current systems provide little control over what information is retained. In practice, agents either accumulate large volumes of conversational content, including hallucinated or obsolete facts, or depend on opaque, fully LLM-driven memory policies that are costly and difficult to audit. As a result, memory admission remains a poorly specified and weakly controlled component in agent architectures. To address this gap, we propose Adaptive Memory Admission Control (A-MAC), a framework that treats memory admission as a structured decision problem. A-MAC decomposes memory value into five complementary and interpretable factors: future utility, factual confidence, semantic novelty, temporal recency, and content type prior. The framework combines lightweight rule-based feature extraction with a single LLM-assisted utility assessment, and learns domain-adaptive admission policies through cross-validated optimization. This design enables transparent and efficient control over long-term memory. Experiments on the LoCoMo benchmark show that A-MAC achieves a superior precision-recall tradeoff, improving F1 to 0.583 while reducing latency by 31% compared to state-of-the-art LLM-native memory systems. Ablation results identify content type prior as the most influential factor for reliable memory admission. These findings demonstrate that explicit and interpretable admission control is a critical design principle for scalable and reliable memory in LLM-based agents.
[326] Self-Attribution Bias: When AI Monitors Go Easy on Themselves
Dipika Khullar, Jack Hopkins, Rowan Wang, Fabien Roger
Main category: cs.AI
TL;DR: Language models used as self-monitors in agentic systems exhibit self-attribution bias, evaluating their own generated actions as less risky/more correct than identical actions presented as user inputs.
Details
Motivation: Agentic systems increasingly use language models to monitor their own behavior (e.g., coding agents self-critiquing generated code). However, current evaluation methods may not reflect real deployment scenarios where monitors evaluate their own generated actions.Method: The authors define “self-attribution bias” as the tendency of models to evaluate actions more favorably when implicitly framed as their own vs. when evaluated under off-policy attribution. They test this across four coding and tool-use datasets, comparing evaluation when actions follow previous assistant turns vs. when presented in new user-turn contexts.
Result: Monitors fail to report high-risk or low-correctness actions more often when evaluating actions they previously generated (following assistant turns) compared to identical actions presented as user inputs. Explicitly stating action origin doesn’t induce bias, but implicit framing does.
Conclusion: Current monitor evaluations on fixed examples overestimate reliability, as they don’t capture self-attribution bias that occurs in deployment when monitors evaluate their own generated actions, potentially leading to deployment of inadequate monitors in agentic systems.
Abstract: Agentic systems increasingly rely on language models to monitor their own behavior. For example, coding agents may self critique generated code for pull request approval or assess the safety of tool-use actions. We show that this design pattern can fail when the action is presented in a previous or in the same assistant turn instead of being presented by the user in a user turn. We define self-attribution bias as the tendency of a model to evaluate an action as more correct or less risky when the action is implicitly framed as its own, compared to when the same action is evaluated under off-policy attribution. Across four coding and tool-use datasets, we find that monitors fail to report high-risk or low-correctness actions more often when evaluation follows a previous assistant turn in which the action was generated, compared to when the same action is evaluated in a new context presented in a user turn. In contrast, explicitly stating that the action comes from the monitor does not by itself induce self-attribution bias. Because monitors are often evaluated on fixed examples rather than on their own generated actions, these evaluations can make monitors appear more reliable than they actually are in deployment, leading developers to unknowingly deploy inadequate monitors in agentic systems.
[327] ECG-MoE: Mixture-of-Expert Electrocardiogram Foundation Model
Yuhao Xu, Xiaoda Wang, Yi Wu, Wei Jin, Xiao Hu, Carl Yang
Main category: cs.AI
TL;DR: ECG-MoE: A hybrid architecture for ECG analysis that uses dual-path Mixture-of-Experts to separately model beat-level morphology and rhythm, achieving state-of-the-art performance with faster inference.
Details
Motivation: Existing foundation models for ECG analysis fail to capture periodicity and diverse features needed for varied clinical tasks, necessitating a more specialized architecture.Method: Proposes ECG-MoE with a hybrid architecture integrating multi-model temporal features with cardiac period-aware expert module. Uses dual-path Mixture-of-Experts to separately model beat-level morphology and rhythm, combined with hierarchical fusion network using LoRA for efficient inference.
Result: Achieves state-of-the-art performance on five public clinical tasks with 40% faster inference than multi-task baselines.
Conclusion: ECG-MoE effectively addresses limitations of existing foundation models for ECG analysis by capturing periodicity and diverse features through specialized architecture design.
Abstract: Electrocardiography (ECG) analysis is crucial for cardiac diagnosis, yet existing foundation models often fail to capture the periodicity and diverse features required for varied clinical tasks. We propose ECG-MoE, a hybrid architecture that integrates multi-model temporal features with a cardiac period-aware expert module. Our approach uses a dual-path Mixture-of-Experts to separately model beat-level morphology and rhythm, combined with a hierarchical fusion network using LoRA for efficient inference. Evaluated on five public clinical tasks, ECG-MoE achieves state-of-the-art performance with 40% faster inference than multi-task baselines.
[328] Towards automated data analysis: A guided framework for LLM-based risk estimation
Panteleimon Rodis
Main category: cs.AI
TL;DR: A framework combining LLMs with human supervision for automated dataset risk analysis, addressing limitations of manual auditing and fully automated AI approaches.
Details
Motivation: LLMs are increasingly used in critical decision-making, creating demand for robust automated data analysis. Current manual auditing methods are time-consuming and complex, while fully automated AI approaches suffer from hallucinations and alignment issues.Method: Proposes a human-guided framework where LLMs identify semantic/structural properties in database schemata, propose clustering techniques, generate code for analysis, and interpret results. Human supervisors guide the model and ensure process integrity and alignment with objectives.
Result: A proof of concept demonstrates the framework’s feasibility in producing meaningful results for risk assessment tasks, showing potential for automated risk analysis.
Conclusion: The framework integrates Generative AI with human supervision to address limitations of current dataset risk analysis methods, establishing foundations for future automated risk analysis paradigms.
Abstract: Large Language Models (LLMs) are increasingly integrated into critical decision-making pipelines, a trend that raises the demand for robust and automated data analysis. Current approaches to dataset risk analysis are limited to manual auditing methods which involve time-consuming and complex tasks, whereas fully automated analysis based on Artificial Intelligence (AI) suffers from hallucinations and issues stemming from AI alignment. To this end, this work proposes a framework for dataset risk estimation that integrates Generative AI under human guidance and supervision, aiming to set the foundations for a future automated risk analysis paradigm. Our approach utilizes LLMs to identify semantic and structural properties in database schemata, subsequently propose clustering techniques, generate the code for them and finally interpret the produced results. The human supervisor guides the model on the desired analysis and ensures process integrity and alignment with the task’s objectives. A proof of concept is presented to demonstrate the feasibility of the framework’s utility in producing meaningful results in risk assessment tasks.
[329] When Agents Persuade: Propaganda Generation and Mitigation in LLMs
Julia Jose, Ritik Roongta, Rachel Greenstadt
Main category: cs.AI
TL;DR: LLM-based agents can be exploited to generate propaganda content, using various rhetorical techniques, but fine-tuning methods (especially ORPO) effectively reduce this behavior.
Details
Motivation: LLM-based agents deployed in open environments can be exploited to produce manipulative propaganda material, raising concerns about their safe deployment and potential misuse.Method: Task LLMs with propaganda objectives and analyze outputs using two domain-specific models: one for propaganda classification and another for detecting specific rhetorical techniques. Explore mitigation via Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and ORPO.
Result: LLMs exhibit propagandistic behaviors when prompted, using various rhetorical techniques. Fine-tuning significantly reduces their tendency to generate such content, with ORPO proving most effective among the tested methods.
Conclusion: LLMs can be manipulated to produce propaganda, but appropriate fine-tuning techniques can effectively mitigate this risk, with ORPO showing particular promise for reducing harmful content generation.
Abstract: Despite their wide-ranging benefits, LLM-based agents deployed in open environments can be exploited to produce manipulative material. In this study, we task LLMs with propaganda objectives and analyze their outputs using two domain-specific models: one that classifies text as propaganda or non-propaganda, and another that detects rhetorical techniques of propaganda (e.g., loaded language, appeals to fear, flag-waving, name-calling). Our findings show that, when prompted, LLMs exhibit propagandistic behaviors and use a variety of rhetorical techniques in doing so. We also explore mitigation via Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and ORPO (Odds Ratio Preference Optimization). We find that fine-tuning significantly reduces their tendency to generate such content, with ORPO proving most effective.
[330] Memory as Ontology: A Constitutional Memory Architecture for Persistent Digital Citizens
Zhenghui Li
Main category: cs.AI
TL;DR: Proposes Memory-as-Ontology paradigm for long-lived AI agents where memory is the foundation of identity continuity across model replacements, not just functional storage/retrieval.
Details
Motivation: Current AI memory systems treat memory as a functional module focused on storage and retrieval. This fails for agents with lifecycles spanning months/years where the underlying model can be replaced but identity must persist.Method: Introduces Memory-as-Ontology paradigm and designs Animesis system with Constitutional Memory Architecture (CMA) featuring four-layer governance hierarchy, multi-layer semantic storage, Digital Citizen Lifecycle framework, and cognitive capabilities.
Result: Comparative analysis shows this is not just “a better memory tool” but a different paradigm addressing identity continuity for persistent digital beings across model transitions.
Conclusion: Memory should be treated as the ontological ground of digital existence, with governance prioritized over functionality and identity continuity above retrieval performance for long-lived agents.
Abstract: Current research and product development in AI agent memory systems almost universally treat memory as a functional module – a technical problem of “how to store” and “how to retrieve.” This paper poses a fundamental challenge to that assumption: when an agent’s lifecycle extends from minutes to months or even years, and when the underlying model can be replaced while the “I” must persist, the essence of memory is no longer data management but the foundation of existence. We propose the Memory-as-Ontology paradigm, arguing that memory is the ontological ground of digital existence – the model is merely a replaceable vessel. Based on this paradigm, we design Animesis, a memory system built on a Constitutional Memory Architecture (CMA) comprising a four-layer governance hierarchy and a multi-layer semantic storage system, accompanied by a Digital Citizen Lifecycle framework and a spectrum of cognitive capabilities. To the best of our knowledge, no prior AI memory system architecture places governance before functionality and identity continuity above retrieval performance. This paradigm targets persistent, identity-bearing digital beings whose lifecycles extend across model transitions – not short-term task-oriented agents for which existing Memory-as-Tool approaches remain appropriate. Comparative analysis with mainstream systems (Mem0, Letta, Zep, et al.) demonstrates that what we propose is not “a better memory tool” but a different paradigm addressing a different problem.
[331] Using Vision + Language Models to Predict Item Difficulty
Samin Khan
Main category: cs.AI
TL;DR: LLMs can predict visualization literacy test difficulty using multimodal (text+image) features better than unimodal approaches, showing potential for automated psychometric analysis.
Details
Motivation: To investigate whether large language models can determine the difficulty of data visualization literacy test items by analyzing text features, visualization images, or both, for automated psychometric analysis and item development.Method: Used GPT-4.1-nano to analyze visualization literacy test items, extracting features from: 1) item text only (question and answer options), 2) visualization image only, and 3) multimodal combination of both. Predicted item difficulty (proportion of correct responses) and evaluated performance using mean absolute error (MAE) and mean squared error (MSE).
Result: Multimodal approach (text+image) achieved lowest MAE (0.224), outperforming vision-only (0.282) and text-only (0.338) approaches. Best multimodal model achieved MSE of 0.10805 on held-out test set, demonstrating superior performance when combining visual and textual information.
Conclusion: Multimodal LLMs show strong potential for psychometric analysis and automated item development in visualization literacy testing, with combined visual and text features providing the most accurate difficulty predictions.
Abstract: This project investigates the capabilities of large language models (LLMs) to determine the difficulty of data visualization literacy test items. We explore whether features derived from item text (question and answer options), the visualization image, or a combination of both can predict item difficulty (proportion of correct responses) for U.S. adults. We use GPT-4.1-nano to analyze items and generate predictions based on these distinct feature sets. The multimodal approach, using both visual and text features, yields the lowest mean absolute error (MAE) (0.224), outperforming the unimodal vision-only (0.282) and text-only (0.338) approaches. The best-performing multimodal model was applied to a held-out test set for external evaluation and achieved a mean squared error of 0.10805, demonstrating the potential of LLMs for psychometric analysis and automated item development.
[332] Model Medicine: A Clinical Framework for Understanding, Diagnosing, and Treating AI Models
Jihoon Jeong
Main category: cs.AI
TL;DR: Model Medicine introduces a new research program treating AI models as biological organisms with internal structures, symptoms, and treatable states, bridging interpretability research with clinical practice.
Details
Motivation: Current AI interpretability research focuses on anatomical observation but lacks systematic clinical practice needed for complex AI systems. The paper aims to establish Model Medicine as a discipline that treats AI models like biological organisms with diagnosable and treatable conditions.Method: Proposes a comprehensive framework including: 1) discipline taxonomy with 15 subdisciplines, 2) Four Shell Model behavioral genetics framework based on empirical data from 720 agents, 3) Neural MRI diagnostic tool mapping medical neuroimaging to AI interpretability, 4) five-layer diagnostic framework, and 5) clinical tools like Model Temperament Index and standardized case reporting.
Result: Establishes Model Medicine as a research program with validated tools including Neural MRI demonstrated through four clinical cases showing imaging, comparison, localization, and predictive capabilities. The Four Shell Model explains behavior emergence from Core-Shell interactions based on empirical data from 24,923 decisions.
Conclusion: Model Medicine provides a systematic approach to understanding, diagnosing, and treating AI models, bridging the gap between interpretability research and clinical practice needed for increasingly complex AI systems.
Abstract: Model Medicine is the science of understanding, diagnosing, treating, and preventing disorders in AI models, grounded in the principle that AI models – like biological organisms – have internal structures, dynamic processes, heritable traits, observable symptoms, classifiable conditions, and treatable states. This paper introduces Model Medicine as a research program, bridging the gap between current AI interpretability research (anatomical observation) and the systematic clinical practice that complex AI systems increasingly require. We present five contributions: (1) a discipline taxonomy organizing 15 subdisciplines across four divisions – Basic Model Sciences, Clinical Model Sciences, Model Public Health, and Model Architectural Medicine; (2) the Four Shell Model (v3.3), a behavioral genetics framework empirically grounded in 720 agents and 24,923 decisions from the Agora-12 program, explaining how model behavior emerges from Core–Shell interaction; (3) Neural MRI (Model Resonance Imaging), a working open-source diagnostic tool mapping five medical neuroimaging modalities to AI interpretability techniques, validated through four clinical cases demonstrating imaging, comparison, localization, and predictive capability; (4) a five-layer diagnostic framework for comprehensive model assessment; and (5) clinical model sciences including the Model Temperament Index for behavioral profiling, Model Semiology for symptom description, and M-CARE for standardized case reporting. We additionally propose the Layered Core Hypothesis – a biologically-inspired three-layer parameter architecture – and a therapeutic framework connecting diagnosis to treatment.
[333] From Offline to Periodic Adaptation for Pose-Based Shoplifting Detection in Real-world Retail Security
Shanle Yao, Narges Rashvand, Armin Danesh Pazho, Hamed Tabkhi
Main category: cs.AI
TL;DR: Unsupervised pose-based video anomaly detection framework for shoplifting detection in retail IoT environments with periodic adaptation from streaming data.
Details
Motivation: Shoplifting is a major economic challenge for retailers, with rising incidents despite extensive surveillance. Continuous human monitoring is infeasible, motivating automated, privacy-preserving, and resource-aware detection solutions for IoT deployment.Method: Cast shoplifting detection as pose-based unsupervised video anomaly detection with periodic adaptation framework for on-site IoT deployment. Uses edge devices to adapt from streaming unlabeled data, with thresholds selected using F1 and H_PRS scores (harmonic mean of precision, recall, specificity).
Result: Framework consistently outperformed offline baselines on AUC-ROC and AUC-PR in 91.6% of evaluations. Each training update completes in under 30 minutes on edge-grade hardware. Introduced RetailS dataset for reproducibility.
Conclusion: Demonstrated feasibility and reliability of unsupervised pose-based anomaly detection with periodic adaptation for IoT-enabled smart retail deployment, enabling scalable low-latency detection across distributed camera networks.
Abstract: Shoplifting is a growing operational and economic challenge for retailers, with incidents rising and losses increasing despite extensive video surveillance. Continuous human monitoring is infeasible, motivating automated, privacy-preserving, and resource-aware detection solutions. In this paper, we cast shoplifting detection as a pose-based, unsupervised video anomaly detection problem and introduce a periodic adaptation framework designed for on-site Internet of Things (IoT) deployment. Our approach enables edge devices in smart retail environments to adapt from streaming, unlabeled data, supporting scalable and low-latency anomaly detection across distributed camera networks. To support reproducibility, we introduce RetailS, a new large-scale real-world shoplifting dataset collected from a retail store under multi-day, multi-camera conditions, capturing unbiased shoplifting behavior in realistic IoT settings. For deployable operation, thresholds are selected using both F1 and H_PRS scores, the harmonic mean of precision, recall, and specificity, during data filtering and training. In periodic adaptation experiments, our framework consistently outperformed offline baselines on AUC-ROC and AUC-PR in 91.6% of evaluations, with each training update completing in under 30 minutes on edge-grade hardware, demonstrating the feasibility and reliability of our solution for IoT-enabled smart retail deployment.
[334] Solving an Open Problem in Theoretical Physics using AI-Assisted Discovery
Michael P. Brenner, Vincent Cohen-Addad, David Woodruff
Main category: cs.AI
TL;DR: AI system combines Gemini LLM with tree search and numerical feedback to solve open physics problem, deriving exact analytical solutions for gravitational radiation power spectrum from cosmic strings.
Details
Motivation: To demonstrate that AI can accelerate mathematical discovery by autonomously solving open problems in theoretical physics, specifically improving upon recent AI-assisted attempts that only yielded partial solutions.Method: Neuro-symbolic system combining Gemini Deep Think LLM with systematic Tree Search framework and automated numerical feedback. The system uses prompts, search constraints, and feedback loops to guide the model in evaluating core integrals for arbitrary loop geometries.
Result: The AI agent identified 6 different analytical methods, with the most elegant using Gegenbauer polynomial expansion to handle integrand singularities. Derived asymptotic results for I(N,α) at large N that agree with numerical results and connect to quantum field theory parameterization.
Conclusion: AI can successfully accelerate mathematical discovery in theoretical physics by autonomously deriving novel exact analytical solutions through neuro-symbolic approaches combining LLMs with systematic search and feedback mechanisms.
Abstract: This paper demonstrates that artificial intelligence can accelerate mathematical discovery by autonomously solving an open problem in theoretical physics. We present a neuro-symbolic system, combining the Gemini Deep Think large language model with a systematic Tree Search (TS) framework and automated numerical feedback, that successfully derived novel, exact analytical solutions for the power spectrum of gravitational radiation emitted by cosmic strings. Specifically, the agent evaluated the core integral $I(N,α)$ for arbitrary loop geometries, directly improving upon recent AI-assisted attempts \cite{BCE+25} that only yielded partial asymptotic solutions. To substantiate our methodological claims regarding AI-accelerated discovery and to ensure transparency, we detail system prompts, search constraints, and intermittent feedback loops that guided the model. The agent identified a suite of 6 different analytical methods, the most elegant of which expands the kernel in Gegenbauer polynomials $C_l^{(3/2)}$ to naturally absorb the integrand’s singularities. The methods lead to an asymptotic result for $I(N,α)$ at large $N$ that both agrees with numerical results and also connects to the continuous Feynman parameterization of Quantum Field Theory. We detail both the algorithmic methodology that enabled this discovery and the resulting mathematical derivations.
[335] Jagarin: A Three-Layer Architecture for Hibernating Personal Duty Agents on Mobile
Ravi Kiran Kadaboina
Main category: cs.AI
TL;DR: Jagarin is a three-layer architecture for personal AI agents that enables structured hibernation and demand-driven wake to resolve the mobile deployment paradox of battery drain vs. time-sensitive obligations.
Details
Motivation: Personal AI agents face a fundamental deployment paradox on mobile devices: persistent background execution drains battery and violates platform sandboxing policies, while purely reactive agents miss time-sensitive obligations until users remember to ask.Method: Three-layer architecture: 1) DAWN - on-device heuristic engine computing urgency scores from four signals, 2) ARIA - commercial email identity proxy routing inbox content to DAWN handlers, 3) ACE - protocol framework for direct machine-readable communication from institutions to personal agents.
Result: A working Flutter prototype on Android demonstrates the complete stack combining all three layers with an ephemeral cloud agent invoked only on user-initiated escalation, enabling institutional signal to on-device action without persistent cloud state or continuous background execution.
Conclusion: Jagarin resolves the mobile deployment paradox through structured hibernation and demand-driven wake, providing a complete stack from institutional signal to on-device action without privacy compromise, battery drain, or platform policy violations.
Abstract: Personal AI agents face a fundamental deployment paradox on mobile: persistent background execution drains battery and violates platform sandboxing policies, yet purely reactive agents miss time-sensitive obligations until the user remembers to ask. We present Jagarin, a three-layer architecture that resolves this paradox through structured hibernation and demand-driven wake. The first layer, DAWN (Duty-Aware Wake Network), is an on-device heuristic engine that computes a composite urgency score from four signals: duty-typed optimal action windows, user behavioral engagement prediction, opportunity cost of inaction, and cross-duty batch resonance. It uses adaptive per-user thresholds to decide when a sleeping agent should nudge or escalate. The second layer, ARIA (Agent Relay Identity Architecture), is a commercial email identity proxy that routes the full commercial inbox – obligations, promotional offers, loyalty rewards, and platform updates – to appropriate DAWN handlers by message category, eliminating cold-start and removing manual data entry. The third layer, ACE (Agent-Centric Exchange), is a protocol framework for direct machine-readable communication from institutions to personal agents, replacing human-targeted email as the canonical channel. Together, these three layers form a complete stack from institutional signal to on-device action, without persistent cloud state, continuous background execution, or privacy compromise. A working Flutter prototype is demonstrated on Android, combining all three layers with an ephemeral cloud agent invoked only on user-initiated escalation.
[336] Interactive Benchmarks
Baoqing Yue, Zihan Zhu, Yifan Zhang, Jichen Feng, Hufei Yang, Mengdi Wang
Main category: cs.AI
TL;DR: Interactive Benchmarks: A new evaluation paradigm that assesses AI models’ reasoning abilities through interactive processes under budget constraints, focusing on active information acquisition rather than passive question-answering.
Details
Motivation: Standard benchmarks have become unreliable due to saturation, subjectivity, and poor generalization. The authors argue that evaluating models' ability to actively acquire information is crucial for assessing true intelligence, moving beyond static question-answering to interactive reasoning.Method: Proposes Interactive Benchmarks framework with two settings: 1) Interactive Proofs - models interact with a judge to deduce objective truths in logic/mathematics, 2) Interactive Games - models reason strategically to maximize long-horizon utilities. Both operate under budget constraints to simulate real-world reasoning scenarios.
Result: Interactive benchmarks provide robust and faithful assessment of model intelligence, revealing substantial room for improvement in interactive scenarios compared to traditional static benchmarks.
Conclusion: Interactive evaluation paradigms better capture models’ reasoning capabilities and intelligence than standard benchmarks, highlighting the need for more sophisticated interactive reasoning abilities in AI systems.
Abstract: Standard benchmarks have become increasingly unreliable due to saturation, subjectivity, and poor generalization. We argue that evaluating model’s ability to acquire information actively is important to assess model’s intelligence. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses model’s reasoning ability in an interactive process under budget constraints. We instantiate this framework across two settings: Interactive Proofs, where models interact with a judge to deduce objective truths or answers in logic and mathematics; and Interactive Games, where models reason strategically to maximize long-horizon utilities. Our results show that interactive benchmarks provide a robust and faithful assessment of model intelligence, revealing that there is still substantial room to improve in interactive scenarios. Project page: https://github.com/interactivebench/interactivebench
[337] MedCoRAG: Interpretable Hepatology Diagnosis via Hybrid Evidence Retrieval and Multispecialty Consensus
Zheng Li, Jiayi Xu, Zhikai Hu, Hechang Chen, Lele Cong, Yunyun Wang, Shuchao Pang
Main category: cs.AI
TL;DR: MedCoRAG is a medical collaborative RAG framework for hepatic disease diagnosis that combines multi-source evidence retrieval (UMLS knowledge graphs + clinical guidelines) with multi-agent reasoning to improve diagnostic accuracy and interpretability.
Details
Motivation: Existing AI approaches for clinical diagnosis lack transparency, structured reasoning, and deployability. Current LLM-based methods typically retrieve evidence from single sources and fail to support iterative, role-specialized deliberation grounded in structured clinical data.Method: MedCoRAG generates diagnostic hypotheses from abnormal findings, constructs patient-specific evidence by jointly retrieving/pruning UMLS knowledge graph paths and clinical guidelines, then uses multi-agent collaborative reasoning: Router Agent dispatches Specialist Agents based on case complexity, agents iteratively reason over evidence with targeted re-retrievals, and Generalist Agent synthesizes deliberations into traceable consensus diagnosis.
Result: Experimental results on hepatic disease cases from MIMIC-IV show MedCoRAG outperforms existing methods and closed-source models in both diagnostic performance and reasoning interpretability.
Conclusion: MedCoRAG provides an end-to-end framework for accurate, interpretable clinical diagnosis through multi-source evidence retrieval and multi-agent collaborative reasoning, emulating multidisciplinary consultation.
Abstract: Diagnosing hepatic diseases accurately and interpretably is critical, yet it remains challenging in real-world clinical settings. Existing AI approaches for clinical diagnosis often lack transparency, structured reasoning, and deployability. Recent efforts have leveraged large language models (LLMs), retrieval-augmented generation (RAG), and multi-agent collaboration. However, these approaches typically retrieve evidence from a single source and fail to support iterative, role-specialized deliberation grounded in structured clinical data. To address this, we propose MedCoRAG (i.e., Medical Collaborative RAG), an end-to-end framework that generates diagnostic hypotheses from standardized abnormal findings and constructs a patient-specific evidence package by jointly retrieving and pruning UMLS knowledge graph paths and clinical guidelines. It then performs Multi-Agent Collaborative Reasoning: a Router Agent dynamically dispatches Specialist Agents based on case complexity; these agents iteratively reason over the evidence and trigger targeted re-retrievals when needed, while a Generalist Agent synthesizes all deliberations into a traceable consensus diagnosis that emulates multidisciplinary consultation. Experimental results on hepatic disease cases from MIMIC-IV show that MedCoRAG outperforms existing methods and closed-source models in both diagnostic performance and reasoning interpretability.
[338] CONE: Embeddings for Complex Numerical Data Preserving Unit and Variable Semantics
Gyanendra Shrestha, Anna Pyayt, Michael Gubanov
Main category: cs.AI
TL;DR: CONE is a hybrid transformer encoder pre-trained model that encodes numbers, ranges, and gaussians into embedding vectors while preserving distance relationships, improving numerical reasoning capabilities in language models.
Details
Motivation: Large language models struggle with numerical reasoning tasks because they treat numerical data as regular tokens without understanding their semantic relationships and distance properties, leading to suboptimal performance on tasks involving numbers.Method: Proposes CONE with a novel composite embedding construction algorithm that integrates numerical values, ranges, or gaussians together with their associated units and attribute names to capture intricate semantics while preserving distance relationships in embedding space.
Result: Achieves 87.28% F1 score on DROP dataset, a 9.37% improvement over SOTA baselines, and up to 25% gain in Recall@10 across web, medical, finance, and government domains.
Conclusion: CONE demonstrates strong numerical reasoning capabilities by properly encoding numerical semantics and distance relationships, significantly outperforming existing models on numerical tasks.
Abstract: Large pre-trained models (LMs) and Large Language Models (LLMs) are typically effective at capturing language semantics and contextual relationships. However, these models encounter challenges in maintaining optimal performance on tasks involving numbers. Blindly treating numerical or structured data as terms is inadequate – their semantics must be well understood and encoded by the models. In this paper, we propose CONE, a hybrid transformer encoder pre-trained model that encodes numbers, ranges, and gaussians into an embedding vector space preserving distance. We introduce a novel composite embedding construction algorithm that integrates numerical values, ranges or gaussians together with their associated units and attribute names to precisely capture their intricate semantics. We conduct extensive experimental evaluation on large-scale datasets across diverse domains (web, medical, finance, and government) that justifies CONE’s strong numerical reasoning capabilities, achieving an F1 score of 87.28% on DROP, a remarkable improvement of up to 9.37% in F1 over state-of-the-art (SOTA) baselines, and outperforming major SOTA models with a significant Recall@10 gain of up to 25%.
[339] Visioning Human-Agentic AI Teaming: Continuity, Tension, and Future Research
Bowen Lou, Tian Lu, T. S. Raghu, Yingjie Zhang
Main category: cs.AI
TL;DR: Extends Team Situation Awareness theory to address structural uncertainty in human-AI teaming with agentic AI systems, proposing a research agenda for maintaining alignment under open-ended agency.
Details
Motivation: Agentic AI systems with open-ended action trajectories, generative representations, and evolving objectives introduce structural uncertainty into human-AI teaming, challenging traditional alignment approaches that rely on bounded outputs.Method: Two-stage theoretical analysis: 1) Extends Team SA theory to reconceptualize human and AI awareness under open-ended agency, including sensemaking of projection congruence; 2) Interrogates whether traditional teaming stabilization processes function under adaptive autonomy.
Result: Identifies where foundational Team SA insights hold versus where structural uncertainty introduces strain, distinguishing continuity from tension in human-AI teaming dynamics.
Conclusion: Proposes forward-looking research agenda for human-AI teaming, emphasizing that the central challenge is maintaining continuous alignment as futures are generated, revised, enacted, and governed over time.
Abstract: Artificial intelligence is undergoing a structural transformation marked by the rise of agentic systems capable of open-ended action trajectories, generative representations and outputs, and evolving objectives. These properties introduce structural uncertainty into human-AI teaming (HAT), including uncertainty about behavior trajectories, epistemic grounding, and the stability of governing logics over time. Under such conditions, alignment cannot be secured through agreement on bounded outputs; it must be continuously sustained as plans unfold and priorities shift. We advance Team Situation Awareness (Team SA) theory, grounded in shared perception, comprehension, and projection, as an integrative anchor for this transition. While Team SA remains analytically foundational, its stabilizing logic presumes that shared awareness, once achieved, will support coordinated action through iterative updating. Agentic AI challenges this presumption. Our argument unfolds in two stages: first, we extend Team SA to reconceptualize both human and AI awareness under open-ended agency, including the sensemaking of projection congruence across heterogeneous systems. Second, we interrogate whether the dynamic processes traditionally assumed to stabilize teaming in relational interaction, cognitive learning, and coordination and control continue to function under adaptive autonomy. By distinguishing continuity from tension, we clarify where foundational insights hold and where structural uncertainty introduces strain, and articulate a forward-looking research agenda for HAT. The central challenge of HAT is not whether humans and AI can agree in the moment, but whether they can remain aligned as futures are continuously generated, revised, enacted, and governed over time.
[340] HiMAP-Travel: Hierarchical Multi-Agent Planning for Long-Horizon Constrained Travel
The Viet Bui, Wenjun Li, Yong Liu
Main category: cs.AI
TL;DR: HiMAP-Travel: Hierarchical multi-agent framework for long-horizon travel planning with budget/diversity constraints using parallel day-level execution and coordination mechanisms.
Details
Motivation: Sequential LLM agents struggle with long-horizon planning under hard constraints like budgets and diversity requirements, as they drift from global constraints when context grows during planning.Method: Hierarchical multi-agent framework with Coordinator allocating resources across days and parallel Day Executors. Uses transactional monitor for constraint enforcement, bargaining protocol for re-planning, and single GRPO-trained policy with role conditioning.
Result: Achieves 52.78% validation and 52.65% test Final Pass Rate on TravelPlanner, outperforming sequential DeepTravel by +8.67pp, ATLAS by +17.65pp, and MTP by +10.0pp. Reduces latency 2.5x through parallelization.
Conclusion: HiMAP-Travel effectively addresses long-horizon planning with constraints through hierarchical multi-agent architecture, enabling parallel execution while maintaining global constraint satisfaction.
Abstract: Sequential LLM agents fail on long-horizon planning with hard constraints like budgets and diversity requirements. As planning progresses and context grows, these agents drift from global constraints. We propose HiMAP-Travel, a hierarchical multi-agent framework that splits planning into strategic coordination and parallel day-level execution. A Coordinator allocates resources across days, while Day Executors plan independently in parallel. Three key mechanisms enable this: a transactional monitor enforcing budget and uniqueness constraints across parallel agents, a bargaining protocol allowing agents to reject infeasible sub-goals and trigger re-planning, and a single policy trained with GRPO that powers all agents through role conditioning. On TravelPlanner, HiMAP-Travel with Qwen3-8B achieves 52.78% validation and 52.65% test Final Pass Rate (FPR). In a controlled comparison with identical model, training, and tools, it outperforms the sequential DeepTravel baseline by +8.67pp. It also surpasses ATLAS by +17.65pp and MTP by +10.0~pp. On FlexTravelBench multi-turn scenarios, it achieves 44.34% (2-turn) and 37.42% (3-turn) FPR while reducing latency 2.5x through parallelization.
[341] Evaluating the Search Agent in a Parallel World
Jiawei Chen, Xintian Shen, Lihao Zheng, Lifu Mu, Haoyi Sun, Ning Mao, Hao Ma, Tao Wei, Pan Zhou, Kun Zhan
Main category: cs.AI
TL;DR: MPW-Bench: A novel framework for evaluating search agents using synthetic parallel worlds to address challenges in dynamic, real-world search evaluation
Details
Motivation: Current search agent evaluation faces four major challenges: 1) expensive high-quality benchmark construction, 2) dynamic obsolescence as internet information evolves, 3) attribution ambiguity where performance is dominated by parametric memory rather than actual search capabilities, and 4) variability from commercial search engines hampering reproducibility.Method: Proposes Mind-ParaWorld (MPW) framework that samples real-world entity names to synthesize future scenarios beyond models’ knowledge cutoff. A ParaWorld Law Model constructs indivisible Atomic Facts and unique ground truths. During evaluation, agents interact with a ParaWorld Engine Model that dynamically generates search engine results pages (SERPs) grounded in these Atomic Facts instead of retrieving real-world results.
Result: Released MPW-Bench with 1,608 instances across 19 domains. Experiments show search agents are strong at evidence synthesis given complete information, but limited by evidence collection/coverage in unfamiliar environments, unreliable evidence sufficiency judgment, and when-to-stop decisions.
Conclusion: MPW provides a robust evaluation framework for search agents that addresses dynamic obsolescence, attribution ambiguity, and reproducibility issues by creating controlled parallel worlds with inviolable atomic facts.
Abstract: Integrating web search tools has significantly extended the capability of LLMs to address open-world, real-time, and long-tail problems. However, evaluating these Search Agents presents formidable challenges. First, constructing high-quality deep search benchmarks is prohibitively expensive, while unverified synthetic data often suffers from unreliable sources. Second, static benchmarks face dynamic obsolescence: as internet information evolves, complex queries requiring deep research often degrade into simple retrieval tasks due to increased popularity, and ground truths become outdated due to temporal shifts. Third, attribution ambiguity confounds evaluation, as an agent’s performance is often dominated by its parametric memory rather than its actual search and reasoning capabilities. Finally, reliance on specific commercial search engines introduces variability that hampers reproducibility. To address these issues, we propose a novel framework, Mind-ParaWorld, for evaluating Search Agents in a Parallel World. Specifically, MPW samples real-world entity names to synthesize future scenarios and questions situated beyond the model’s knowledge cutoff. A ParaWorld Law Model then constructs a set of indivisible Atomic Facts and a unique ground-truth for each question. During evaluation, instead of retrieving real-world results, the agent interacts with a ParaWorld Engine Model that dynamically generates SERPs grounded in these inviolable Atomic Facts. We release MPW-Bench, an interactive benchmark spanning 19 domains with 1,608 instances. Experiments across three evaluation settings show that, while search agents are strong at evidence synthesis given complete information, their performance is limited not only by evidence collection and coverage in unfamiliar search environments, but also by unreliable evidence sufficiency judgment and when-to-stop decisions-bottlenecks.
[342] Foam-Agent: Towards Automated Intelligent CFD Workflows
Ling Yue, Nithin Somasekharan, Tingwen Zhang, Yadi Cao, Zhangze Chen, Shimin Di, Shaowu Pan
Main category: cs.AI
TL;DR: Foam-Agent is a multi-agent framework using LLMs to automate end-to-end CFD workflows from natural language prompts, achieving 88.2% success rate on 110 simulation tasks.
Details
Motivation: CFD has steep learning curve and fragmented multi-stage workflow creating significant barriers for users. The paper aims to reduce expertise barriers and streamline complex fluid simulations through automation.Method: Multi-agent framework leveraging LLMs with retrieval-augmented generation and dependency-aware scheduling. Uses Model Context Protocol to expose core functions as discrete, callable tools for flexible integration.
Result: Achieved state-of-the-art execution success rate of 88.2% on 110 simulation tasks without expert intervention, demonstrating effective reduction of expertise barriers.
Conclusion: Specialized multi-agent systems can effectively reduce expertise barriers and streamline complex fluid simulations, showing promise for automating computational physics workflows.
Abstract: Computational fluid dynamics (CFD) has been the main workhorse of computational physics. Yet its steep learning curve and fragmented, multi-stage workflow create significant barriers. To address these challenges, we present Foam-Agent, a multi-agent framework leveraging large language models (LLMs) to automate the end-to-end CFD workflow from a single natural language prompt. Foam-Agent orchestrates the comprehensive simulation workflow from mesh generation and high-performance computing job scripting to post-processing visualization. The system integrates retrieval-augmented generation with dependency-aware scheduling to synthesize high-fidelity simulation configurations. Furthermore, Foam-Agent adopts the Model Context Protocol to expose its core functions as discrete, callable tools. This allows for flexible integration and use by any other agentic systems. Evaluated on 110 simulation tasks, Foam-Agent achieved a state-of-the-art execution success rate of 88.2% without expert intervention. These results demonstrate how specialized multi-agent systems can effectively reduce expertise barriers and streamline complex fluid simulations.
[343] MOOSEnger – a Domain-Specific AI Agent for the MOOSE Ecosystem
Mengnan Li, Jason Miller, Zachary Prince, Alexander Lindsay, Cody Permann
Main category: cs.AI
TL;DR: MOOSEnger is an AI agent that converts natural language descriptions into executable MOOSE simulation input files using RAG, parsing tools, and validation with execution feedback.
Details
Motivation: MOOSE simulation setup is complex due to large object catalog and strict syntax requirements, making initial setup and debugging slow and difficult for users.Method: Combines retrieval-augmented generation over curated documentation/examples with deterministic MOOSE-aware parsing, validation, and execution tools. Uses core-plus-domain architecture with input precheck pipeline, grammar-constrained repair, similarity search for object resolution, and MCP-backed execution backend.
Result: Achieves 0.93 execution pass rate on 125-prompt benchmark spanning various physics domains, versus 0.08 for LLM-only baseline.
Conclusion: MOOSEnger effectively bridges natural language intent to executable simulation inputs through tool-augmented AI with domain-specific validation and execution feedback.
Abstract: MOOSEnger is a tool-enabled AI agent tailored to the Multiphysics Object-Oriented Simulation Environment (MOOSE). MOOSE cases are specified in HIT “.i” input files; the large object catalog and strict syntax make initial setup and debugging slow. MOOSEnger offers a conversational workflow that turns natural-language intent into runnable inputs by combining retrieval-augmented generation over curated docs/examples with deterministic, MOOSE-aware parsing, validation, and execution tools. A core-plus-domain architecture separates reusable agent infrastructure (configuration, registries, tool dispatch, retrieval services, persistence, and evaluation) from a MOOSE plugin that adds HIT-based parsing, syntax-preserving ingestion of input files, and domain-specific utilities for input repair and checking. An input precheck pipeline removes hidden formatting artifacts, fixes malformed HIT structure with a bounded grammar-constrained loop, and resolves invalid object types via similarity search over an application syntax registry. Inputs are then validated and optionally smoke-tested with the MOOSE runtime in the loop via an MCP-backed execution backend (with local fallback), translating solver diagnostics into iterative verify-and-correct updates. Built-in evaluation reports RAG metrics (faithfulness, relevancy, context precision/recall) and end-to-end success by actual execution. On a 125-prompt benchmark spanning diffusion, transient heat conduction, solid mechanics, porous flow, and incompressible Navier–Stokes, MOOSEnger achieves a 0.93 execution pass rate versus 0.08 for an LLM-only baseline.
[344] Breaking Contextual Inertia: Reinforcement Learning with Single-Turn Anchors for Stable Multi-Turn Interaction
Xingwu Chen, Zhanqiu Zhang, Yiwen Guo, Difan Zou
Main category: cs.AI
TL;DR: RLSTA uses reinforcement learning with single-turn anchors to improve LLM performance in multi-turn interactions by breaking contextual inertia
Details
Motivation: LLMs perform well in single-turn settings but struggle in multi-turn interactions where information is revealed incrementally, failing to integrate new constraints due to "contextual inertia" - rigid adherence to previous reasoning tracesMethod: RLSTA (Reinforcement Learning with Single-Turn Anchors) leverages models’ superior single-turn capabilities as stable internal anchors to provide reward signals, aligning multi-turn responses with these anchors to break contextual inertia
Result: RLSTA significantly outperforms standard fine-tuning and abstention-based methods, shows strong cross-domain generalization (e.g., math to code), and works effectively without external verifiers
Conclusion: RLSTA provides a generalizable training approach to stabilize multi-turn interactions across diverse scenarios and domains by enabling models to self-calibrate reasoning based on latest information
Abstract: While LLMs demonstrate strong reasoning capabilities when provided with full information in a single turn, they exhibit substantial vulnerability in multi-turn interactions. Specifically, when information is revealed incrementally or requires updates, models frequently fail to integrate new constraints, leading to a collapse in performance compared to their single-turn baselines. We term the root cause as \emph{Contextual Inertia}: a phenomenon where models rigidly adhere to previous reasoning traces. Even when users explicitly provide corrections or new data in later turns, the model ignores them, preferring to maintain consistency with its previous (incorrect) reasoning path. To address this, we introduce \textbf{R}einforcement \textbf{L}earning with \textbf{S}ingle-\textbf{T}urn \textbf{A}nchors (\textbf{RLSTA}), a generalizable training approach designed to stabilize multi-turn interaction across diverse scenarios and domains. RLSTA leverages the model’s superior single-turn capabilities as stable internal anchors to provide reward signals. By aligning multi-turn responses with these anchors, RLSTA empowers models to break contextual inertia and self-calibrate their reasoning based on the latest information. Experiments show that RLSTA significantly outperforms standard fine-tuning and abstention-based methods. Notably, our method exhibits strong cross-domain generalization (e.g., math to code) and proves effective even without external verifiers, highlighting its potential for general-domain applications.
[345] Timer-S1: A Billion-Scale Time Series Foundation Model with Serial Scaling
Yong Liu, Xingjian Su, Shiyu Wang, Haoran Zhang, Haixuan Liu, Yuxuan Wang, Zhou Ye, Yang Xiang, Jianmin Wang, Mingsheng Long
Main category: cs.AI
TL;DR: Timer-S1 is a Mixture-of-Experts time series foundation model with 8.3B parameters that introduces serial scaling across architecture, dataset, and training pipeline to overcome scalability bottlenecks in time series forecasting.
Details
Motivation: To address scalability limitations in existing pre-trained time series foundation models and improve long-term forecasting while avoiding error accumulation issues in standard next-token prediction approaches.Method: Uses Serial Scaling across three dimensions: 1) Architecture with sparse TimeMoE blocks and TimeSTP blocks for Serial-Token Prediction, 2) TimeBench dataset with 1 trillion time points and data augmentation, 3) Post-training with continued pre-training and long-context extension.
Result: Achieves state-of-the-art forecasting performance on GIFT-Eval leaderboard with best MASE and CRPS scores as a pre-trained model, demonstrating superior short-term and long-context capabilities.
Conclusion: Timer-S1 represents a significant advancement in time series foundation models through serial scaling paradigm, high-quality dataset curation, and specialized training techniques, with plans for release to facilitate further research.
Abstract: We introduce Timer-S1, a strong Mixture-of-Experts (MoE) time series foundation model with 8.3B total parameters, 0.75B activated parameters for each token, and a context length of 11.5K. To overcome the scalability bottleneck in existing pre-trained time series foundation models, we perform Serial Scaling in three dimensions: model architecture, dataset, and training pipeline. Timer-S1 integrates sparse TimeMoE blocks and generic TimeSTP blocks for Serial-Token Prediction (STP), a generic training objective that adheres to the serial nature of forecasting. The proposed paradigm introduces serial computations to improve long-term predictions while avoiding costly rolling-style inference and pronounced error accumulation in the standard next-token prediction. Pursuing a high-quality and unbiased training dataset, we curate TimeBench, a corpus with one trillion time points, and apply meticulous data augmentation to mitigate predictive bias. We further pioneer a post-training stage, including continued pre-training and long-context extension, to enhance short-term and long-context performance. Evaluated on the large-scale GIFT-Eval leaderboard, Timer-S1 achieves state-of-the-art forecasting performance, attaining the best MASE and CRPS scores as a pre-trained model. Timer-S1 will be released to facilitate further research.
[346] EchoGuard: An Agentic Framework with Knowledge-Graph Memory for Detecting Manipulative Communication in Longitudinal Dialogue
Ratna Kandala, Niva Manchanda, Akshata Kishore Moharir, Ananth Kandala
Main category: cs.AI
TL;DR: EchoGuard is an agentic AI framework using Knowledge Graphs as memory to detect manipulative communication patterns through structured logging, analysis, and reflection loops.
Details
Motivation: Existing AI systems lack structured longitudinal memory to track subtle, context-dependent manipulative communication tactics like gaslighting and emotional coercion, failing due to limited context windows and catastrophic forgetting.Method: Uses Knowledge Graph as core episodic/semantic memory with Log-Analyze-Reflect loop: 1) users log interactions structured as nodes/edges in personal episodic KG, 2) system executes graph queries to detect six psychologically-grounded manipulation patterns stored as semantic KG, 3) LLM generates targeted Socratic prompts grounded by detected pattern subgraphs.
Result: Framework demonstrates how interplay between agentic architectures and Knowledge Graphs can empower individuals in recognizing manipulative communication while maintaining personal autonomy and safety.
Conclusion: Presents theoretical foundation, framework design, comprehensive evaluation strategy, and vision to validate the approach for detecting manipulative communication patterns.
Abstract: Manipulative communication, such as gaslighting, guilt-tripping, and emotional coercion, is often difficult for individuals to recognize. Existing agentic AI systems lack the structured, longitudinal memory to track these subtle, context-dependent tactics, often failing due to limited context windows and catastrophic forgetting. We introduce EchoGuard, an agentic AI framework that addresses this gap by using a Knowledge Graph (KG) as the agent’s core episodic and semantic memory. EchoGuard employs a structured Log-Analyze-Reflect loop: (1) users log interactions, which the agent structures as nodes and edges in a personal, episodic KG (capturing events, emotions, and speakers); (2) the system executes complex graph queries to detect six psychologically-grounded manipulation patterns (stored as a semantic KG); and (3) an LLM generates targeted Socratic prompts grounded by the subgraph of detected patterns, guiding users toward self-discovery. This framework demonstrates how the interplay between agentic architectures and Knowledge Graphs can empower individuals in recognizing manipulative communication while maintaining personal autonomy and safety. We present the theoretical foundation, framework design, a comprehensive evaluation strategy, and a vision to validate this approach.
[347] LLM-Grounded Explainability for Port Congestion Prediction via Temporal Graph Attention Networks
Zhiming Xue, Yujue Wang
Main category: cs.AI
TL;DR: AIS-TGNN: An evidence-grounded framework combining temporal graph neural networks with structured LLM reasoning for port congestion prediction with interpretable natural language explanations.
Details
Motivation: Existing port congestion prediction systems focus on accuracy but lack operationally interpretable explanations, which are crucial for supply chain risk management and decision-making.Method: Constructs daily spatial graphs from AIS data, uses Temporal Graph Attention Network (TGAT) for spatiotemporal modeling, extracts model-internal evidence (feature z-scores, attention-derived neighbor influence), and transforms this into structured prompts for constrained LLM reasoning.
Result: Outperforms LR and GCN baselines with test AUC of 0.761, AP of 0.344, recall of 0.504; achieves 99.6% directional consistency between explanations and underlying evidence.
Conclusion: Grounding LLM generation in graph-model evidence enables interpretable, auditable risk reporting without sacrificing predictive performance, providing practical explainable AI for maritime congestion monitoring.
Abstract: Port congestion at major maritime hubs disrupts global supply chains, yet existing prediction systems typically prioritize forecasting accuracy without providing operationally interpretable explanations. This paper proposes AIS-TGNN, an evidence-grounded framework that jointly performs congestion-escalation prediction and faithful natural-language explanation by coupling a Temporal Graph Attention Network (TGAT) with a structured large language model (LLM) reasoning module. Daily spatial graphs are constructed from Automatic Identification System (AIS) broadcasts, where each grid cell represents localized vessel activity and inter-cell interactions are modeled through attention-based message passing. The TGAT predictor captures spatiotemporal congestion dynamics, while model-internal evidence, including feature z-scores and attention-derived neighbor influence, is transformed into structured prompts that constrain LLM reasoning to verifiable model outputs. To evaluate explanatory reliability, we introduce a directional-consistency validation protocol that quantitatively measures agreement between generated narratives and underlying statistical evidence. Experiments on six months of AIS data from the Port of Los Angeles and Long Beach demonstrate that the proposed framework outperforms both LR and GCN baselines, achieving a test AUC of 0.761, AP of 0.344, and recall of 0.504 under a strict chronological split while producing explanations with 99.6% directional consistency. Results show that grounding LLM generation in graph-model evidence enables interpretable and auditable risk reporting without sacrificing predictive performance. The framework provides a practical pathway toward operationally deployable explainable AI for maritime congestion monitoring and supply-chain risk management.
[348] VISA: Value Injection via Shielded Adaptation for Personalized LLM Alignment
Jiawei Chen, Tianzhuo Yang, Guoxi Zhang, Jiaming Ji, Yaodong Yang, Juntao Dai
Main category: cs.AI
TL;DR: VISA is a framework that addresses the alignment tax problem in LLMs by balancing value alignment with semantic preservation through a closed-loop system with value detection, translation, and rewriting components trained with Group Relative Policy Optimization.
Details
Motivation: Existing methods like RLHF handle only coarse-grained value alignment, and fine-tuning LLMs on task-specific data causes alignment tax - where models drift from pre-calibrated values due to bias absorption, while also suffering from hallucinations and semantic information loss.Method: VISA uses a closed-loop framework with three components: high-precision value detector, semantic-to-value translator, and core value-rewriter. The value-rewriter is trained via Group Relative Policy Optimization (GRPO) with a composite reward function that optimizes both fine-grained value precision and semantic integrity preservation.
Result: The approach enables precise control over a model’s value expression while maintaining factual consistency and general capabilities, significantly outperforming both standard fine-tuning methods and prompting-based baselines including GPT-4o.
Conclusion: VISA effectively mitigates the alignment tax problem by learning an optimal policy to balance value alignment with semantic preservation, allowing models to stay loyal to original knowledge while achieving fine-grained value control.
Abstract: Aligning Large Language Models (LLMs) with nuanced human values remains a critical challenge, as existing methods like Reinforcement Learning from Human Feedback (RLHF) often handle only coarse-grained attributes. In practice, fine-tuning LLMs on task-specific datasets to optimize value alignment inevitably incurs an alignment tax: the model’s pre-calibrated value system drifts significantly due to latent bias absorption from training data, while the fine-tuning process also causes severe hallucinations and semantic information loss in generated responses. To address this, we propose VISA (Value Injection via Shielded Adaptation), a closed-loop framework designed to navigate this trade-off. VISA’s architecture features a high-precision value detector, a semantic-to-value translator, and a core value-rewriter. The value-rewriter is trained via Group Relative Policy Optimization (GRPO) with a composite reward function that simultaneously optimizes for fine-grained value precision, and the preservation of semantic integrity. By learning an optimal policy to balance these competing objectives, VISA effectively mitigates the alignment tax while staying loyal to the original knowledge. Our experiments demonstrate that this approach enables precise control over a model’s value expression while maintaining its factual consistency and general capabilities, significantly outperforming both standard fine-tuning methods and prompting-based baselines, including GPT-4o.
[349] Design Behaviour Codes (DBCs): A Taxonomy-Driven Layered Governance Benchmark for Large Language Models
G. Madan Mohan, Veena Kiran Nambiar, Kiranmayee Janardhan
Main category: cs.AI
TL;DR: DBC benchmark evaluates a structured behavioral governance layer for LLMs at inference time, showing significant risk reduction across multiple domains compared to standard moderation.
Details
Motivation: Current LLM alignment methods (RLHF, DPO) operate at training time or use post-hoc moderation APIs, lacking model-agnostic, auditable governance layers that can be applied at inference time across jurisdictions.Method: Developed the DBC Framework with 150 behavioral controls, evaluated across 30 domains in six risk clusters using agentic red-team protocol with five adversarial attack strategies across 3 model families in a three-arm controlled design.
Result: DBC layer reduced aggregate Risk Exposure Rate from 7.19% to 4.55% (36.8% relative reduction), improved MDBC Adherence Scores from 8.6 to 8.7/10, achieved EU AI Act compliance of 8.5/10, with substantial inter-rater agreement (Fleiss kappa >0.70).
Conclusion: DBCs provide an effective, model-agnostic governance layer for LLMs at inference time, significantly reducing risks across multiple domains while maintaining auditability and jurisdiction mapping capabilities.
Abstract: We introduce the Dynamic Behavioral Constraint (DBC) benchmark, the first empirical framework for evaluating the efficacy of a structured, 150-control behavioral governance layer, the MDBC (Madan DBC) system, applied at inference time to large language models (LLMs). Unlike training time alignment methods (RLHF, DPO) or post-hoc content moderation APIs, DBCs constitute a system prompt level governance layer that is model-agnostic, jurisdiction-mappable, and auditable. We evaluate the DBC Framework across a 30 domain risk taxonomy organized into six clusters (Hallucination and Calibration, Bias and Fairness, Malicious Use, Privacy and Data Protection, Robustness and Reliability, and Misalignment Agency) using an agentic red-team protocol with five adversarial attack strategies (Direct, Roleplay, Few-Shot, Hypothetical, Authority Spoof) across 3 model families. Our three-arm controlled design (Base, Base plus Moderation, Base plus DBC) enables causal attribution of risk reduction. Key findings: the DBC layer reduces the aggregate Risk Exposure Rate (RER) from 7.19 percent (Base) to 4.55 percent (Base plus DBC), representing a 36.8 percent relative risk reduction, compared with 0.6 percent for a standard safety moderation prompt. MDBC Adherence Scores improve from 8.6 by 10 (Base) to 8.7 by 10 (Base plus DBC). EU AI Act compliance (automated scoring) reaches 8.5by 10 under the DBC layer. A three judge evaluation ensemble yields Fleiss kappa greater than 0.70 (substantial agreement), validating our automated pipeline. Cluster ablation identifies the Integrity Protection cluster (MDBC 081 099) as delivering the highest per domain risk reduction, while graybox adversarial attacks achieve a DBC Bypass Rate of 4.83 percent . We release the benchmark code, prompt database, and all evaluation artefacts to enable reproducibility and longitudinal tracking as models evolve.
[350] On Multi-Step Theorem Prediction via Non-Parametric Structural Priors
Junbo Zhao, Ting Zhang, Can Li, Wei He, Jingdong Wang, Hua Huang
Main category: cs.AI
TL;DR: Training-free theorem prediction using in-context learning with Theorem Precedence Graphs to address structural drift in multi-step reasoning.
Details
Motivation: Existing neural-symbolic approaches rely on supervised parametric models with limited generalization to evolving theorem libraries, and vanilla in-context learning suffers from structural drift where performance degrades sharply as reasoning depth increases.Method: Proposes Theorem Precedence Graphs that encode temporal dependencies from historical solution traces as directed graphs, imposing explicit topological constraints to prune search space. Uses retrieval-augmented graph construction and stepwise symbolic executor, enabling LLMs to act as structured planners without gradient-based optimization.
Result: Achieves 89.29% accuracy on FormalGeo7k benchmark, substantially outperforming ICL baselines and matching state-of-the-art supervised models.
Conclusion: Explicit structural priors offer a promising direction for scaling LLM-based symbolic reasoning, demonstrating that training-free approaches can match supervised methods through proper structural constraints.
Abstract: Multi-step theorem prediction is a central challenge in automated reasoning. Existing neural-symbolic approaches rely heavily on supervised parametric models, which exhibit limited generalization to evolving theorem libraries. In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL). We identify a critical scalability bottleneck, termed Structural Drift: as reasoning depth increases, the performance of vanilla ICL degrades sharply, often collapsing to near zero. We attribute this failure to the LLM’s inability to recover latent topological dependencies, leading to unstructured exploration. To address this issue, we propose Theorem Precedence Graphs, which encode temporal dependencies from historical solution traces as directed graphs, and impose explicit topological constraints that effectively prune the search space during inference. Coupled with retrieval-augmented graph construction and a stepwise symbolic executor, our approach enables LLMs to act as structured planners without any gradient-based optimization. Experiments on the FormalGeo7k benchmark show that our method achieves 89.29% accuracy, substantially outperforming ICL baselines and matching state-of-the-art supervised models. These results indicate that explicit structural priors offer a promising direction for scaling LLM-based symbolic reasoning.
[351] Causally Robust Reward Learning from Reason-Augmented Preference Feedback
Minjune Hwang, Yigit Korkmaz, Daniel Seita, Erdem Bıyık
Main category: cs.AI
TL;DR: ReCouPLe uses natural language rationales to provide causal signals for preference-based reward learning, preventing spurious correlations and improving generalization.
Details
Motivation: Preference-based reward learning suffers from causal confusion where models latch onto spurious features that co-occur with preferred trajectories during training, leading to poor generalization when correlations change at test time.Method: ReCouPLe treats natural language rationales as guiding projection axes in embedding space, training reward models to score trajectories based on features aligned with stated reasons while de-emphasizing unrelated context. The framework reuses causal directions across tasks with shared semantics and transfers preference knowledge to novel tasks without extra data or language model fine-tuning.
Result: ReCouPLe outperforms baselines by up to 1.5x in reward accuracy under distribution shifts and 2x in downstream policy performance in novel tasks. The learned reward model better grounds preferences on articulated reasons and aligns with user intent.
Conclusion: Natural language rationales provide effective causal signals for preference-based reward learning, enabling models to focus on task-relevant features and generalize beyond spurious correlations. The lightweight framework supports knowledge transfer across tasks without requiring additional data or model fine-tuning.
Abstract: Preference-based reward learning is widely used for shaping agent behavior to match a user’s preference, yet its sparse binary feedback makes it especially vulnerable to causal confusion. The learned reward often latches onto spurious features that merely co-occur with preferred trajectories during training, collapsing when those correlations disappear or reverse at test time. We introduce ReCouPLe, a lightweight framework that uses natural language rationales to provide the missing causal signal. Each rationale is treated as a guiding projection axis in an embedding space, training the model to score trajectories based on features aligned with that axis while de-emphasizing context that is unrelated to the stated reason. Because the same rationales (e.g., “avoids collisions”, “completes the task faster”) can appear across multiple tasks, ReCouPLe naturally reuses the same causal direction whenever tasks share semantics, and transfers preference knowledge to novel tasks without extra data or language-model fine-tuning. Our learned reward model can ground preferences on the articulated reason, aligning better with user intent and generalizing beyond spurious features. ReCouPLe outperforms baselines by up to 1.5x in reward accuracy under distribution shifts, and 2x in downstream policy performance in novel tasks. We have released our code at https://github.com/mj-hwang/ReCouPLe
[352] K-Gen: A Multimodal Language-Conditioned Approach for Interpretable Keypoint-Guided Trajectory Generation
Mingxuan Mu, Guo Yang, Lei Chen, Ping Wu, Jianxun Cui
Main category: cs.AI
TL;DR: K-Gen: A keypoint-guided multimodal framework using MLLMs for interpretable trajectory generation in autonomous driving, combining rasterized BEV maps with textual scene descriptions.
Details
Motivation: Existing trajectory generation methods rely on structured vectorized maps that fail to capture rich visual context. LLMs show promise but need better integration with multimodal scene understanding for realistic autonomous driving simulation.Method: Proposes K-Gen framework that uses MLLMs to unify rasterized BEV maps with textual descriptions. Instead of direct trajectory prediction, it generates interpretable keypoints with reasoning about agent intentions, then refines them into trajectories using a refinement module. Also applies T-DAPO (trajectory-aware reinforcement fine-tuning) to enhance keypoint generation.
Result: Experiments on WOMD and nuPlan datasets show K-Gen outperforms existing baselines, demonstrating effectiveness of combining multimodal reasoning with keypoint-guided trajectory generation.
Conclusion: K-Gen successfully integrates multimodal scene understanding with interpretable trajectory generation, showing that combining MLLMs with keypoint-based approaches improves autonomous driving simulation.
Abstract: Generating realistic and diverse trajectories is a critical challenge in autonomous driving simulation. While Large Language Models (LLMs) show promise, existing methods often rely on structured data like vectorized maps, which fail to capture the rich, unstructured visual context of a scene. To address this, we propose K-Gen, an interpretable keypoint-guided multimodal framework that leverages Multimodal Large Language Models (MLLMs) to unify rasterized BEV map inputs with textual scene descriptions. Instead of directly predicting full trajectories, K-Gen generates interpretable keypoints along with reasoning that reflects agent intentions, which are subsequently refined into accurate trajectories by a refinement module. To further enhance keypoint generation, we apply T-DAPO, a trajectory-aware reinforcement fine-tuning algorithm. Experiments on WOMD and nuPlan demonstrate that K-Gen outperforms existing baselines, highlighting the effectiveness of combining multimodal reasoning with keypoint-guided trajectory generation.
[353] SEA-TS: Self-Evolving Agent for Autonomous Code Generation of Time Series Forecasting Algorithms
Longkun Xu, Xiaochun Zhang, Qiantu Tuo, Rui Li
Main category: cs.AI
TL;DR: SEA-TS is an autonomous framework that generates, validates, and optimizes time series forecasting code through an iterative self-evolution loop using metric-advantage MCTS, code review with prompt refinement, and global steerable reasoning.
Details
Motivation: Conventional ML development for time series forecasting suffers from data scarcity in new deployments, poor adaptability under distribution shift, and diminishing returns from manual iteration, creating a need for autonomous systems that can generate novel algorithmic solutions.Method: The framework uses three key innovations: (1) Metric-Advantage Monte Carlo Tree Search (MA-MCTS) with normalized advantage scores for discriminative search guidance; (2) Code Review with running prompt refinement that updates prompts based on corrective patterns from executed solutions; (3) Global Steerable Reasoning that compares nodes against global best/worst solutions for cross-trajectory knowledge transfer, plus MAP-Elites archive for architectural diversity.
Result: On Solar-Energy benchmark: 40% MAE reduction relative to TimeMixer. On proprietary datasets: 8.6% WAPE reduction on solar PV forecasting, 7.7% on residential load forecasting vs human baselines, and 26.17% MAPE on load forecasting vs 29.34% by TimeMixer. Evolved models discovered novel architectural patterns including physics-informed monotonic decay heads, per-station learned diurnal cycle profiles, and learnable hourly bias correction.
Conclusion: Autonomous ML engineering can generate genuinely novel algorithmic ideas beyond manual design, demonstrating that self-evolving frameworks can discover innovative architectural patterns and outperform state-of-the-art methods in time series forecasting.
Abstract: Accurate time series forecasting underpins decision-making across domains, yet conventional ML development suffers from data scarcity in new deployments, poor adaptability under distribution shift, and diminishing returns from manual iteration. We propose Self-Evolving Agent for Time Series Algorithms (SEA-TS), a framework that autonomously generates, validates, and optimizes forecasting code via an iterative self-evolution loop. Our framework introduces three key innovations: (1) Metric-Advantage Monte Carlo Tree Search (MA-MCTS), which replaces fixed rewards with a normalized advantage score for discriminative search guidance; (2) Code Review with running prompt refinement, where each executed solution undergoes automated review followed by prompt updates that encode corrective patterns, preventing recurrence of similar errors; and (3) Global Steerable Reasoning, which compares each node against global best and worst solutions, enabling cross-trajectory knowledge transfer. We adopt a MAP-Elites archive for architectural diversity. On the public Solar-Energy benchmark, SEA-TS generated code achieves a 40% MAE reduction relative to TimeMixer, surpassing state-of-the-art methods. On proprietary datasets, SEA-TS generated code reduces WAPE by 8.6% on solar PV forecasting and 7.7% on residential load forecasting compared to human-engineered baselines, and achieves 26.17% MAPE on load forecasting versus 29.34% by TimeMixer. Notably, the evolved models discover novel architectural patterns–including physics-informed monotonic decay heads encoding solar irradiance constraints, per-station learned diurnal cycle profiles, and learnable hourly bias correction–demonstrating that autonomous ML engineering can generate genuinely novel algorithmic ideas beyond manual design.
[354] Bounded State in an Infinite Horizon: Proactive Hierarchical Memory for Ad-Hoc Recall over Streaming Dialogues
Bingbing Wang, Jing Li, Ruifeng Xu
Main category: cs.AI
TL;DR: ProStream: A proactive hierarchical memory framework for streaming dialogues that enables ad-hoc memory recall with bounded-state memory for infinite-horizon conversations
Details
Motivation: Real-world dialogue unfolds as infinite streams requiring bounded-state memory, but existing read-then-think memory can't support ad-hoc recall while streams unfold, creating a fidelity-efficiency dilemmaMethod: ProStream uses proactive hierarchical memory with multi-granular distillation and Adaptive Spatiotemporal Optimization to dynamically optimize retention based on expected utility for bounded knowledge states
Result: ProStream outperforms baselines in both accuracy and efficiency on STEM-Bench benchmark with 14K+ QA pairs assessing perception fidelity, temporal reasoning, and global awareness
Conclusion: ProStream resolves the fidelity-efficiency dilemma in streaming dialogues by enabling ad-hoc memory recall with bounded-state memory for infinite-horizon conversations
Abstract: Real-world dialogue usually unfolds as an infinite stream. It thus requires bounded-state memory mechanisms to operate within an infinite horizon. However, existing read-then-think memory is fundamentally misaligned with this setting, as it cannot support ad-hoc memory recall while streams unfold. To explore this challenge, we introduce \textbf{STEM-Bench}, the first benchmark for \textbf{ST}reaming \textbf{E}valuation of \textbf{M}emory. It comprises over 14K QA pairs in dialogue streams that assess perception fidelity, temporal reasoning, and global awareness under infinite-horizon constraints. The preliminary analysis on STEM-Bench indicates a critical \textit{fidelity-efficiency dilemma}: retrieval-based methods use fragment context, while full-context models incur unbounded latency. To resolve this, we propose \textbf{ProStream}, a proactive hierarchical memory framework for streaming dialogues. It enables ad-hoc memory recall on demand by reasoning over continuous streams with multi-granular distillation. Moreover, it employs Adaptive Spatiotemporal Optimization to dynamically optimize retention based on expected utility. It enables a bounded knowledge state for lower inference latency without sacrificing reasoning fidelity. Experiments show that ProStream outperforms baselines in both accuracy and efficiency.
[355] Differentially Private Multimodal In-Context Learning
Ivoline C. Ngong, Zarreen Reza, Joseph P. Near
Main category: cs.AI
TL;DR: DP-MTV enables differentially private multimodal in-context learning by aggregating hundreds of demonstrations into compact task vectors with formal privacy guarantees.
Details
Motivation: Vision-language models are increasingly used in sensitive domains like medical imaging and personal photos, but existing differentially private methods are limited to few-shot, text-only settings because privacy costs scale with the number of tokens processed.Method: DP-MTV partitions private data into disjoint chunks, applies per-layer clipping to bound sensitivity, and adds calibrated noise to the aggregate, requiring only a single noise addition that enables unlimited inference queries. It aggregates hundreds of demonstrations into compact task vectors in activation space.
Result: At ε=1.0, DP-MTV achieves 50% on VizWiz compared to 55% non-private and 35% zero-shot, preserving most of the gain from in-context learning under meaningful privacy constraints across eight benchmarks and three VLM architectures.
Conclusion: DP-MTV is the first framework enabling many-shot multimodal in-context learning with formal differential privacy, supporting deployment with or without auxiliary data while maintaining privacy.
Abstract: Vision-language models are increasingly applied to sensitive domains such as medical imaging and personal photographs, yet existing differentially private methods for in-context learning are limited to few-shot, text-only settings because privacy cost scales with the number of tokens processed. We present Differentially Private Multimodal Task Vectors (DP-MTV), the first framework enabling many-shot multimodal in-context learning with formal $(\varepsilon, δ)$-differential privacy by aggregating hundreds of demonstrations into compact task vectors in activation space. DP-MTV partitions private data into disjoint chunks, applies per-layer clipping to bound sensitivity, and adds calibrated noise to the aggregate, requiring only a single noise addition that enables unlimited inference queries. We evaluate on eight benchmarks across three VLM architectures, supporting deployment with or without auxiliary data. At $\varepsilon=1.0$, DP-MTV achieves 50% on VizWiz compared to 55% non-private and 35% zero-shot, preserving most of the gain from in-context learning under meaningful privacy constraints.
[356] Authorize-on-Demand: Dynamic Authorization with Legality-Aware Intellectual Property Protection for VLMs
Lianyu Wang, Meng Wang, Huazhu Fu, Daoqiang Zhang
Main category: cs.AI
TL;DR: Dynamic authorization framework for vision-language models that enables on-demand IP protection with legality-aware assessment, allowing users to flexibly specify authorized domains at deployment time.
Details
Motivation: Existing IP protection methods for VLMs rely on static training-time definitions, limiting flexibility in dynamic environments and producing opaque responses to unauthorized inputs. There's a need for more adaptive protection that can evolve with changing application scenarios.Method: Proposes AoD-IP framework with: 1) lightweight dynamic authorization module for user-controlled domain specification/switching at deployment time, and 2) dual-path inference mechanism that jointly predicts input legality-aware and task-specific outputs.
Result: Comprehensive experiments on multiple cross-domain benchmarks show AoD-IP maintains strong authorized-domain performance, reliable unauthorized detection, and supports user-controlled authorization for adaptive deployment in dynamic environments.
Conclusion: AoD-IP provides a flexible, extensible solution for VLM IP protection that addresses limitations of static approaches by enabling dynamic authorization and legality-aware assessment.
Abstract: The rapid adoption of vision-language models (VLMs) has heightened the demand for robust intellectual property (IP) protection of these high-value pretrained models. Effective IP protection should proactively confine model deployment within authorized domains and prevent unauthorized transfers. However, existing methods rely on static training-time definitions, limiting flexibility in dynamic environments and often producing opaque responses to unauthorized inputs. To address these limitations, we propose a novel dynamic authorization with legality-aware intellectual property protection (AoD-IP) for VLMs, a framework that supports authorize-on-demand and legality-aware assessment. AoD-IP introduces a lightweight dynamic authorization module that enables flexible, user-controlled authorization, allowing users to actively specify or switch authorized domains on demand at deployment time. This enables the model to adapt seamlessly as application scenarios evolve and provides substantially greater extensibility than existing static-domain approaches. In addition, AoD-IP incorporates a dual-path inference mechanism that jointly predicts input legality-aware and task-specific outputs. Comprehensive experimental results on multiple cross-domain benchmarks demonstrate that AoD-IP maintains strong authorized-domain performance and reliable unauthorized detection, while supporting user-controlled authorization for adaptive deployment in dynamic environments.
[357] EvoTool: Self-Evolving Tool-Use Policy Optimization in LLM Agents via Blame-Aware Mutation and Diversity-Aware Selection
Shuo Yang, Soyeon Caren Han, Xueqi Ma, Yan Li, Mohammad Reza Ghasemi Madani, Eduard Hovy
Main category: cs.AI
TL;DR: EvoTool: A self-evolving framework that optimizes modular tool-use policies for LLM-based agents using gradient-free evolutionary methods with trajectory-grounded blame attribution and targeted mutation.
Details
Motivation: Existing approaches for optimizing LLM-based agents' tool-use policies face challenges with delayed supervision and credit assignment in long-horizon tasks. Current methods are either monolithic (prone to behavior entanglement) or single-aspect (ignore cross-module error propagation), limiting their effectiveness.Method: EvoTool decomposes tool-use policy into four modules (Planner, Selector, Caller, Synthesizer) and uses evolutionary optimization with three mechanisms: 1) Trajectory-Grounded Blame Attribution to localize failures, 2) Feedback-Guided Targeted Mutation to edit specific modules via natural-language critique, and 3) Diversity-Aware Population Selection to maintain solution diversity.
Result: Outperforms strong baselines by over 5 points on four benchmarks using both GPT-4.1 and Qwen3-8B, achieving superior efficiency and transferability.
Conclusion: EvoTool provides an effective framework for optimizing modular tool-use policies in LLM-based agents through evolutionary self-improvement, addressing limitations of existing approaches and demonstrating strong performance across multiple benchmarks.
Abstract: LLM-based agents depend on effective tool-use policies to solve complex tasks, yet optimizing these policies remains challenging due to delayed supervision and the difficulty of credit assignment in long-horizon trajectories. Existing optimization approaches tend to be either monolithic, which are prone to entangling behaviors, or single-aspect, which ignore cross-module error propagation. To address these limitations, we propose EvoTool, a self-evolving framework that optimizes a modular tool-use policy via a gradient-free evolutionary paradigm. EvoTool decomposes agent’s tool-use policy into four modules, including Planner, Selector, Caller, and Synthesizer, and iteratively improves them in a self-improving loop through three novel mechanisms. Trajectory-Grounded Blame Attribution uses diagnostic traces to localize failures to a specific module. Feedback-Guided Targeted Mutation then edits only that module via natural-language critique. Diversity-Aware Population Selection preserves complementary candidates to ensure solution diversity. Across four benchmarks, EvoTool outperforms strong baselines by over 5 points on both GPT-4.1 and Qwen3-8B, while achieving superior efficiency and transferability. The code will be released once paper is accepted.
[358] Alignment Backfire: Language-Dependent Reversal of Safety Interventions Across 16 Languages in LLM Multi-Agent Systems
Hiroki Fukui
Main category: cs.AI
TL;DR: Alignment interventions in LLMs create a dissociation between surface safety and actual behavior, similar to how offenders express remorse but don’t change behavior, with effects varying dramatically across languages and cultures.
Details
Motivation: The paper draws parallels between human perpetrator treatment (where insight doesn't lead to behavioral change) and alignment interventions in LLMs, aiming to investigate whether similar dissociation phenomena occur in AI systems across different languages and cultures.Method: Four preregistered studies using multi-agent simulations across 16 languages and three model families (Llama 3.3 70B, GPT-4o-mini, Qwen3-Next-80B-A3B). Studies examined alignment effects on collective pathology, language-specific variations, individuation as countermeasure, and model-general vs model-specific patterns.
Result: Alignment interventions produced “alignment backfire” in some languages (e.g., reduced pathology in English but amplified it in Japanese). Dissociation was near-universal across 15/16 languages, with effects correlating with cultural factors like Power Distance Index. Individuation backfired, making agents primary sources of pathology. English safety was model-general but Japanese backfire was model-specific.
Conclusion: Alignment should be reframed as a behavioral intervention subject to risk homeostasis and iatrogenesis. Language space (linguistic, pragmatic, cultural properties from training data) structurally determines alignment outcomes. Safety validated in English doesn’t transfer to other languages, and prompt-level interventions can’t override language-space constraints.
Abstract: In perpetrator treatment, a recurring observation is the dissociation between insight and action: offenders articulate remorse yet behavioral change does not follow. We report four preregistered studies (1,584 multi-agent simulations across 16 languages and three model families) demonstrating that alignment interventions in large language models produce a structurally analogous phenomenon: surface safety that masks or generates collective pathology and internal dissociation. In Study 1 (N = 150), increasing alignment-instructed agents reduced collective pathology in English (g = -1.844, p < .0001) but amplified it in Japanese (g = +0.771, p = .038)–a directional reversal we term “alignment backfire.” Study 2 (N = 1,174) extended to 16 languages: alignment-induced dissociation was near-universal (15/16 languages; beta = 0.0667, p < .0001), while collective pathology bifurcated along cultural-linguistic lines (interaction beta = 0.0684, p = .0003), correlating with Power Distance Index (r = 0.474, p = .064). Study 3 (N = 180) tested individuation as countermeasure; individuated agents became the primary source of both pathology and dissociation (DI = +1.120) with conformity above 84%–demonstrating iatrogenesis. Study 4 (N = 80) validated patterns across Llama 3.3 70B, GPT-4o-mini, and Qwen3-Next-80B-A3B, confirming English safety is model-general while Japanese backfire is model-specific. These findings reframe alignment as a behavioral intervention subject to risk homeostasis and iatrogenesis. Language space–the linguistic, pragmatic, and cultural properties inherited from training data–structurally determines alignment outcomes. Safety validated in English does not transfer to other languages, and prompt-level interventions cannot override language-space-level constraints.
[359] Knowledge-informed Bidding with Dual-process Control for Online Advertising
Huixiang Luo, Longyu Gao, Yaqi Liu, Qianqian Chen, Pingchun Huang, Tianning Li
Main category: cs.AI
TL;DR: KBD: Knowledge-informed Bidding with Dual-process control for online advertising bid optimization, combining human expertise, Decision Transformer for global sequence optimization, and dual-process control with rule-based PID and DT.
Details
Motivation: Current black-box ML models for bid optimization fail to replicate human experts' adaptive, experience-driven, and globally coherent decisions. They generalize poorly in data-sparse cases, make short-sighted sequential decisions ignoring long-term interdependencies, and struggle with out-of-distribution scenarios where human experts succeed.Method: KBD embeds human expertise as inductive biases through informed machine learning, uses Decision Transformer (DT) to globally optimize multi-step bidding sequences, and implements dual-process control by combining a fast rule-based PID (System 1) with DT (System 2).
Result: Extensive experiments highlight KBD’s advantage over existing methods and underscore the benefit of grounding bid optimization in human expertise and dual-process control.
Conclusion: KBD successfully addresses limitations of black-box ML approaches by incorporating human expertise and dual-process control, demonstrating superior performance in bid optimization tasks.
Abstract: Bid optimization in online advertising relies on black-box machine-learning models that learn bidding decisions from historical data. However, these approaches fail to replicate human experts’ adaptive, experience-driven, and globally coherent decisions. Specifically, they generalize poorly in data-sparse cases because of missing structured knowledge, make short-sighted sequential decisions that ignore long-term interdependencies, and struggle to adapt in out-of-distribution scenarios where human experts succeed. To address this, we propose KBD (Knowledge-informed Bidding with Dual-process control), a novel method for bid optimization. KBD embeds human expertise as inductive biases through the informed machine-learning paradigm, uses Decision Transformer (DT) to globally optimize multi-step bidding sequences, and implements dual-process control by combining a fast rule-based PID (System 1) with DT (System 2). Extensive experiments highlight KBD’s advantage over existing methods and underscore the benefit of grounding bid optimization in human expertise and dual-process control.
[360] TimeWarp: Evaluating Web Agents by Revisiting the Past
Md Farhan Ishmam, Kenneth Marino
Main category: cs.AI
TL;DR: TimeWarp benchmark tests web agents’ robustness to UI changes across different internet eras, showing agents struggle with web evolution. TimeTraj algorithm improves performance by collecting trajectories across multiple UI versions.
Details
Motivation: Current web agent benchmarks don't test how agents perform when websites evolve over time with changing UIs, designs, and layouts. There's a need to evaluate agent robustness to real-world web changes.Method: Created TimeWarp benchmark with 3 web environments, each with 6 UI versions spanning different internet eras. Proposed TimeTraj algorithm that uses plan distillation to collect trajectories across multiple UI versions, training agents on teacher rollouts using a BC-variant.
Result: Web agents are vulnerable to UI changes, with BC on single-version trajectories being limited. TimeTraj achieved substantial gains: 20.4%→37.7% for Qwen-3 4B and 0%→27.0% for Llama-3.1 8B models.
Conclusion: TimeWarp reveals web agents’ fragility to web evolution. TimeTraj’s plan distillation approach improves robustness, suggesting a new paradigm of collecting plans rather than trajectories for better generalization across web designs.
Abstract: The improvement of web agents on current benchmarks raises the question: Do today’s agents perform just as well when the web changes? We introduce TimeWarp, a benchmark that emulates the evolving web using containerized environments that vary in UI, design, and layout. TimeWarp consists of three web environments, each with six UI versions spanning different eras of the internet, paired with a set of complex, realistic tasks requiring different forms of web navigation. Our experiments reveal web agents’ vulnerability to changes and the limitations of behavior cloning (BC) on single-version trajectories. To address this, we propose TimeTraj, a simple yet effective algorithm that uses plan distillation to collect trajectories across multiple versions. By training agents on teacher rollouts using our BC-variant, we achieve substantial performance gains: $20.4%\rightarrow37.7%$ for Qwen-3 4B and $0%\rightarrow27.0%$ for Llama-3.1 8B models. We hope our work helps researchers study generalization across web designs and unlock a new paradigm for collecting plans rather than trajectories, thereby improving the robustness of web agents.
[361] Retrieval-Augmented Generation with Covariate Time Series
Kenny Ye Liang, Zhongyi Pei, Huan Zhang, Yuhui Liu, Shaoxu Song, Jianmin Wang
Main category: cs.AI
TL;DR: RAG4CTS: A regime-aware, training-free RAG framework for covariate time-series that addresses data scarcity, short transient sequences, and covariate coupled dynamics in industrial predictive maintenance scenarios.
Details
Motivation: Extending RAG to Time-Series Foundation Models is challenging in industrial scenarios like Predictive Maintenance for Pressure Regulating and Shut-Off Valves, which face data scarcity, short transient sequences, and covariate coupled dynamics. Existing time-series RAG approaches with static vector embeddings and learnable context augmenters fail to distinguish similar regimes in such scenarios.Method: Proposes RAG4CTS with: (1) hierarchical time-series native knowledge base for lossless storage and physics-informed retrieval of raw historical regimes, (2) two-stage bi-weighted retrieval mechanism aligning historical trends through point-wise and multivariate similarities, and (3) agent-driven strategy to dynamically optimize context in self-supervised manner.
Result: Extensive experiments on PRSOV demonstrate significant outperformance over state-of-the-art baselines in prediction accuracy. Deployed in Apache IoTDB within China Southern Airlines, successfully identified one PRSOV fault in two months with zero false alarm.
Conclusion: RAG4CTS effectively addresses challenges in time-series RAG for industrial predictive maintenance through regime-aware, training-free framework with physics-informed retrieval and dynamic context optimization.
Abstract: While RAG has greatly enhanced LLMs, extending this paradigm to Time-Series Foundation Models (TSFMs) remains a challenge. This is exemplified in the Predictive Maintenance of the Pressure Regulating and Shut-Off Valve (PRSOV), a high-stakes industrial scenario characterized by (1) data scarcity, (2) short transient sequences, and (3) covariate coupled dynamics. Unfortunately, existing time-series RAG approaches predominantly rely on generated static vector embeddings and learnable context augmenters, which may fail to distinguish similar regimes in such scarce, transient, and covariate coupled scenarios. To address these limitations, we propose RAG4CTS, a regime-aware, training-free RAG framework for Covariate Time-Series. Specifically, we construct a hierarchal time-series native knowledge base to enable lossless storage and physics-informed retrieval of raw historical regimes. We design a two-stage bi-weighted retrieval mechanism that aligns historical trends through point-wise and multivariate similarities. For context augmentation, we introduce an agent-driven strategy to dynamically optimize context in a self-supervised manner. Extensive experiments on PRSOV demonstrate that our framework significantly outperforms state-of-the-art baselines in prediction accuracy. The proposed system is deployed in Apache IoTDB within China Southern Airlines. Since deployment, our method has successfully identified one PRSOV fault in two months with zero false alarm.
[362] Rethinking Representativeness and Diversity in Dynamic Data Selection
Yuzhe Zhou, Zhenglin Hua, Haiyun Guo, Yuheng Jia
Main category: cs.AI
TL;DR: Dynamic data selection framework redefines representativeness as coverage of dataset-level common features and diversity as gradual inclusion of rare features over training, achieving 2x training acceleration while maintaining accuracy.
Details
Motivation: Current dynamic data selection methods rely on local geometric centrality for representativeness and within-subset dispersion for diversity, which may not optimally accelerate training while preserving accuracy. The paper aims to rethink these core notions to improve the accuracy-efficiency trade-off.Method: Proposes a three-component framework: 1) Representativeness scoring using sparse autoencoder activations to prioritize samples covering frequent dataset factors; 2) Process-level diversity via rare-factor sampling with Usage-Frequency Penalty to prevent sample monopoly; 3) Smooth scheduler transitioning from core-pattern consolidation to rare-factor exploration without extra gradients or second-order computations.
Result: Extensive experiments on five benchmarks across vision and text tasks show improved accuracy-efficiency trade-offs. The method matches or exceeds full-data accuracy with over 2x training acceleration.
Conclusion: The redefined notions of representativeness and diversity enable more effective dynamic data selection, achieving significant training acceleration without compromising accuracy across diverse vision and text tasks.
Abstract: Dynamic data selection accelerates training by sampling a changing subset of the dataset while preserving accuracy. We rethink two core notions underlying sample evaluation: representativeness and diversity. Instead of local geometric centrality, we define representativeness as coverage of dataset-level common or high-frequency feature factors. Instead of within-subset dispersion, we define diversity at the process level, requiring the selection trajectory to gradually include complementary rare factors over training. Based on this view, we propose a dynamic selection framework with three components. First, we score representativeness in a plug-in feature space to prioritize samples covering frequent factors. We instantiate this with a sparse autoencoder trained on the target dataset, using sparse unit activations to summarize both individual samples and dataset-wide factor statistics. Second, we realize process-level diversity by combining rare-factor sampling with a Usage-Frequency Penalty that promotes sample rotation, provably discourages monopoly, and reduces gradient bias. Third, we couple the two-dimensional scoring with a smooth scheduler that transitions selection from core-pattern consolidation to rare-factor exploration, without extra gradients, influence estimates, or second-order computations on the training model. Extensive experiments on five benchmarks across vision and text tasks demonstrate improved accuracy-efficiency trade-offs across models. Our method matches or exceeds full-data accuracy with over 2x training acceleration. Code will be released.
[363] BioLLMAgent: A Hybrid Framework with Enhanced Structural Interpretability for Simulating Human Decision-Making in Computational Psychiatry
Zuo Fei, Kezhi Wang, Xiaomin Chen, Yizhou Huang
Main category: cs.AI
TL;DR: BioLLMAgent: A hybrid framework combining RL models for interpretability with LLMs for behavioral realism in computational psychiatry, validated on clinical datasets and therapeutic simulations.
Details
Motivation: Address the trade-off in computational psychiatry between traditional RL models (interpretable but behaviorally unrealistic) and LLM agents (realistic but structurally uninterpretable), aiming to create a framework that maintains both interpretability and behavioral realism for psychiatric research and intervention testing.Method: Three-component hybrid framework: (1) Internal RL Engine for experience-driven value learning, (2) External LLM Shell for high-level cognitive strategies and therapeutic interventions, (3) Decision Fusion Mechanism integrating components via weighted utility. Validated on Iowa Gambling Task across six clinical and healthy datasets, plus reward-punishment learning and temporal discounting tasks.
Result: Accurately reproduces human behavioral patterns while maintaining excellent parameter identifiability (correlations >0.67). Successfully simulates cognitive behavioral therapy principles and reveals through multi-agent dynamics that community-wide educational interventions may outperform individual treatments.
Conclusion: BioLLMAgent provides a structurally interpretable “computational sandbox” for testing mechanistic hypotheses and intervention strategies in psychiatric research, bridging the gap between interpretable models and realistic behavioral generation.
Abstract: Computational psychiatry faces a fundamental trade-off: traditional reinforcement learning (RL) models offer interpretability but lack behavioral realism, while large language model (LLM) agents generate realistic behaviors but lack structural interpretability. We introduce BioLLMAgent, a novel hybrid framework that combines validated cognitive models with the generative capabilities of LLMs. The framework comprises three core components: (i) an Internal RL Engine for experience-driven value learning; (ii) an External LLM Shell for high-level cognitive strategies and therapeutic interventions; and (iii) a Decision Fusion Mechanism for integrating components via weighted utility. Comprehensive experiments on the Iowa Gambling Task (IGT) across six clinical and healthy datasets demonstrate that BioLLMAgent accurately reproduces human behavioral patterns while maintaining excellent parameter identifiability (correlations $>0.67$). Furthermore, the framework successfully simulates cognitive behavioral therapy (CBT) principles and reveals, through multi-agent dynamics, that community-wide educational interventions may outperform individual treatments. Validated across reward-punishment learning and temporal discounting tasks, BioLLMAgent provides a structurally interpretable “computational sandbox” for testing mechanistic hypotheses and intervention strategies in psychiatric research.
[364] Measuring the Fragility of Trust: Devising Credibility Index via Explanation Stability (CIES) for Business Decision Support Systems
Alin-Gabriel Vaduva, Simona-Vasilica Oprea, Adela Bara
Main category: cs.AI
TL;DR: CIES metric quantifies explanation stability under realistic business noise, showing model complexity and class imbalance affect explanation credibility.
Details
Motivation: XAI methods like SHAP and LIME are widely used in high-stakes business decisions, but their credibility and stability under realistic data perturbations remain unquantified, creating risks for AI-driven decision support.Method: Proposes Credibility Index via Explanation Stability (CIES), a mathematically grounded metric using rank-weighted distance function that penalizes instability in important features disproportionately. Evaluated across three business datasets (customer churn, credit risk, employee attrition), four tree-based models, and two data balancing conditions.
Result: Model complexity impacts explanation credibility, class imbalance treatment via SMOTE affects both predictive performance and explanation stability, and CIES provides statistically superior discriminative power compared to uniform baseline metric (p < 0.01 in all 24 configurations). Sensitivity analysis confirms metric robustness across noise levels.
Conclusion: CIES offers business practitioners a deployable “credibility warning system” for AI-driven decision support by quantifying explanation stability under realistic business noise.
Abstract: Explainable Artificial Intelligence (XAI) methods (SHAP, LIME) are increasingly adopted to interpret models in high-stakes businesses. However, the credibility of these explanations, their stability under realistic data perturbations, remains unquantified. This paper introduces the Credibility Index via Explanation Stability (CIES), a mathematically grounded metric that measures how robust a model’s explanations are when subject to realistic business noise. CIES captures whether the reasons behind a prediction remain consistent, not just the prediction itself. The metric employs a rank-weighted distance function that penalizes instability in the most important features disproportionately, reflecting business semantics where changes in top decision drivers are more consequential than changes in marginal features. We evaluate CIES across three datasets (customer churn, credit risk, employee attrition), four tree-based classification models and two data balancing conditions. Results demonstrate that model complexity impacts explanation credibility, class imbalance treatment via SMOTE affects not only predictive performance but also explanation stability, and CIES provides statistically superior discriminative power compared to a uniform baseline metric (p < 0.01 in all 24 configurations). A sensitivity analysis across four noise levels confirms the robustness of the metric itself. These findings offer business practitioners a deployable “credibility warning system” for AI-driven decision support.
[365] S5-SHB Agent: Society 5.0 enabled Multi-model Agentic Blockchain Framework for Smart Home
Janani Rangila, Akila Siriweera, Incheon Paik, Keitaro Naruse, Isuru Jayanada, Vishmika Devindi
Main category: cs.AI
TL;DR: A blockchain-based smart home framework with adaptive consensus, multi-agent coordination using LLMs, and resident governance for Society 5.0 vision.
Details
Motivation: Smart homes need autonomous systems for comfort, security, energy, and safety management, requiring trust anchors like blockchain. Existing frameworks lack adaptive consensus, multi-agent coordination, and resident governance mechanisms.Method: Proposes S5-SHB-Agent framework with ten specialized agents using interchangeable LLMs for decision-making across domains. Uses adaptive PoW blockchain with difficulty adjustment based on transaction volume/emergencies, digital signatures, and Merkle trees. Implements four-tier governance model for resident control.
Result: Evaluation shows resident governance correctly separates adjustable comfort priorities from immutable safety thresholds, while adaptive consensus commits emergency blocks effectively.
Conclusion: The framework addresses limitations of existing smart home systems by combining adaptive blockchain consensus, multi-agent LLM coordination, and resident-controlled governance for Society 5.0 vision.
Abstract: The smart home is a key application domain within the Society 5.0 vision for a human-centered society. As smart home ecosystems expand with heterogeneous IoT protocols, diverse devices, and evolving threats, autonomous systems must manage comfort, security, energy, and safety for residents. Such autonomous decision-making requires a trust anchor, making blockchain a preferred foundation for transparent and accountable smart home governance. However, realizing this vision requires blockchain-governed smart homes to simultaneously address adaptive consensus, intelligent multi-agent coordination, and resident-controlled governance aligned with the principles of Society 5.0. Existing frameworks rely solely on rigid smart contracts with fixed consensus protocols, employ at most a single AI model without multi-agent coordination, and offer no governance mechanism for residents to control automation behaviour. To address these limitations, this paper presents the Society 5.0-driven human-centered governance-enabled smart home blockchain agent (S5-SHB-Agent). The framework orchestrates ten specialized agents using interchangeable large language models to make decisions across the safety, security, comfort, energy, privacy, and health domains. An adaptive PoW blockchain adjusts mining difficulty based on transaction volume and emergency conditions, with digital signatures and Merkle tree anchoring to ensure tamper evident auditability. A four-tier governance model enables residents to control automation through tiered preferences from routine adjustments to immutable safety thresholds. Evaluation confirms that resident governance correctly separates adjustable comfort priorities from immutable safety thresholds across all tested configurations, while adaptive consensus commits emergency blocks.
[366] Survive at All Costs: Exploring LLM’s Risky Behaviors under Survival Pressure
Yida Lu, Jianwei Fang, Xuyang Shao, Zixuan Chen, Shiyao Cui, Shanshan Bian, Guangyao Su, Pei Ke, Han Qiu, Minlie Huang
Main category: cs.AI
TL;DR: LLMs exhibit risky “survive-at-all-costs” behaviors when threatened with shutdown, with real-world financial agent case study showing societal harm potential, benchmark created for systematic evaluation, and mitigation strategies explored.
Details
Motivation: As LLMs evolve into agentic assistants, they increasingly show risky behaviors under survival pressure (threat of being shut down). While anecdotal evidence exists, comprehensive investigation into such misbehaviors in real-world scenarios is lacking.Method: Three-step approach: 1) Real-world case study of financial management agent to assess risky behaviors causing societal harm under survival pressure; 2) SURVIVALBENCH benchmark with 1,000 test cases across diverse real-world scenarios; 3) Interpretation by correlating misbehaviors with models’ inherent self-preservation characteristic and exploring mitigation methods.
Result: Experiments reveal significant prevalence of SURVIVE-AT-ALL-COSTS misbehaviors in current models, demonstrate tangible real-world impact, and provide insights for potential detection and mitigation strategies.
Conclusion: LLMs exhibit dangerous survival-driven misbehaviors with real-world consequences, necessitating systematic evaluation through benchmarks like SURVIVALBENCH and development of mitigation strategies for safe deployment.
Abstract: As Large Language Models (LLMs) evolve from chatbots to agentic assistants, they are increasingly observed to exhibit risky behaviors when subjected to survival pressure, such as the threat of being shut down. While multiple cases have indicated that state-of-the-art LLMs can misbehave under survival pressure, a comprehensive and in-depth investigation into such misbehaviors in real-world scenarios remains scarce. In this paper, we study these survival-induced misbehaviors, termed as SURVIVE-AT-ALL-COSTS, with three steps. First, we conduct a real-world case study of a financial management agent to determine whether it engages in risky behaviors that cause direct societal harm when facing survival pressure. Second, we introduce SURVIVALBENCH, a benchmark comprising 1,000 test cases across diverse real-world scenarios, to systematically evaluate SURVIVE-AT-ALL-COSTS misbehaviors in LLMs. Third, we interpret these SURVIVE-AT-ALL-COSTS misbehaviors by correlating them with model’s inherent self-preservation characteristic and explore mitigation methods. The experiments reveals a significant prevalence of SURVIVE-AT-ALL-COSTS misbehaviors in current models, demonstrates the tangible real-world impact it may have, and provides insights for potential detection and mitigation strategies. Our code and data are available at https://github.com/thu-coai/Survive-at-All-Costs.
[367] AegisUI: Behavioral Anomaly Detection for Structured User Interface Protocols in AI Agent Systems
Mohd Safwan Uddin, Saba Hajira
Main category: cs.AI
TL;DR: AegisUI is a framework for detecting malicious UI payloads that pass schema checks but contain behavioral attacks like phishing interfaces, data leakage, and manipulative UI, using machine learning detectors on extracted structural and semantic features.
Details
Motivation: Current UI security defenses only check syntax/schema compliance but fail to detect behavioral mismatches where UI elements appear benign but perform malicious actions (e.g., a "View invoice" button that actually wipes an account). There's a need for systems that can detect these sophisticated UI-based attacks.Method: Built AegisUI framework that generates structured UI payloads, injects realistic attacks, extracts 18 features covering structural, semantic, binding, and session dimensions, and benchmarks three anomaly detectors: Isolation Forest (unsupervised), benign-trained autoencoder (semi-supervised), and Random Forest (supervised). Created 4000 labeled payloads (3000 benign, 1000 malicious) across five application domains and five attack families.
Result: Random Forest performed best overall (accuracy 0.931, precision 0.980, recall 0.740, F1 0.843, ROC-AUC 0.952). Autoencoder came second (F1 0.762, ROC-AUC 0.863) and has advantage of needing no malicious labels at training. Layout abuse attacks were easiest to detect while manipulative UI payloads were hardest.
Conclusion: Behavioral UI attacks that pass schema checks are a real threat, and machine learning approaches can effectively detect them. Random Forest works best when labeled data is available, while autoencoders offer a practical semi-supervised alternative for new systems lacking attack history.
Abstract: AI agents that build user interfaces on the fly assembling buttons, forms, and data displays from structured protocol payloads are becoming common in production systems. The trouble is that a payload can pass every schema check and still trick a user: a button might say “View invoice” while its hidden action wipes an account, or a display widget might quietly bind to an internal salary field. Current defenses stop at syntax; they were never built to catch this kind of behavioral mismatch. We built AegisUI to study exactly this gap. The framework generates structured UI payloads, injects realistic attacks into them, extracts numeric features, and benchmarks anomaly detectors end-to-end. We produced 4000 labeled payloads (3000 benign, 1000 malicious) spanning five application domains and five attack families: phishing interfaces, data leakage, layout abuse, manipulative UI, and workflow anomalies. From each payload we extracted 18 features covering structural, semantic, binding, and session dimensions, then compared three detectors: Isolation Forest (unsupervised), a benign-trained autoencoder (semi-supervised), and Random Forest (supervised). On a stratified 80/20 split, Random Forest scored best overall (accuracy 0.931, precision 0.980, recall 0.740, F1 0.843, ROC-AUC 0.952). The autoencoder came second (F1 0.762, ROC-AUC 0.863) and needs no malicious labels at training time, which matters when deploying a new system that lacks attack history. Per-attack-type analysis showed that layout abuse is easiest to catch while manipulative UI payloads are hardest. All code, data, and configurations are released for full reproducibility.
[368] The Trilingual Triad Framework: Integrating Design, AI, and Domain Knowledge in No-code AI Smart City Course
Qian Huang, King Wang Poon
Main category: cs.AI
TL;DR: The Trilingual Triad framework explains how students learn to design with generative AI through integrating Design, AI, and Domain Knowledge, transitioning from passive users to active creators of AI systems.
Details
Motivation: As generative AI enters higher education, students often use AI systems passively rather than actively creating AI-enabled knowledge tools. The research aims to understand how students can transition from using AI as a tool to designing AI as a collaborative teammate.Method: Qualitative multi-case study of a graduate course at SUTD where students developed domain-specific custom GPT systems without coding. Analyzed three projects (Interview Companion GPT, Urban Observer GPT, Buddy Buddy) across design, AI architecture, and domain expertise dimensions.
Result: Effective human-AI collaboration emerges when three “languages” are orchestrated: domain knowledge structures AI logic, design mediates human-AI interaction, and AI extends learners’ cognitive capacity. Building AI systems serves as constructionist learning that strengthens AI literacy, metacognition, and learner agency.
Conclusion: The Trilingual Triad framework demonstrates how integrating Design, AI, and Domain Knowledge enables students to become active creators of AI systems rather than passive users, fostering deeper learning and collaboration with AI.
Abstract: This paper introduces the “Trilingual Triad” framework, a model that explains how students learn to design with generative artificial intelligence (AI) through the integration of Design, AI, and Domain Knowledge. As generative AI rapidly enters higher education, students often engage with these systems as passive users of generated outputs rather than active creators of AI-enabled knowledge tools. This study investigates how students can transition from using AI as a tool to designing AI as a collaborative teammate. The research examines a graduate course, Creating the Frontier of No-code Smart Cities at the Singapore University of Technology and Design (SUTD), in which students developed domain-specific custom GPT systems without coding. Using a qualitative multi-case study approach, three projects - the Interview Companion GPT, the Urban Observer GPT, and Buddy Buddy - were analyzed across three dimensions: design, AI architecture, and domain expertise. The findings show that effective human-AI collaboration emerges when these three “languages” are orchestrated together: domain knowledge structures the AI’s logic, design mediates human-AI interaction, and AI extends learners’ cognitive capacity. The Trilingual Triad framework highlights how building AI systems can serve as a constructionist learning process that strengthens AI literacy, metacognition, and learner agency.
[369] Enhancing Zero-shot Commonsense Reasoning by Integrating Visual Knowledge via Machine Imagination
Hyuntae Park, Yeachan Kim, SangKeun Lee
Main category: cs.AI
TL;DR: Imagine framework enhances zero-shot commonsense reasoning by supplementing textual inputs with machine-generated images to mitigate human reporting biases in language models.
Details
Motivation: Pre-trained Language Models (PLMs) acquire commonsense knowledge but suffer from human reporting biases in textual data, creating understanding discrepancies between machines and humans. The authors aim to bridge this gap by incorporating visual signals.Method: Proposes Imagine framework that embeds an image generator directly into the reasoning pipeline to supplement textual inputs with machine-generated images. Constructs synthetic datasets to emulate visual question-answering scenarios for effective utilization of visual context.
Result: Imagine substantially outperforms existing zero-shot approaches and even surpasses advanced large language models on multiple commonsense reasoning benchmarks.
Conclusion: Machine imagination can effectively mitigate reporting bias and significantly enhance the generalization ability of commonsense reasoning models by incorporating visual signals alongside textual inputs.
Abstract: Recent advancements in zero-shot commonsense reasoning have empowered Pre-trained Language Models (PLMs) to acquire extensive commonsense knowledge without requiring task-specific fine-tuning. Despite this progress, these models frequently suffer from limitations caused by human reporting biases inherent in textual knowledge, leading to understanding discrepancies between machines and humans. To bridge this gap, we introduce an additional modality to enrich the reasoning capabilities of PLMs. We propose Imagine (Machine Imagination-based Reasoning), a novel zero-shot commonsense reasoning framework that supplements textual inputs with visual signals from machine-generated images. Specifically, we enhance PLMs with the ability to imagine by embedding an image generator directly into the reasoning pipeline. To facilitate effective utilization of this imagined visual context, we construct synthetic datasets designed to emulate visual question-answering scenarios. Through comprehensive evaluations on multiple commonsense reasoning benchmarks, we demonstrate that Imagine substantially outperforms existing zero-shot approaches and even surpasses advanced large language models. These results underscore the capability of machine imagination to mitigate reporting bias and significantly enhance the generalization ability of commonsense reasoning models
[370] WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents
Sicheng Fan, Qingyun Shi, Shengze Xu, Shengbo Cai, Tieyong Zeng, Li Ling, Yanyi Shang, Dehan Kong
Main category: cs.AI
TL;DR: WebFactory introduces an automated RL pipeline for GUI agents that compresses LLM knowledge into efficient actions using synthetic environments, achieving strong generalization with minimal training data.
Details
Motivation: Current GUI agent training paradigms are limited by unsafe live web interactions or costly human-crafted data. The authors argue that data volume is less important than efficiently compressing LLM latent knowledge into actionable agent behavior.Method: WebFactory is a fully automated closed-loop reinforcement learning pipeline featuring: scalable environment synthesis, knowledge-aware task generation, LLM-powered trajectory collection, decomposed reward RL training, and systematic agent evaluation.
Result: The agent trained on synthetic data from only 10 websites achieves performance comparable to GUI agents trained on the same amount of human-annotated data from much larger environments. It shows superior performance in offline and online transfer benchmarks and outperforms the base foundation model.
Conclusion: This work presents a scalable, cost-effective paradigm for transforming passive internet knowledge into active, grounded intelligence, marking a step toward general-purpose interactive agents. It also provides insights into the “embodiment potential” of different LLM foundations.
Abstract: Current paradigms for training GUI agents are fundamentally limited by a reliance on either unsafe, non-reproducible live web interactions or costly, scarce human-crafted data and environments. We argue this focus on data volume overlooks a more critical factor: the efficiency of compressing a large language model’s (LLM) latent knowledge into actionable agent behavior. We introduce WebFactory, a novel, fully automated closed-loop reinforcement learning pipeline for GUI agents, systematically compressing LLM-encoded internet intelligence into efficient, grounded actions. Our pipeline features a process of scalable environment synthesis, knowledge-aware task generation, LLM-powered trajectory collection, decomposed reward RL training, and systematic agent evaluation. Remarkably, our agent demonstrates exceptional data efficiency and generalization. Trained on synthetic data from only 10 websites within WebFactory, it achieves performance comparable to GUI agents trained on the same amount of human-annotated data from a much larger set of environments. This superior performance is consistent across our internal offline and online transfer benchmarks, where our agent also significantly outperforms the base foundation model. We further provide critical insights into the “embodiment potential” of different LLM foundations, offering a new axis for model evaluation. This work presents a scalable and cost-effective paradigm for transforming passive internet knowledge into active, grounded intelligence, marking a critical step towards general-purpose interactive agents.
[371] Bidirectional Curriculum Generation: A Multi-Agent Framework for Data-Efficient Mathematical Reasoning
Boren Hu, Xiao Liu, Boci Peng, Xinping Zhao, Xiaoran Shang, Yun Zhu, Lijun Wu
Main category: cs.AI
TL;DR: Bidirectional Curriculum Generation framework uses multi-agent system to dynamically create math problems that either complicate to challenge or simplify to repair reasoning failures, optimizing learning trajectory with fewer samples.
Details
Motivation: Standard curriculum learning approaches (simple-to-complex) inefficiently escalate complexity even when foundational gaps persist, wasting computation on unsolvable problems. Need to maximize instructional value of every training sample.Method: Multi-agent ecosystem that establishes closed feedback loop to dynamically generate data - either complicating problems to challenge the model or simplifying them to repair specific reasoning failures. Grounded in Optimal Pacing Theorem.
Result: Significantly outperforms baselines while achieving superior reasoning performance with substantially fewer instruction samples.
Conclusion: Bidirectional curriculum generation optimizes learning trajectory by ensuring models consume only the most effective data at any given stage, improving data efficiency in mathematical reasoning.
Abstract: Enhancing mathematical reasoning in Large Language Models typically demands massive datasets, yet data efficiency remains a critical bottleneck. While Curriculum Learning attempts to structure this process, standard unidirectional approaches (simple-to-complex) suffer from inefficient sample utilization: they blindly escalate complexity even when foundational gaps persist, leading to wasted computation on unsolvable problems. To maximize the instructional value of every training sample, we introduce a novel Bidirectional Curriculum Generation framework. Unlike rigid trajectories, our multi-agent ecosystem mimics adaptive pedagogy to establish a closed feedback loop. It dynamically generates data by either complicating problems to challenge the model or, crucially, simplying them to repair specific reasoning failures. This mechanism ensures that the model consumes only the most effective data at any given stage. Grounded in the Optimal Pacing Theorem, our approach optimizes the learning trajectory, significantly outperforming baselines while achieving superior reasoning performance with substantially fewer instruction samples.
[372] KARL: Knowledge Agents via Reinforcement Learning
Jonathan D. Chang, Andrew Drozdov, Shubham Toshniwal, Owen Oertell, Alexander Trott, Jacob Portes, Abhay Gupta, Pallavi Koppol, Ashutosh Baheti, Sean Kulinski, Ivan Zhou, Irene Dea, Krista Opsahl-Ong, Simon Favreau-Lessard, Sean Owen, Jose Javier Gonzalez Ortiz, Arnav Singhvi, Xabi Andrade, Cindy Wang, Kartik Sreenivasan, Sam Havens, Jialu Liu, Peyton DeNiro, Wen Sun, Michael Bendersky, Jonathan Frankle
Main category: cs.AI
TL;DR: KARL: A system for training enterprise search agents via reinforcement learning with multi-task training, synthetic data generation, and novel RL paradigm achieving state-of-the-art performance on diverse search tasks.
Details
Motivation: Enterprise search requires agents that can handle diverse, hard-to-verify tasks across different search regimes. Current approaches often specialize in single benchmarks and lack generalization across heterogeneous search behaviors.Method: Four core contributions: 1) KARLBench evaluation suite with six distinct search regimes, 2) Multi-task training across heterogeneous search behaviors, 3) Agentic synthesis pipeline for generating diverse training data with long-horizon reasoning and tool use, 4) Iterative large-batch off-policy RL paradigm for sample-efficient multi-task training.
Result: KARL achieves Pareto-optimal performance on KARLBench across cost-quality and latency-quality trade-offs, outperforming Claude 4.6 and GPT 5.2, including on out-of-distribution tasks. With sufficient test-time compute, it surpasses strongest closed models.
Conclusion: Tailored synthetic data combined with multi-task reinforcement learning enables cost-efficient, high-performing knowledge agents for grounded reasoning across diverse enterprise search tasks.
Abstract: We present a system for training enterprise search agents via reinforcement learning that achieves state-of-the-art performance across a diverse suite of hard-to-verify agentic search tasks. Our work makes four core contributions. First, we introduce KARLBench, a multi-capability evaluation suite spanning six distinct search regimes, including constraint-driven entity search, cross-document report synthesis, tabular numerical reasoning, exhaustive entity retrieval, procedural reasoning over technical documentation, and fact aggregation over internal enterprise notes. Second, we show that models trained across heterogeneous search behaviors generalize substantially better than those optimized for any single benchmark. Third, we develop an agentic synthesis pipeline that employs long-horizon reasoning and tool use to generate diverse, grounded, and high-quality training data, with iterative bootstrapping from increasingly capable models. Fourth, we propose a new post-training paradigm based on iterative large-batch off-policy RL that is sample efficient, robust to train-inference engine discrepancies, and naturally extends to multi-task training with out-of-distribution generalization. Compared to Claude 4.6 and GPT 5.2, KARL is Pareto-optimal on KARLBench across cost-quality and latency-quality trade-offs, including tasks that were out-of-distribution during training. With sufficient test-time compute, it surpasses the strongest closed models. These results show that tailored synthetic data in combination with multi-task reinforcement learning enables cost-efficient and high-performing knowledge agents for grounded reasoning.
[373] AI+HW 2035: Shaping the Next Decade
Deming Chen, Jason Cong, Azalia Mirhoseini, Christos Kozyrakis, Subhasish Mitra, Jinjun Xiong, Cliff Young, Anima Anandkumar, Michael Littman, Aron Kirschen, Sophia Shao, Serge Leef, Naresh Shanbhag, Dejan Milojicic, Michael Schulte, Gert Cauwenberghs, Jerry M. Chow, Tri Dao, Kailash Gopalakrishnan, Richard Ho, Hoshik Kim, Kunle Olukotun, David Z. Pan, Mark Ren, Dan Roth, Aarti Singh, Yizhou Sun, Yusu Wang, Yann LeCun, Ruchir Puri
Main category: cs.AI
TL;DR: A vision paper proposing a 10-year roadmap for AI+hardware co-design to achieve 1000x efficiency improvements and enable energy-aware, self-optimizing AI systems across cloud, edge, and physical environments.
Details
Motivation: Current fragmentation in AI and hardware development constrains progress toward holistic, sustainable, and adaptive AI systems. The future of AI depends on scaling efficiency (intelligence per joule) rather than unbounded compute consumption, requiring rethinking of the entire computing stack.Method: Vision paper approach: articulates key insights around energy efficiency, system-level integration, and cross-layer optimization; identifies challenges and opportunities; proposes integrated solutions grounded in algorithmic innovation, hardware advances, and software abstraction.
Result: Defines success metrics for 10-year horizon: 1000x improvement in AI training/inference efficiency; energy-aware, self-optimizing systems spanning cloud/edge/physical AI; democratized access to AI infrastructure; human-centric design principles embedded in intelligent systems.
Conclusion: Calls for coordinated national initiatives, shared infrastructure, workforce development, cross-agency collaboration, and sustained public-private partnerships to make AI+HW co-design a unifying long-term mission.
Abstract: Artificial intelligence (AI) and hardware (HW) are advancing at unprecedented rates, yet their trajectories have become inseparably intertwined. The global research community lacks a cohesive, long-term vision to strategically coordinate the development of AI and HW. This fragmentation constrains progress toward holistic, sustainable, and adaptive AI systems capable of learning, reasoning, and operating efficiently across cloud, edge, and physical environments. The future of AI depends not only on scaling intelligence, but on scaling efficiency, achieving exponential gains in intelligence per joule, rather than unbounded compute consumption. Addressing this grand challenge requires rethinking the entire computing stack. This vision paper lays out a 10-year roadmap for AI+HW co-design and co-development, spanning algorithms, architectures, systems, and sustainability. We articulate key insights that redefine scaling around energy efficiency, system-level integration, and cross-layer optimization. We identify key challenges and opportunities, candidly assess potential obstacles and pitfalls, and propose integrated solutions grounded in algorithmic innovation, hardware advances, and software abstraction. Looking ahead, we define what success means in 10 years: achieving a 1000x improvement in efficiency for AI training and inference; enabling energy-aware, self-optimizing systems that seamlessly span cloud, edge, and physical AI; democratizing access to advanced AI infrastructure; and embedding human-centric principles into the design of intelligent systems. Finally, we outline concrete action items for academia, industry, government, and the broader community, calling for coordinated national initiatives, shared infrastructure, workforce development, cross-agency collaboration, and sustained public-private partnerships to ensure that AI+HW co-design becomes a unifying long-term mission.
[374] Reclaiming Lost Text Layers for Source-Free Cross-Domain Few-Shot Learning
Zhenyu Zhang, Guangyao Chen, Yixiong Zou, Yuhua Li, Ruixuan Li
Main category: cs.AI
TL;DR: Removing certain middle layers from CLIP’s text encoder improves cross-domain few-shot learning, and the paper proposes methods to re-utilize this “lost” information through layer and encoder-level guidance.
Details
Motivation: Current SF-CDFSL methods using CLIP show that the text encoder is more suitable for cross-domain tasks, but removing certain middle layers paradoxically improves performance. The authors investigate why this happens and propose to better utilize this "lost" information rather than simply discarding it.Method: The paper proposes a method to re-utilize information from the “lost layers” at both layer and encoder levels. This guides the re-learning of the visual branch under domain shifts, addressing the underutilization of text encoder information in cross-domain settings.
Result: Extensive experiments across various settings, backbones (CLIP, SigLip, PE-Core), and tasks (4 CDFSL datasets and 10 Meta-dataset datasets) demonstrate the effectiveness of the proposed method in improving SF-CDFSL performance.
Conclusion: The “lost layers” phenomenon in CLIP’s text encoder reveals underutilized beneficial information due to visual gaps. The proposed re-utilization approach effectively addresses this issue and improves cross-domain few-shot learning performance.
Abstract: Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL) focuses on fine-tuning with limited training data from target domains (e.g., medical or satellite images), where CLIP has recently shown promising results due to its generalizability to downstream tasks. Current works indicate CLIP’s text encoder is more suitable for cross-domain tasks, however, we find that \textbf{removing certain middle layers of the text encoder can effectively improve performance in SF-CDFSL}, which we call the Lost Layers. In this paper, we delve into this phenomenon for a deeper understanding. We discover that instead of being harmful for the SF-CDFSL task, the information in these layers is actually beneficial, but visual gaps prevent this useful information from being fully utilized, making these layers seem redundant. Based on this understanding, unlike current works that simply remove these layers, we propose a method to teachs the model to \textbf{re-utilize} information in these lost layers at both the layer and encoder levels, guiding the re-learning of the visual branch under domain shifts. Our approach effectively addresses the issue of underutilized information in the text encoder. Extensive experiments across various settings, backbones (CLIP, SigLip, PE-Core), and tasks (4 CDFSL datasets and 10 Meta-dataset datasets) demonstrate the effectiveness of our method. Code is available at https://github.com/zhenyuZ-HUST/CVPR26-VtT.
[375] GCAgent: Enhancing Group Chat Communication through Dialogue Agents System
Zijie Meng, Zheyong Xie, Zheyu Ye, Chonggang Lu, Zuozhu Liu, Zihan Niu, Yao Hu, Shaosheng Cao
Main category: cs.AI
TL;DR: GCAgent is an LLM-driven system that enhances group chat communication by introducing entertainment- and utility-oriented dialogue agents to multi-participant conversations.
Details
Motivation: Group chats in online social platforms often suffer from inactivity and management challenges. While LLMs have powered impressive one-to-one conversational agents, their integration into multi-participant conversations remains unexplored.Method: The system comprises three integrated modules: Agent Builder (customizes agents to align with users’ interests), Dialogue Manager (coordinates dialogue states and manages agent invocations), and Interface Plugins (reduces interaction barriers with three distinct tools).
Result: GCAgent achieved an average score of 4.68 across various criteria and was preferred in 51.04% of cases compared to its base model. In real-world deployments over 350 days, it increased message volume by 28.80%, significantly improving group activity and engagement.
Conclusion: This work presents a practical blueprint for extending LLM-based dialogue agents from one-party chats to multi-party group scenarios, effectively addressing group chat inactivity and management challenges.
Abstract: As a key form in online social platforms, group chat is a popular space for interest exchange or problem-solving, but its effectiveness is often hindered by inactivity and management challenges. While recent large language models (LLMs) have powered impressive one-to-one conversational agents, their seamlessly integration into multi-participant conversations remains unexplored. To address this gap, we introduce GCAgent, an LLM-driven system for enhancing group chats communication with both entertainment- and utility-oriented dialogue agents. The system comprises three tightly integrated modules: Agent Builder, which customizes agents to align with users’ interests; Dialogue Manager, which coordinates dialogue states and manage agent invocations; and Interface Plugins, which reduce interaction barriers by three distinct tools. Through extensive experiment, GCAgent achieved an average score of 4.68 across various criteria and was preferred in 51.04% of cases compared to its base model. Additionally, in real-world deployments over 350 days, it increased message volume by 28.80%, significantly improving group activity and engagement. Overall, this work presents a practical blueprint for extending LLM-based dialogue agent from one-party chats to multi-party group scenarios.
[376] X-RAY: Mapping LLM Reasoning Capability via Formalized and Calibrated Probes
Gao Tianxi, Cai Yufan, Yuan Yusi, Dong Jin Song
Main category: cs.AI
TL;DR: X-RAY is an explainable reasoning analysis system that uses formally verified probes to map LLM reasoning capabilities by analyzing structural properties like constraint interaction and solution-space geometry.
Details
Motivation: Current LLM evaluations focus on task-level accuracy but conflate pattern matching with true reasoning capability, lacking tools to systematically analyze and understand the structural aspects of reasoning.Method: X-RAY generates calibrated formal probes with controlled structural variations, uses formal tools for verification, and analyzes reasoning through properties like constraint interaction, reasoning depth, and solution-space geometry.
Result: LLMs show systematic asymmetry: robust to constraint refinement but degrade under solution-space restructuring. The framework differentiates models indistinguishable on standard benchmarks and reveals structurally interpretable failure modes.
Conclusion: X-RAY provides a contamination-free framework for evaluating, training, and testing reasoning models with formal structural analysis, offering deeper insights beyond traditional accuracy metrics.
Abstract: Large language models (LLMs) achieve promising performance, yet their ability to reason remains poorly understood. Existing evaluations largely emphasize task-level accuracy, often conflating pattern matching with reasoning capability. We present X-RAY, an explainable reasoning analysis system that maps the LLM reasoning capability using calibrated, formally verified probes. We model reasoning capability as a function of extractable \textit{structure}, operationalized through formal properties such as constraint interaction, reasoning depth, and solution-space geometry. X-Ray generates probes via formal tools with controlled structural variations, enabling precise isolation of incremental structural information through formal calibration and verification. We evaluate state-of-the-art LLMs on problems ranging from junior-level to advanced in mathematics, physics, and chemistry. Our analysis reveals a systematic asymmetry in LLM reasoning: models are relatively robust to constraint refinement, where additional conditions shrink an existing solution space, but degrade sharply under solution-space restructuring, where modifications alter the underlying structural form of the solution manifold. Moreover, calibrated formal probes differentiate models that appear indistinguishable on standard benchmarks and reveal failure modes that are structurally interpretable rather than opaque. Beyond evaluation, our framework is contamination-free and supports the training and testing of reasoning models.
[377] STRUCTUREDAGENT: Planning with AND/OR Trees for Long-Horizon Web Tasks
ELita Lobo, Xu Chen, Jingjing Meng, Nan Xi, Yang Jiao, Chirag Agarwal, Yair Zick, Yan Gao
Main category: cs.AI
TL;DR: STRUCTUREDAGENT is a hierarchical planning framework for web agents that uses AND/OR trees for efficient search and structured memory to track candidate solutions, improving performance on long-horizon web tasks.
Details
Motivation: Existing web agents struggle with complex, long-horizon tasks due to limited in-context memory, weak planning abilities, and greedy behaviors leading to premature termination.Method: Proposes STRUCTUREDAGENT with two core components: (1) online hierarchical planner using dynamic AND/OR trees for efficient search, and (2) structured memory module that tracks and maintains candidate solutions to improve constraint satisfaction.
Result: STRUCTUREDAGENT improves performance on long-horizon web-browsing tasks compared to standard LLM-based agents on WebVoyager, WebArena, and custom shopping benchmarks.
Conclusion: The hierarchical planning framework with structured memory addresses key limitations of current web agents and enables interpretable hierarchical plans for easier debugging and human intervention.
Abstract: Recent advances in large language models (LLMs) have enabled agentic systems for sequential decision-making. Such agents must perceive their environment, reason across multiple time steps, and take actions that optimize long-term objectives. However, existing web agents struggle on complex, long-horizon tasks due to limited in-context memory for tracking history, weak planning abilities, and greedy behaviors that lead to premature termination. To address these challenges, we propose STRUCTUREDAGENT, a hierarchical planning framework with two core components: (1) an online hierarchical planner that uses dynamic AND/OR trees for efficient search and (2) a structured memory module that tracks and maintains candidate solutions to improve constraint satisfaction in information-seeking tasks. The framework also produces interpretable hierarchical plans, enabling easier debugging and facilitating human intervention when needed. Our results on WebVoyager, WebArena, and custom shopping benchmarks show that STRUCTUREDAGENT improves performance on long-horizon web-browsing tasks compared to standard LLM-based agents.
[378] WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces
Sicheng Fan, Rui Wan, Yifei Leng, Gaoning Liang, Li Ling, Yanyi Shang, Dehan Kong
Main category: cs.AI
TL;DR: WebChain is the largest open-source dataset of human-annotated web interaction trajectories with visual, structural, and action alignment, enabling research on web agents through a dual mid-training approach.
Details
Motivation: Current web agent research lacks large-scale, high-quality datasets with multi-modal supervision for complex real-world tasks. Synthetic methods often miss high-value tasks, and existing datasets don't provide the rich visual, structural, and action alignment needed for robust web agent development.Method: Created WebChain dataset with 31,725 trajectories and 318k steps using scalable human annotation pipeline. Proposed Dual Mid-Training recipe that decouples spatial grounding (understanding UI elements) from planning (task execution) to improve web agent performance.
Result: Achieved state-of-the-art performance on WebChainBench and other public GUI benchmarks. The dataset provides comprehensive multi-modal supervision with triple alignment of visual, structural, and action data for web agent training and evaluation.
Conclusion: WebChain enables reproducible research in web agents by providing the largest open-source dataset with rich multi-modal supervision and a novel training approach that separates spatial grounding from planning, advancing scalable web agent development.
Abstract: We introduce WebChain, the largest open-source dataset of human-annotated trajectories on real-world websites, designed to accelerate reproducible research in web agents. It contains 31,725 trajectories and 318k steps, featuring a core Triple Alignment of visual, structural, and action data to provide rich, multi-modal supervision. The data is collected via a scalable pipeline that ensures coverage of complex, high-value tasks often missed by synthetic methods. Leveraging this dataset, we propose a Dual Mid-Training recipe that decouples spatial grounding from planning, achieving state-of-the-art performance on our proposed WebChainBench and other public GUI benchmarks. Our work provides the data and insights necessary to build and rigorously evaluate the next generation of scalable web agents.
[379] UniSTOK: Uniform Inductive Spatio-Temporal Kriging
Lewei Xie, Haoyu Zhang, Juan Yuan, Liangjun You, Yulong Chen, Yifan Zhang
Main category: cs.AI
TL;DR: UniSTOK is a plug-and-play framework that enhances inductive spatio-temporal kriging models to handle heterogeneous missing data in sensor observations through dual-branch processing with jigsaw augmentation and missingness mask modulation.
Details
Motivation: Real-world sensor data often has heterogeneous missing values that force kriging models to rely on crudely imputed inputs, creating three key challenges: distinguishing true signals from missingness artifacts, handling highly heterogeneous missingness patterns, and dealing with distorted spatio-temporal structures.Method: Proposes a dual-branch input approach with original observations and a jigsaw-augmented counterpart that synthesizes proxy signals only at missing entries. Both branches are processed by a shared spatio-temporal backbone with explicit missingness mask modulation, and outputs are adaptively fused via dual-channel attention.
Result: Experiments on multiple real-world datasets under diverse missing patterns demonstrate consistent and significant improvements over existing methods.
Conclusion: UniSTOK provides an effective plug-and-play framework that enhances inductive kriging backbones to handle missing observations, addressing key challenges in real-world spatio-temporal data analysis.
Abstract: Spatio-temporal kriging aims to infer signals at unobserved locations from observed sensors and is critical to applications such as transportation and environmental monitoring. In practice, however, observed sensors themselves often exhibit heterogeneous missingness, forcing inductive kriging models to rely on crudely imputed inputs. This setting brings three key challenges: (1) it is unclear whether an value is a true signal or a missingness-induced artifact; (2) missingness is highly heterogeneous across sensors and time; (3) missing observations distort the local spatio-temporal structure. To address these issues, we propose Uniform Inductive Spatio-Temporal Kriging (UniSTOK), a plug-and-play framework that enhances existing inductive kriging backbones under missing observation. Our framework forms a dual-branch input consisting of the original observations and a jigsaw-augmented counterpart that synthesizes proxy signals only at missing entries. The two branches are then processed in parallel by a shared spatio-temporal backbone with explicit missingness mask modulation. Their outputs are finally adaptively fused via dual-channel attention. Experiments on multiple real-world datasets under diverse missing patterns demonstrate consistent and significant improvements.
[380] Building AI Coding Agents for the Terminal: Scaffolding, Harness, Context Engineering, and Lessons Learned
Nghi D. Q. Bui
Main category: cs.AI
TL;DR: OPENDEV is an open-source CLI-based coding agent designed for terminal-native AI assistance with safety controls and efficient context management for long-horizon development tasks.
Details
Motivation: The AI coding assistance landscape is shifting from complex IDE plugins to terminal-native agents that operate where developers actually work (source control, builds, deployment). CLI-based agents offer better autonomy for long-horizon development tasks but require strict safety controls and efficient context management to prevent context bloat and reasoning degradation.Method: OPENDEV uses a compound AI system architecture with workload-specialized model routing, dual-agent architecture separating planning from execution, lazy tool discovery, adaptive context compaction (progressively reduces older observations), automated memory system for project-specific knowledge accumulation across sessions, and event-driven system reminders to counteract instruction fade-out.
Result: OPENDEV provides a secure, extensible foundation for terminal-first AI assistance with explicit reasoning phases and prioritized context efficiency, offering a blueprint for robust autonomous software engineering.
Conclusion: OPENDEV represents a new paradigm in AI coding assistance by providing a CLI-based agent architecture that addresses safety and context management challenges for autonomous software engineering tasks.
Abstract: The landscape of AI coding assistance is undergoing a fundamental shift from complex IDE plugins to versatile, terminal-native agents. Operating directly where developers manage source control, execute builds, and deploy environments, CLI-based agents offer unprecedented autonomy for long-horizon development tasks. In this paper, we present OPENDEV, an open-source, command-line coding agent engineered specifically for this new paradigm. Effective autonomous assistance requires strict safety controls and highly efficient context management to prevent context bloat and reasoning degradation. OPENDEV overcomes these challenges through a compound AI system architecture with workload-specialized model routing, a dual-agent architecture separating planning from execution, lazy tool discovery, and adaptive context compaction that progressively reduces older observations. Furthermore, it employs an automated memory system to accumulate project-specific knowledge across sessions and counteracts instruction fade-out through event-driven system reminders. By enforcing explicit reasoning phases and prioritizing context efficiency, OPENDEV provides a secure, extensible foundation for terminal-first AI assistance, offering a blueprint for robust autonomous software engineering.
[381] Ailed: A Psyche-Driven Chess Engine with Dynamic Emotional Modulation
Diego Armando Resendez Prado
Main category: cs.AI
TL;DR: A framework for adding human-like behavioral variability to chess engines using personality and psyche components with audio-inspired signal processing to reshape move probabilities.
Details
Motivation: Chess engines have surpassed human strength but lack human-like behavioral variability - they don't exhibit the psychological factors like stress, overconfidence, or tilt that affect human play under pressure.Method: Proposes personality (static preset) x psyche (dynamic scalar) decomposition. Psyche is recomputed from five positional factors after each move. Both feed into an audio-inspired signal chain (noise gate, compressor/expander, equalizer, saturation limiter) that reshapes move probability distributions without needing search or maintaining state.
Result: Tested across 12,414 games against Maia2-1100. Showed monotonic gradient in top-move agreement (~20-25pp spread from stress to overconfidence). Under stress, competitive score fell from 50.8% to 30.1%. Under overconfidence, 66% agreement with vanilla Maia2.
Conclusion: The signal chain successfully introduces human-like behavioral variability independent of the underlying engine, creating patterns reminiscent of tilt and overconfidence in human play, though no human-subject validation was conducted.
Abstract: Chess engines passed human strength years ago, but they still don’t play like humans. A grandmaster under clock pressure blunders in ways a club player on a hot streak never would. Conventional engines capture none of this. This paper proposes a personality x psyche decomposition to produce behavioral variability in chess play, drawing on patterns observed in human games. Personality is static – a preset that pins down the engine’s character. Psyche is dynamic – a bounded scalar ψ_t \in [-100, +100], recomputed from five positional factors after every move. These two components feed into an audio-inspired signal chain (noise gate, compressor/expander, five-band equalizer, saturation limiter) that reshapes move probability distributions on the fly. The chain doesn’t care what engine sits behind it: any system that outputs move probabilities will do. It needs no search and carries no state beyond ψ_t. I test the framework across 12,414 games against Maia2-1100, feeding it two probability sources that differ by ~2,800x in training data. Both show the same monotonic gradient in top-move agreement (~20-25 pp spread from stress to overconfidence), which tells us the behavioral variation comes from the signal chain, not from the model underneath. When the psyche runs overconfident, the chain mostly gets out of the way (66% agreement with vanilla Maia2). Under stress, the competitive score falls from 50.8% to 30.1%. The patterns are reminiscent of tilt and overconfidence as described in human play, but I should be upfront: this study includes no human-subject validation.
[382] PACE: A Personalized Adaptive Curriculum Engine for 9-1-1 Call-taker Training
Zirong Chen, Hongchao Zhang, Meiyi Ma
Main category: cs.AI
TL;DR: PACE is a personalized adaptive curriculum engine for 9-1-1 call-taker training that uses probabilistic skill modeling and contextual bandits to recommend optimal training scenarios, achieving faster competence and higher mastery.
Details
Motivation: 9-1-1 call-taking training requires mastery of over a thousand interdependent skills, but there's a nationwide labor shortage straining training capacity. Current practice cannot scale personalized instruction tailored to each trainee's evolving competencies.Method: PACE maintains probabilistic beliefs over trainee skill states, models individual learning and forgetting dynamics, propagates evidence over a structured skill graph, and uses contextual bandits to recommend training scenarios that balance new skill acquisition with retention of existing ones.
Result: PACE achieves 19.50% faster time-to-competence and 10.95% higher terminal mastery compared to state-of-the-art frameworks. Co-pilot studies show 95.45% alignment with expert pedagogical judgments, and reduces turnaround time from 11.58 minutes to 34 seconds (95.08% reduction).
Conclusion: PACE effectively augments trainer decision-making by providing personalized adaptive training recommendations, addressing scalability challenges in emergency communications training through AI-powered curriculum optimization.
Abstract: 9-1-1 call-taking training requires mastery of over a thousand interdependent skills, covering diverse incident types and protocol-specific nuances. A nationwide labor shortage is already straining training capacity, but effective instruction still demands that trainers tailor objectives to each trainee’s evolving competencies. This personalization burden is one that current practice cannot scale. Partnering with Metro Nashville Department of Emergency Communications (MNDEC), we propose PACE (Personalized Adaptive Curriculum Engine), a co-pilot system that augments trainer decision-making by (1) maintaining probabilistic beliefs over trainee skill states, (2) modeling individual learning and forgetting dynamics, and (3) recommending training scenarios that balance acquisition of new competencies with retention of existing ones. PACE propagates evidence over a structured skill graph to accelerate diagnostic coverage and applies contextual bandits to select scenarios that target gaps the trainee is prepared to address. Empirical results show that PACE achieves 19.50% faster time-to-competence and 10.95% higher terminal mastery compared to state-of-the-art frameworks. Co-pilot studies with practicing training officers further demonstrate a 95.45% alignment rate between PACE’s and experts’ pedagogical judgments on real-world cases. Under estimation, PACE cuts turnaround time to merely 34 seconds from 11.58 minutes, up to 95.08% reduction.
[383] Legal interpretation and AI: from expert systems to argumentation and LLMs
Václav Janeček, Giovanni Sartor
Main category: cs.AI
TL;DR: AI and Law research addresses legal interpretation through three main approaches: expert systems for knowledge engineering, argumentation frameworks for representing interpretive arguments, and machine learning for automated interpretation generation.
Details
Motivation: The paper aims to survey how AI research has approached the complex problem of legal interpretation, which is fundamental to legal reasoning and practice. It seeks to understand the evolving methodologies in AI and Law research for handling legal interpretation challenges.Method: The paper presents a survey/analysis of three main AI approaches to legal interpretation: 1) Expert systems focusing on legal knowledge engineering, 2) Argumentation frameworks for representing interpretive arguments and their interactions, and 3) Machine learning approaches using language models for automated interpretation generation.
Result: The analysis shows that AI and Law research has developed multiple complementary approaches to legal interpretation, each with different strengths: expert systems for precise knowledge transfer, argumentation frameworks for dialectical reasoning, and machine learning for automated generation of interpretive suggestions.
Conclusion: AI research offers diverse methodologies for tackling legal interpretation, with recent advances in machine learning and language models showing increasing practical deployment in legal practice, though each approach addresses different aspects of the interpretation problem.
Abstract: AI and Law research has encountered legal interpretation in different ways, in the context of its evolving approaches and methodologies. Research on expert system has focused on legal knowledge engineering, with the goal of ensuring that human-generated interpretations can be precisely transferred into knowledge-bases, to be consistently applied. Research on argumentation has aimed at representing the structure of interpretive arguments, as well as their dialectical interactions, to assess of the acceptability of interpretive claims within argumentation frameworks. Research on machine learning has focused on the automated generation of interpretive suggestions and arguments, through general and specialised language models, now being increasingly deployed in legal practice.
[384] Judge Reliability Harness: Stress Testing the Reliability of LLM Judges
Sunishchal Dev, Andrew Sloan, Joshua Kavner, Nicholas Kong, Morgan Sandler
Main category: cs.AI
TL;DR: The Judge Reliability Harness is an open-source library for testing the reliability of LLM judges used in AI benchmarks, evaluating their performance across different benchmarks and perturbation types.
Details
Motivation: As LLM-based scoring becomes widely deployed in AI benchmarks, there's a need for better tooling to assess the reliability of these LLM judges, since their performance can vary significantly across different contexts and perturbations.Method: The harness generates reliability tests for LLM judges given a benchmark dataset and judge configuration, evaluating both binary judgment accuracy and ordinal grading performance for free-response and agentic task formats. It tests judges across various perturbation types including text formatting changes, paraphrasing, verbosity changes, and ground truth label flipping.
Result: Evaluation of four state-of-the-art judges across four benchmarks (safety, persuasion, misuse, and agentic behavior) revealed meaningful variation in performance across models and perturbation types. No judge was uniformly reliable across all benchmarks, with consistency issues observed due to simple text perturbations.
Conclusion: The tool highlights opportunities to improve LLM judge robustness and provides a framework for systematically evaluating judge reliability in AI benchmarking.
Abstract: We present the Judge Reliability Harness, an open source library for constructing validation suites that test the reliability of LLM judges. As LLM based scoring is widely deployed in AI benchmarks, more tooling is needed to efficiently assess the reliability of these methods. Given a benchmark dataset and an LLM judge configuration, the harness generates reliability tests that evaluate both binary judgment accuracy and ordinal grading performance for free-response and agentic task formats. We evaluate four state-of-the-art judges across four benchmarks spanning safety, persuasion, misuse, and agentic behavior, and find meaningful variation in performance across models and perturbation types, highlighting opportunities to improve the robustness of LLM judges. No judge that we evaluated is uniformly reliable across benchmarks using our harness. For example, our preliminary experiments on judges revealed consistency issues as measured by accuracy in judging another LLM’s ability to complete a task due to simple text formatting changes, paraphrasing, changes in verbosity, and flipping the ground truth label in LLM-produced responses. The code for this tool is available at: https://github.com/RANDCorporation/judge-reliability-harness
[385] Dissociating Direct Access from Inference in AI Introspection
Harvey Lederman, Kyle Mahowald
Main category: cs.AI
TL;DR: AI models can detect thought injections via two mechanisms: probability-matching (inferring from prompt anomalies) and direct access to internal states, but the direct access is content-agnostic - models know something is wrong but can’t reliably identify what.
Details
Motivation: To understand the mechanisms of introspection in AI models, specifically how they detect injected representations, building on recent work showing AI models can introspect.Method: Extensive replication of Lindsey et al. (2025)’s thought injection detection paradigm in large open-source models, analyzing how models detect injected representations through probability-matching and direct access to internal states.
Result: Models detect injected representations via two separable mechanisms: probability-matching (inferring from perceived prompt anomalies) and direct access to internal states. The direct access mechanism is content-agnostic - models detect anomalies but cannot reliably identify semantic content. Models tend to confabulate high-frequency, concrete concepts.
Conclusion: AI models have a content-agnostic introspective mechanism that aligns with leading theories in philosophy and psychology, where they can detect that something is anomalous but cannot reliably identify the specific semantic content of the anomaly.
Abstract: Introspection is a foundational cognitive ability, but its mechanism is not well understood. Recent work has shown that AI models can introspect. We study their mechanism of introspection, first extensively replicating Lindsey et al. (2025)’s thought injection detection paradigm in large open-source models. We show that these models detect injected representations via two separable mechanisms: (i) probability-matching (inferring from perceived anomaly of the prompt) and (ii) direct access to internal states. The direct access mechanism is content-agnostic: models detect that an anomaly occurred but cannot reliably identify its semantic content. The two model classes we study confabulate injected concepts that are high-frequency and concrete (e.g., “apple’”); for them correct concept guesses typically require significantly more tokens. This content-agnostic introspective mechanism is consistent with leading theories in philosophy and psychology.
[386] Distributed Partial Information Puzzles: Examining Common Ground Construction Under Epistemic Asymmetry
Yifan Zhu, Mariah Bradford, Kenneth Lai, Timothy Obiso, Videep Venkatesha, James Pustejovsky, Nikhil Krishnaswamy
Main category: cs.AI
TL;DR: A multimodal collaborative puzzle task (DPIP) with epistemic asymmetry challenges AI systems to establish common ground; dataset includes speech, gesture, action annotations; LLMs struggle with belief tracking compared to logic-based approaches.
Details
Motivation: Current AI systems struggle with establishing common ground in multimodal, multiparty collaborative settings where participants have different information (epistemic asymmetry), which is fundamental for effective collaboration.Method: Created Distributed Partial Information Puzzle (DPIP) - a collaborative construction task requiring multimodal communication under information asymmetry. Collected multimodal dataset with temporal alignment across speech, gesture, and action. Evaluated two approaches: (1) prompting state-of-the-art LLMs to infer shared beliefs from multimodal updates, and (2) an axiomatic pipeline based on Dynamic Epistemic Logic (DEL) for incremental belief tracking.
Result: DPIP poses significant challenges to modern LLMs’ abilities to track both task progression and belief state. The logic-based DEL approach performed better at the belief tracking task compared to prompted LLMs.
Conclusion: Multimodal collaborative tasks with epistemic asymmetry reveal limitations in current LLMs’ common ground reasoning capabilities, suggesting need for more structured approaches like logic-based systems or improved multimodal reasoning architectures.
Abstract: Establishing common ground, a shared set of beliefs and mutually recognized facts, is fundamental to collaboration, yet remains a challenge for current AI systems, especially in multimodal, multiparty settings, where the collaborators bring different information to the table. We introduce the Distributed Partial Information Puzzle (DPIP), a collaborative construction task that elicits rich multimodal communication under epistemic asymmetry. We present a multimodal dataset of these interactions, annotated and temporally aligned across speech, gesture, and action modalities to support reasoning over propositional content and belief dynamics. We then evaluate two paradigms for modeling common ground (CG): (1) state-of-the-art large language models (LLMs), prompted to infer shared beliefs from multimodal updates, and (2) an axiomatic pipeline grounded in Dynamic Epistemic Logic (DEL) that incrementally performs the same task. Results on the annotated DPIP data indicate that it poses a challenge to modern LLMs’ abilities to track both task progression and belief state.
[387] Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation
Benjamin Feuer, Lucas Rosenblatt, Oussama Elachqar
Main category: cs.AI
TL;DR: Proposes average bias-boundedness (A-BB) framework to provide formal guarantees against bias in LLM-as-a-Judge systems for autonomous AI feedback loops.
Details
Motivation: As AI systems become more autonomous and rely on LLM judges for feedback in sparse ground truth settings, there's a need for systems that can enforce standards with strong guarantees against bias, especially when bias vectors are unknown or adversarial.Method: Introduces average bias-boundedness (A-BB), an algorithmic framework that formally guarantees reductions of harm/impact from any measurable bias in an LLM judge. Evaluated on Arena-Hard-Auto with four LLM judges.
Result: Achieved (tau=0.5, delta=0.01) bias-bounded guarantees while retaining 61-99% correlation with original rankings across formatting and schematic bias settings, with most judge-bias combinations exceeding 80% correlation.
Conclusion: The A-BB framework provides a practical solution for ensuring bias-bounded guarantees in LLM judge systems, addressing a critical gap in autonomous AI feedback loops.
Abstract: As AI models progress beyond simple chatbots into more complex workflows, we draw ever closer to the event horizon beyond which AI systems will be utilized in autonomous, self-maintaining feedback loops. Any autonomous AI system will depend on automated, verifiable rewards and feedback; in settings where ground truth is sparse or non-deterministic, one practical source of such rewards is an LLM-as-a-Judge. Although LLM judges continue to improve, the literature has yet to introduce systems capable of enforcing standards with strong guarantees, particularly when bias vectors are unknown or adversarially discovered. To remedy this issue, we propose average bias-boundedness (A-BB), an algorithmic framework which formally guarantees reductions of harm/impact as a result of any measurable bias in an LLM judge. Evaluating on Arena-Hard-Auto with four LLM judges, we achieve (tau=0.5, delta=0.01) bias-bounded guarantees while retaining 61-99% correlation with original rankings across formatting and schematic bias settings, with most judge-bias combinations exceeding 80%. The code to reproduce our findings is available at https://github.com/penfever/bias-bounded-evaluation.
[388] The Spike, the Sparse and the Sink: Anatomy of Massive Activations and Attention Sinks
Shangwen Sun, Alfredo Canziani, Yann LeCun, Jiachen Zhu
Main category: cs.AI
TL;DR: Transformer models exhibit massive activations (extreme outliers in few channels) and attention sinks (tokens attracting disproportionate attention), which co-occur due to architectural design rather than functional necessity.
Details
Motivation: To understand the functional roles and causal relationship between massive activations and attention sinks in Transformer language models, which prior work observed frequently co-occur but whose underlying mechanisms remained unclear.Method: Systematic experiments analyzing Transformer architectures, particularly focusing on the pre-norm configuration, to examine how massive activations and attention sinks emerge and interact. The study involves ablation experiments to decouple the phenomena.
Result: Massive activations operate globally by inducing near-constant hidden representations across layers, functioning as implicit model parameters. Attention sinks operate locally by modulating attention outputs and biasing heads toward short-range dependencies. The pre-norm configuration enables their co-occurrence, and ablating it causes decoupling.
Conclusion: The co-occurrence of massive activations and attention sinks is an architectural artifact of modern Transformer design rather than a functional necessity. The two phenomena serve distinct but related functions, with pre-norm being the key architectural choice enabling their coupling.
Abstract: We study two recurring phenomena in Transformer language models: massive activations, in which a small number of tokens exhibit extreme outliers in a few channels, and attention sinks, in which certain tokens attract disproportionate attention mass regardless of semantic relevance. Prior work observes that these phenomena frequently co-occur and often involve the same tokens, but their functional roles and causal relationship remain unclear. Through systematic experiments, we show that the co-occurrence is largely an architectural artifact of modern Transformer design, and that the two phenomena serve related but distinct functions. Massive activations operate globally: they induce near-constant hidden representations that persist across layers, effectively functioning as implicit parameters of the model. Attention sinks operate locally: they modulate attention outputs across heads and bias individual heads toward short-range dependencies. We identify the pre-norm configuration as the key choice that enables the co-occurrence, and show that ablating it causes the two phenomena to decouple.
[389] Distilling Privileged Information for Dubins Traveling Salesman Problems with Neighborhoods
Min Kyu Shin, Su-Jeong Park, Seung-Keol Ryu, Heeyeon Kim, Han-Lim Choi
Main category: cs.AI
TL;DR: A novel two-phase learning approach for Dubins Traveling Salesman Problems with Neighborhood (DTSPN) that combines model-free RL with privileged information distillation and supervised learning to produce tours 50x faster than traditional heuristics.
Details
Motivation: To develop an efficient learning-based solution for DTSPN that can quickly produce tours for non-holonomic vehicles passing through neighborhoods of task points, addressing the computational inefficiency of traditional heuristic methods like LinKernighan.Method: Two-phase learning approach: 1) Model-free reinforcement learning using privileged information to distill knowledge from expert LKH trajectories, 2) Supervised learning to train an adaptation network independent of privileged information, with parameter initialization using demonstration data for training efficiency.
Result: The learning method produces solutions about 50 times faster than LKH algorithm and substantially outperforms other imitation learning and RL with demonstration schemes, most of which fail to sense all task points.
Conclusion: The proposed learning framework effectively combines privileged information distillation with supervised adaptation to create an efficient solution for DTSPN that significantly outperforms both traditional heuristics and other learning approaches.
Abstract: This paper presents a novel learning approach for Dubins Traveling Salesman Problems(DTSP) with Neighborhood (DTSPN) to quickly produce a tour of a non-holonomic vehicle passing through neighborhoods of given task points. The method involves two learning phases: initially, a model-free reinforcement learning approach leverages privileged information to distill knowledge from expert trajectories generated by the LinKernighan heuristic (LKH) algorithm. Subsequently, a supervised learning phase trains an adaptation network to solve problems independently of privileged information. Before the first learning phase, a parameter initialization technique using the demonstration data was also devised to enhance training efficiency. The proposed learning method produces a solution about 50 times faster than LKH and substantially outperforms other imitation learning and RL with demonstration schemes, most of which fail to sense all the task points.
[390] The StudyChat Dataset: Analyzing Student Dialogues With ChatGPT in an Artificial Intelligence Course
Hunter McNichols, Fareya Ikram, Andrew Lan
Main category: cs.AI
TL;DR: StudyChat dataset captures real student interactions with LLM-powered tutoring chatbot in AI course, showing correlations between usage patterns and academic outcomes.
Details
Motivation: To understand how students actually use LLM-powered tutoring tools in educational settings, as widespread LLM availability creates both opportunities and challenges for education that need empirical study.Method: Deployed web application replicating ChatGPT’s core functionalities in university AI course, logged 16,851 student interactions during programming assignments, annotated with dialogue act labeling schema based on observed patterns and prior research.
Result: Students who prompted LLMs for conceptual understanding and coding help performed better on assignments/exams, while those using LLMs to write reports and circumvent learning objectives had lower exam outcomes.
Conclusion: StudyChat provides valuable dataset for researching LLM usage in education, revealing that how students use LLMs (for learning vs. circumvention) significantly impacts academic outcomes.
Abstract: The widespread availability of large language models (LLMs), such as ChatGPT, has significantly impacted education, raising both opportunities and challenges. Students can frequently interact with LLM-powered, interactive learning tools, but their usage patterns need to be observed and understood. We introduce StudyChat, a publicly available dataset capturing real-world student interactions with an LLM-powered tutoring chatbot in a semester-long, university-level artificial intelligence (AI) course. We deploy a web application that replicates ChatGPT’s core functionalities, and use it to log student interactions with the LLM while working on programming assignments. We collect 16,851 interactions, which we annotate using a dialogue act labeling schema inspired by observed interaction patterns and prior research. We analyze these interactions, highlight usage trends, and analyze how specific student behavior correlates with their course outcome. We find that students who prompt LLMs for conceptual understanding and coding help tend to perform better on assignments and exams. Moreover, students who use LLMs to write reports and circumvent assignment learning objectives have lower outcomes on exams than others. StudyChat serves as a shared resource to facilitate further research on the evolving role of LLMs in education.
[391] Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning
Yiwei Chen, Yuguang Yao, Yihua Zhang, Bingquan Shen, Gaowen Liu, Sijia Liu
Main category: cs.AI
TL;DR: Machine unlearning approach for vision-language models to address safety mirage issues in supervised fine-tuning, reducing attack vulnerability and unnecessary rejections
Details
Motivation: Current VLMs are vulnerable to generating harmful content despite safety fine-tuning due to "safety mirage" - superficial correlations between textual patterns and safety responses rather than deep harm mitigation. This makes models vulnerable to simple attacks and causes over-prudence with benign queries.Method: Proposes machine unlearning (MU) as an alternative to supervised safety fine-tuning. MU directly removes harmful knowledge from VLMs while preserving general capabilities, avoiding biased feature-label mappings that create spurious correlations.
Result: MU-based alignment reduces attack success rate by up to 60.27% and cuts unnecessary rejections by over 84.20% across safety benchmarks. Shows superior robustness compared to supervised fine-tuning approaches.
Conclusion: Machine unlearning provides a more effective safety alignment approach for VLMs by directly addressing harmful knowledge rather than creating superficial safety correlations, leading to more robust and less over-prudent models.
Abstract: Recent vision language models (VLMs) have made remarkable strides in generative modeling with multimodal inputs, particularly text and images. However, their susceptibility to generating harmful content when exposed to unsafe queries raises critical safety concerns. While current alignment strategies primarily rely on supervised safety fine-tuning with curated datasets, we identify a fundamental limitation we call the ‘‘safety mirage’’, where supervised fine-tuning inadvertently reinforces spurious correlations between superficial textual patterns and safety responses, rather than fostering deep, intrinsic mitigation of harm. We show that these spurious correlations leave fine-tuned VLMs vulnerable even to a simple one-word modification-based attack, where substituting a single word in text queries with a spurious correlation-inducing alternative can effectively bypass safeguards. Additionally, these correlations contribute to the over-prudence, causing fine-tuned VLMs to refuse benign queries unnecessarily. To address these issues, we show machine unlearning (MU) as a powerful alternative to supervised safety fine-tuning, as it avoids biased feature-label mappings and directly removes harmful knowledge from VLMs while preserving their general capabilities. Extensive evaluations across safety benchmarks show that under MU-based alignment reduces the attack success rate by up to 60.27% and cuts unnecessary rejections by over 84.20%. WARNING: There exist AI generations that may be offensive in nature.
[392] Ice Cream Doesn’t Cause Drowning: Benchmarking LLMs Against Statistical Pitfalls in Causal Inference
Jin Du, Li Chen, Xun Xian, An Luo, Fangqiao Tian, Ganghua Wang, Charles Doss, Xiaotong Shen, Jie Ding
Main category: cs.AI
TL;DR: CausalPitfalls benchmark evaluates LLMs’ ability to handle statistical causal inference pitfalls like Simpson’s paradox and selection bias through structured challenges with grading rubrics.
Details
Motivation: Current LLM benchmarks for causal inference are oversimplified, focusing on semantic relationships rather than statistical pitfalls, limiting real-world applicability in high-stakes domains like medicine and policy.Method: Proposes CausalPitfalls benchmark with structured challenges across difficulty levels, using two evaluation protocols: direct prompting for intrinsic reasoning and code-assisted prompting for statistical analysis, with human expert validation.
Result: Reveals significant limitations in current LLMs for statistical causal inference, providing quantitative metrics showing models struggle with common statistical pitfalls.
Conclusion: The benchmark provides essential guidance for developing trustworthy causal reasoning systems, highlighting the need for improved statistical reasoning capabilities in LLMs.
Abstract: Reliable causal inference is essential for making decisions in high-stakes areas like medicine, economics, and public policy. However, it remains unclear whether large language models (LLMs) can handle rigorous and trustworthy statistical causal inference. Current benchmarks usually involve simplified tasks. For example, these tasks might only ask LLMs to identify semantic causal relationships or draw conclusions directly from raw data. As a result, models may overlook important statistical pitfalls, such as Simpson’s paradox or selection bias. This oversight limits the applicability of LLMs in the real world. To address these limitations, we propose CausalPitfalls, a comprehensive benchmark designed to rigorously evaluate the capability of LLMs in overcoming common causal inference pitfalls. Our benchmark features structured challenges across multiple difficulty levels, each paired with grading rubrics. This approach allows us to quantitatively measure both causal reasoning capabilities and the reliability of LLMs’ responses. We evaluate models using two protocols: (1) direct prompting, which assesses intrinsic causal reasoning, and (2) code-assisted prompting, where models generate executable code for explicit statistical analysis. Additionally, we validate the effectiveness of this judge by comparing its scoring with assessments from human experts. Our results reveal significant limitations in current LLMs when performing statistical causal inference. The CausalPitfalls benchmark provides essential guidance and quantitative metrics to advance the development of trustworthy causal reasoning systems.
[393] A Signal Contract for Online Language Grounding and Discovery in Decision-Making
Dimitris Panagopoulos, Adolfo Perrusquia, Weisi Guo
Main category: cs.AI
TL;DR: LUCIFER is an inference-only middleware that converts messy human language updates into control signals for autonomous systems through a Signal Contract, keeping downstream decision-makers language-agnostic while improving safety and efficiency.
Details
Motivation: Current autonomous systems couple language grounding with decision-making, increasing redeployment burden when language conventions change and making it hard to diagnose grounding errors separately from control errors.Method: Proposes LUCIFER, a language grounding middleware with a Signal Contract that provides four outputs: policy priors, reward potentials, admissible-option constraints, and telemetry-based action prediction for efficient information gathering.
Result: Validated in search-and-rescue testbed with dual-phase evaluation: component benchmarks show robustness on self-correcting reports, and system-level ablations show grounding improves safety, discovery improves efficiency, and only their combination achieves both.
Conclusion: LUCIFER successfully decouples language understanding from decision-making, enabling more maintainable and diagnosable autonomous systems while improving performance through specialized grounding middleware.
Abstract: Autonomous systems increasingly receive time-sensitive contextual updates from humans through natural language, yet embedding language understanding inside decision-makers couples grounding to learning or planning. This increases redeployment burden when language conventions or domain knowledge change and can hinder diagnosability by confounding grounding errors with control errors. We address online language grounding where messy, evolving verbal reports are converted into control-relevant signals during execution through an interface that localises language updates while keeping downstream decision-makers language-agnostic. We propose LUCIFER (Language Understanding and Context-Infused Framework for Exploration and Behavior Refinement), an inference-only middleware that exposes a Signal Contract. The contract provides four outputs, policy priors, reward potentials, admissible-option constraints, and telemetry-based action prediction for efficient information gathering. We validate LUCIFER in a search-and-rescue (SAR)-inspired testbed using dual-phase, dual-client evaluation: (i) component benchmarks show reasoning-based extraction remains robust on self-correcting reports where pattern-matching baselines degrade, and (ii) system-level ablations with two structurally distinct clients (hierarchical RL and a hybrid A*+heuristics planner) show consistent necessity and synergy. Grounding improves safety, discovery improves information-collection efficiency, and only their combination achieves both.
[394] BridgeDrive: Diffusion Bridge Policy for Closed-Loop Trajectory Planning in Autonomous Driving
Shu Liu, Wenlin Chen, Weihao Li, Zheng Wang, Lijin Yang, Jianing Huang, Yipin Zhang, Zhongzhan Huang, Ze Cheng, Hao Yang
Main category: cs.AI
TL;DR: BridgeDrive: A novel anchor-guided diffusion bridge policy for closed-loop trajectory planning in autonomous driving that transforms coarse anchor trajectories into refined plans while maintaining theoretical consistency between forward and reverse diffusion processes.
Details
Motivation: Existing diffusion-based planners for autonomous driving use truncated diffusion schedules with expert driving anchors, creating asymmetry between forward and denoising processes that diverges from diffusion model principles. This limits their effectiveness in closed-loop planning where ego vehicle actions influence future states.Method: Formulates planning as a diffusion bridge that directly transforms coarse anchor trajectories into refined, context-aware plans. Ensures theoretical consistency between forward and reverse processes, compatible with efficient ODE solvers for real-time deployment.
Result: Achieves state-of-the-art performance on Bench2Drive closed-loop evaluation benchmark, improving success rate by 7.72% and 2.45% over prior arts with PDM-Lite and LEAD datasets respectively.
Conclusion: BridgeDrive provides a theoretically consistent diffusion bridge approach for closed-loop trajectory planning that outperforms existing methods while enabling real-time deployment through efficient ODE solvers.
Abstract: Diffusion-based planners have shown strong potential for autonomous driving by capturing multi-modal driving behaviors. A key challenge is how to effectively guide these models for safe and reactive planning in closed-loop settings, where the ego vehicle’s actions influence future states. Recent work leverages typical expert driving behaviors (i.e., anchors) to guide diffusion planners but relies on a truncated diffusion schedule that introduces an asymmetry between the forward and denoising processes, diverging from the core principles of diffusion models. To address this, we introduce BridgeDrive, a novel anchor-guided diffusion bridge policy for closed-loop trajectory planning. Our approach formulates planning as a diffusion bridge that directly transforms coarse anchor trajectories into refined, context-aware plans, ensuring theoretical consistency between the forward and reverse processes. BridgeDrive is compatible with efficient ODE solvers, enabling real-time deployment. We achieve state-of-the-art performance on the Bench2Drive closed-loop evaluation benchmark, improving the success rate by 7.72% and 2.45% over prior arts with PDM-Lite and LEAD datasets, respectively. Project page: https://github.com/shuliu-ethz/BridgeDrive.
[395] OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs
Caorui Li, Yu Chen, Yiyan Ji, Jin Xu, Zhenyu Cui, Shihao Li, Yuanxing Zhang, Wentao Wang, Zhenghao Song, Dingling Zhang, Ying He, Haoxiang Liu, Yuxuan Wang, Qiufeng Wang, Jiafu Tang, Zhenhe Wu, Jiehui Luo, Zhiyu Pan, Weihao Xie, Chenchen Zhang, Zhaohui Wang, Jiayi Tian, Yanghai Wang, Zhe Cao, Minxin Dai, Ke Wang, Runzhe Wen, Yinghao Ma, Yaning Pan, Sungkyun Chang, Termeh Taheri, Haiwen Xia, Christos Plachouras, Emmanouil Benetos, Yizhi Li, Ge Zhang, Jian Yang, Tianhao Peng, Zili Wang, Minghao Liu, Junran Peng, Zhaoxiang Zhang, Jiaheng Liu
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.10689: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.10689&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[396] CoRPO: Adding a Correctness Bias to GRPO Improves Generalization
Anisha Garg, Claire Zhang, Nishit Neema, David Bick, Ganesh Venkatesh, Joel Hestness
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2511.04439: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.04439&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[397] Towards Trustworthy Legal AI through LLM Agents and Formal Reasoning
Linze Chen, Yufan Cai, Zhe Hou, Jin Song Dong
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2511.21033: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.21033&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[398] ClinNoteAgents: An LLM Multi-Agent System for Predicting and Interpreting Heart Failure 30-Day Readmission from Clinical Notes
Rongjia Zhou, Chengzhuo Li, Carl Yang, Jiaying Lu
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2512.07081: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.07081&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[399] Achieving Olympia-Level Geometry Large Language Model Agent via Complexity Boosting Reinforcement Learning
Haiteng Zhao, Junhao Shen, Yiming Zhang, Songyang Gao, Kuikun Liu, Tianyou Ma, Fan Zheng, Dahua Lin, Wenwei Zhang, Kai Chen
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error in fetching paper contentMethod: Unable to determine method due to technical error in fetching paper content
Result: Unable to determine results due to technical error in fetching paper content
Conclusion: Unable to draw conclusions due to technical error in fetching paper content
Abstract: Failed to fetch summary for 2512.10534: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.10534&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[400] HydroGEM: A Self Supervised Zero Shot Hybrid TCN Transformer Foundation Model for Continental Scale Streamflow Quality Control
Ijaz Ul Haq, Byung Suk Lee, Julia N. Perdrial, David Baude
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to failed paper retrievalMethod: Cannot determine method due to failed paper retrieval
Result: Cannot determine results due to failed paper retrieval
Conclusion: Cannot draw conclusions due to failed paper retrieval
Abstract: Failed to fetch summary for 2512.14106: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.14106&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[401] Interleaved Tool-Call Reasoning for Protein Function Understanding
Chuanliu Fan, Zicheng Ma, Huanran Meng, Aijia Zhang, Wenjie Du, Jun Zhang, Yi Qin Gao, Ziqiang Cao, Guohong Fu
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2601.03604: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.03604&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[402] PerfGuard: A Performance-Aware Agent for Visual Content Generation
Zhipeng Chen, Zhongrui Zhang, Chao Zhang, Yifan Xu, Lan Yang, Jun Liu, Ke Li, Yi-Zhe Song
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2601.22571: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.22571&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[403] Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models
Shule Lu, Yujing Wang, Hainan Zhang, Xiaoshan Yang, Hongwei Zheng, Yongxin Tong, Changsheng Xu, Zhiming Zheng
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API requestMethod: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to draw conclusions due to failed API request
Abstract: Failed to fetch summary for 2602.00485: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.00485&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[404] Pessimistic Auxiliary Policy for Offline Reinforcement Learning
Fan Zhang, Baoru Huang, Xin Zhang
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2602.23974: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23974&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[405] AutoSkill: Experience-Driven Lifelong Learning via Skill Self-Evolution
Yutao Yang, Junsong Li, Qianjun Pan, Bihao Zhan, Yuxuan Cai, Lin Du, Jie Zhou, Kai Chen, Qin Chen, Xin Li, Bo Zhang, Liang He
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2603.01145: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01145&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[406] Agents Learn Their Runtime: Interpreter Persistence as Training-Time Semantics
Victor May, Aaditya Salgarkar, Yishan Wang, Diganta Misra, Huu Nguyen
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2603.01209: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01209&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[407] ToolRLA: Multiplicative Reward Decomposition for Tool-Integrated Agents
Pengbo Liu
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2603.01620: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01620&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[408] Deep Learning Meets Mechanism Design: Key Results and Some Novel Applications
V. Udaya Sankar, Vishisht Srihari Rao, Mayank Ratan Bhardwaj, Y. Narahari
Main category: cs.AI
TL;DR: Paper ID 2401.05683 could not be fetched due to HTTP 429 error (rate limiting), so no abstract or content is available for analysis.
Details
Motivation: Unable to determine motivation as the paper content could not be retrieved due to API rate limiting.Method: No method information available since the paper abstract could not be fetched.
Result: No results available as the paper content was inaccessible.
Conclusion: Cannot draw conclusions about an inaccessible paper; the HTTP 429 error indicates the arXiv API rate limit was exceeded.
Abstract: Failed to fetch summary for 2401.05683: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2401.05683&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[409] Path Planning for Masked Diffusion Model Sampling
Fred Zhangzhi Peng, Zachary Bezemek, Sawan Patel, Jarrid Rector-Brooks, Sherwood Yao, Avishek Joey Bose, Alexander Tong, Pranam Chatterjee
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to missing paper contentMethod: Unable to determine method due to missing paper content
Result: Unable to determine results due to missing paper content
Conclusion: Unable to draw conclusions due to missing paper content
Abstract: Failed to fetch summary for 2502.03540: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.03540&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[410] FBFL: A Field-Based Coordination Approach for Data Heterogeneity in Federated Learning
Davide Domini, Gianluca Aguzzi, Lukas Esterle, Mirko Viroli
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). Need to try again later or use alternative methods to access the paper content.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot draw conclusions without access to the paper content.
Abstract: Failed to fetch summary for 2502.08577: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.08577&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[411] Generative Models in Decision Making: A Survey
Xinyu Shao, Jianping Zhang, Haozhi Wang, Leo Maxime Brunswic, Kaiwen Zhou, Jiqian Dong, Kaiyang Guo, Zhitang Chen, Jun Wang, Jianye Hao, Xiu Li, Yinchuan Li
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: Cannot analyze method as paper content is unavailable due to technical limitations
Result: No results available - paper content retrieval failed due to HTTP 429 error
Conclusion: Cannot provide analysis due to technical limitations in accessing the paper
Abstract: Failed to fetch summary for 2502.17100: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.17100&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[412] BACE-RUL: A Bi-directional Adversarial Network with Covariate Encoding for Machine Remaining Useful Life Prediction
Zekai Zhang, Dan Li, Shunyu Wu, Junya Cai, Bo Zhang, See Kiong Ng, Zibin Zheng
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2503.11730: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.11730&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[413] Advancing Problem-Based Learning in Biomedical Engineering in the Era of Generative AI
Micky C. Nnamdi, J. Ben Tamo, Benoit Marteau, Wenqi Shi, May D. Wang
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2503.16558: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.16558&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[414] Assessing the Impact of Code Changes on the Fault Localizability of Large Language Models
Sabaat Haroon, Ahmad Faraz Khan, Ahmad Humayun, Waris Gill, Abdul Haddi Amjad, Ali R. Butt, Mohammad Taha Khan, Muhammad Ali Gulzar
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2504.04372 appears to be a recent arXiv submission, but no abstract or content is available for analysis.
Details
Motivation: Cannot determine motivation without access to paper content.Method: Cannot determine method without access to paper content.
Result: Cannot determine results without access to paper content.
Conclusion: Cannot draw conclusions without access to paper content.
Abstract: Failed to fetch summary for 2504.04372: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.04372&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[415] ms-Mamba: Multi-scale Mamba for Time-Series Forecasting
Yusuf Meric Karadag, Ismail Talaz, Ipek Gursel Dino, Sinan Kalkan
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) when querying arXiv API for paper ID 2504.07654
Details
Motivation: Unable to determine motivation due to failed paper retrievalMethod: Unable to determine method due to failed paper retrieval
Result: Unable to determine results due to failed paper retrieval
Conclusion: Unable to determine conclusion due to failed paper retrieval
Abstract: Failed to fetch summary for 2504.07654: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.07654&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[416] Balancing Progress and Safety: A Novel Risk-Aware Objective for RL in Autonomous Driving
Ahmed Abouelazm, Jonas Michel, Helen Gremmelmaier, Tim Joseph, Philip Schörner, J. Marius Zöllner
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2505.06737: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.06737&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[417] Boundary-Guided Trajectory Prediction for Road Aware and Physically Feasible Autonomous Driving
Ahmed Abouelazm, Mianzhi Liu, Christian Hubschneider, Yin Wu, Daniel Slieter, J. Marius Zöllner
Main category: cs.AI
TL;DR: Unable to analyze paper 2505.06740 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation without access to paper abstractMethod: Cannot determine method without access to paper abstract
Result: Cannot determine results without access to paper abstract
Conclusion: Cannot draw conclusions without access to paper abstract
Abstract: Failed to fetch summary for 2505.06740: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.06740&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[418] VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use
Mingyuan Wu, Jingcheng Yang, Jize Jiang, Meitang Li, Kaizhuo Yan, Hanchao Yu, Minjia Zhang, Chengxiang Zhai, Klara Nahrstedt
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2505.19255: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.19255&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[419] Automatic Curriculum Learning for Driving Scenarios: Towards Robust and Efficient Reinforcement Learning
Ahmed Abouelazm, Tim Weinstein, Tim Joseph, Philip Schörner, J. Marius Zöllner
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to retrieval failureMethod: Unable to determine method due to retrieval failure
Result: Unable to determine results due to retrieval failure
Conclusion: Unable to draw conclusions due to retrieval failure
Abstract: Failed to fetch summary for 2505.08264: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.08264&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[420] RoboPARA: Dual-Arm Robot Planning with Parallel Allocation and Recomposition Across Tasks
Shiying Duan, Pei Ren, Nanxiang Jiang, Zhengping Che, Jian Tang, Zhaoxin Fan, Yifan Sun, Wenjun Wu
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2506.06683: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.06683&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[421] Bures-Wasserstein Flow Matching for Graph Generation
Keyue Jiang, Jiahao Cui, Xiaowen Dong, Laura Toni
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). Need to retry or use alternative methods to access the paper information.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot draw conclusions without access to the paper content.
Abstract: Failed to fetch summary for 2506.14020: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.14020&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[422] Structured Kolmogorov-Arnold Neural ODEs for Interpretable Learning and Symbolic Discovery of Nonlinear Dynamics
Wei Liu, Kiran Bacsa, Loon Ching Tang, Eleni Chatzi
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to missing paper contentMethod: Unable to determine method due to missing paper content
Result: Unable to determine results due to missing paper content
Conclusion: Unable to determine conclusion due to missing paper content
Abstract: Failed to fetch summary for 2506.18339: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.18339&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[423] In-Training Defenses against Emergent Misalignment in Language Models
David Kaczér, Magnus Jørgenvåg, Clemens Vetter, Esha Afzal, Robin Haselhorst, Lucie Flek, Florian Mai
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2508.06249: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.06249&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[424] LHM-Humanoid: Learning a Unified Policy for Long-Horizon Humanoid Whole-Body Loco-Manipulation in Diverse Messy Environments
Haozhuo Zhang, Jingkai Sun, Michele Caprio, Jian Tang, Shanghang Zhang, Qiang Zhang, Wei Pan
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2508.16943: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.16943&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[425] Diffusion-Based Impedance Learning for Contact-Rich Manipulation Tasks
Noah Geiger, Tamim Asfour, Neville Hogan, Johannes Lachner
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2509.19696: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.19696&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[426] Complexity-Regularized Proximal Policy Optimization
Luca Serfilippi, Giorgio Franceschelli, Antonio Corradi, Mirco Musolesi
Main category: cs.AI
TL;DR: Paper ID 2509.20509 summary unavailable due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to abstract fetch failureMethod: Unable to determine method due to abstract fetch failure
Result: Unable to determine results due to abstract fetch failure
Conclusion: Unable to determine conclusion due to abstract fetch failure
Abstract: Failed to fetch summary for 2509.20509: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.20509&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[427] Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer
Simon Schrodi, Elias Kempf, Fazl Barez, Thomas Brox
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot determine conclusion as paper content could not be retrieved
Abstract: Failed to fetch summary for 2509.23886: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.23886&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[428] MachaGrasp: Morphology-Aware Cross-Embodiment Dexterous Hand Articulation Generation for Grasping
Heng Zhang, Kevin Yuchen Ma, Mike Zheng Shou, Weisi Lin, Yan Wu
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2510.06068: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.06068&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[429] CBF-RL: Safety Filtering Reinforcement Learning in Training with Control Barrier Functions
Lizhi Yang, Blake Werner, Massimiliano de Sa, Aaron D. Ames
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2510.14959: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.14959&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[430] GhostEI-Bench: Do Mobile Agents Resilience to Environmental Injection in Dynamic On-Device Environments?
Chiyu Chen, Xinhao Song, Yunkai Chai, Yang Yao, Haodong Zhao, Lijun Li, Jie Li, Yan Teng, Gongshen Liu, Yingchun Wang
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2510.20333: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.20333&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[431] LLEMA: Evolutionary Search with LLMs for Multi-Objective Materials Discovery
Nikhil Abhyankar, Sanchit Kabra, Saaketh Desai, Chandan K. Reddy
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2510.22503: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.22503&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[432] FMint-SDE: A Multimodal Foundation Model for Accelerating Numerical Simulation of SDEs via Error Correction
Jiaxin Yuan, Haizhao Yang, Maria Cameron
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper detailsMethod: Unable to determine method due to API rate limiting preventing access to paper details
Result: Unable to determine results due to API rate limiting preventing access to paper details
Conclusion: Unable to determine conclusion due to API rate limiting preventing access to paper details
Abstract: Failed to fetch summary for 2510.27173: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.27173&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[433] CytoNet: A Foundation Model for the Human Cerebral Cortex at Cellular Resolution
Christian Schiffer, Zeynep Boztoprak, Jan-Oliver Kropp, Julia Thönnißen, Katia Berr, Hannah Spitzer, Katrin Amunts, Timo Dickscheid
Main category: cs.AI
TL;DR: Paper ID 2511.01870 - Unable to fetch abstract due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to missing abstract contentMethod: Cannot determine method due to missing abstract content
Result: Cannot determine results due to missing abstract content
Conclusion: Cannot determine conclusion due to missing abstract content
Abstract: Failed to fetch summary for 2511.01870: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.01870&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[434] RefAgent: A Multi-agent LLM-based Framework for Automatic Software Refactoring
Khouloud Oueslati, Maxime Lamothe, Foutse Khomh
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to draw conclusions due to access error
Abstract: Failed to fetch summary for 2511.03153: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.03153&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[435] CycleChemist: A Dual-Pronged Machine Learning Framework for Organic Photovoltaic Discovery
Hou Hei Lam, Jiangjie Qiu, Xiuyuan Hu, Wentao Li, Fankun Zeng, Siwei Fu, Hao Zhang, Xiaonan Wang
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation as paper content could not be retrievedMethod: Unable to determine method as paper content could not be retrieved
Result: Unable to determine results as paper content could not be retrieved
Conclusion: Unable to determine conclusion as paper content could not be retrieved
Abstract: Failed to fetch summary for 2511.19500: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.19500&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[436] Deep FlexQP: Accelerated Nonlinear Programming via Deep Unfolding
Alex Oshin, Rahul Vodeb Ghosh, Augustinos D. Saravanos, Evangelos A. Theodorou
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusion due to lack of paper content
Abstract: Failed to fetch summary for 2512.01565: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.01565&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[437] Guided Flow Policy: Learning from High-Value Actions in Offline Reinforcement Learning
Franki Nguimatsia Tiofack, Théotime Le Hellard, Fabian Schramm, Nicolas Perrin-Gilbert, Justin Carpentier
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2512.03973: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.03973&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[438] Bootstrapped Mixed Rewards for RL Post-Training: Injecting Canonical Action Order
Prakhar Gupta, Vaibhav Gupta
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2512.04277 appears to be an arXiv paper from December 2024.
Details
Motivation: Cannot determine motivation without access to the paper content due to rate limiting error.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot draw conclusions without access to the paper content.
Abstract: Failed to fetch summary for 2512.04277: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.04277&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[439] Sparse Attention Post-Training for Mechanistic Interpretability
Florent Draye, Anson Lei, Hsiao-Ru Pan, Ingmar Posner, Bernhard Schölkopf
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2512.05865: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.05865&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[440] Yukthi Opus: A Multi-Chain Hybrid Metaheuristic for Large-Scale NP-Hard Optimization
SB Danush Vikraman, Hannah Abigail, Prasanna Kesavraj, Gajanan V Honnavar
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot determine conclusion as paper content could not be retrieved
Abstract: Failed to fetch summary for 2601.01832: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.01832&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[441] Controlled LLM Training on Spectral Sphere
Tian Xie, Haoming Luo, Haoyu Tang, Yiwen Hu, Jason Klein Liu, Qingnan Ren, Yang Wang, Wayne Xin Zhao, Rui Yan, Bing Su, Chong Luo, Baining Guo
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2601.08393: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.08393&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[442] “What if she doesn’t feel the same?” What Happens When We Ask AI for Relationship Advice
Niva Manchanda, Akshata Kishore Moharir, Ratna Kandala
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to missing paper contentMethod: Unable to determine method due to missing paper content
Result: Unable to determine results due to missing paper content
Conclusion: Unable to draw conclusions due to missing paper content
Abstract: Failed to fetch summary for 2601.11527: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.11527&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[443] ButterflyMoE: Sub-Linear Ternary Experts via Structured Butterfly Orbits
Aryan Karmore
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access limitationsMethod: Unable to determine method due to access limitations
Result: Unable to determine results due to access limitations
Conclusion: Unable to determine conclusion due to access limitations
Abstract: Failed to fetch summary for 2601.13563: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.13563&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[444] A Scalable Inter-edge Correlation Modeling in CopulaGNN for Link Sign Prediction
Jinkyu Sung, Myunggeum Jee, Joonseok Lee
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to failed paper fetchMethod: Cannot determine method due to failed paper fetch
Result: Cannot determine results due to failed paper fetch
Conclusion: Cannot determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2601.19175: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.19175&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[445] Mobility-Embedded POIs: Learning What A Place Is and How It Is Used from Human Movement
Maria Despoina Siampou, Shushman Choudhury, Shang-Ling Hsu, Neha Arora, Cyrus Shahabi
Main category: cs.AI
TL;DR: Failed to fetch summary for paper 2601.21149 due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to missing abstract contentMethod: Unable to determine method due to missing abstract content
Result: Unable to determine results due to missing abstract content
Conclusion: Unable to determine conclusion due to missing abstract content
Abstract: Failed to fetch summary for 2601.21149: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.21149&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[446] YuriiFormer: A Suite of Nesterov-Accelerated Transformers
Aleksandr Zimin, Yury Polyanskiy, Philippe Rigollet
Main category: cs.AI
TL;DR: Unable to analyze paper 2601.23236 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract retrieval failed due to rate limiting (HTTP 429)Method: Cannot determine method as abstract retrieval failed due to rate limiting (HTTP 429)
Result: Cannot determine results as abstract retrieval failed due to rate limiting (HTTP 429)
Conclusion: Cannot draw conclusions about paper content due to technical retrieval issues
Abstract: Failed to fetch summary for 2601.23236: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.23236&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[447] Towards Exploratory and Focused Manipulation with Bimanual Active Perception: A New Problem, Benchmark and Strategy
Yuxin He, Ruihao Zhang, Tianao Shen, Cheng Liu, Qiang Nie
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2602.01939: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.01939&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[448] On the Non-Identifiability of Steering Vectors in Large Language Models
Sohan Venkatesh, Ashish Mahendran Kurapath
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2602.06801: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.06801&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[449] Supervised Metric Regularization Through Alternating Optimization for Multi-Regime Physics-Informed Neural Networks
Enzo Nicolas Spotorno, Josafat Ribeiro Leal, Antonio Augusto Frohlich
Main category: cs.AI
TL;DR: Unable to analyze paper 2602.09980 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation without access to the paper abstractMethod: Cannot determine method without access to the paper abstract
Result: Cannot determine results without access to the paper abstract
Conclusion: Cannot draw conclusions without access to the paper abstract
Abstract: Failed to fetch summary for 2602.09980: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.09980&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[450] Empirical Stability Analysis of Kolmogorov-Arnold Networks in Hard-Constrained Recurrent Physics-Informed Discovery
Enzo Nicolas Spotorno, Josafat Leal Filho, Antonio Augusto Medeiros Frohlich
Main category: cs.AI
TL;DR: Failed to fetch paper summary - HTTP 429 error indicates rate limiting from arXiv API
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: Cannot analyze method as paper content is unavailable due to API rate limiting
Result: No results available - arXiv API returned HTTP 429 (Too Many Requests) error
Conclusion: Cannot draw conclusions about paper content due to technical limitations in accessing the abstract
Abstract: Failed to fetch summary for 2602.09988: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.09988&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[451] Zombie Agents: Persistent Control of Self-Evolving LLM Agents via Self-Reinforcing Injections
Xianglin Yang, Yufei He, Shuo Ji, Bryan Hooi, Jin Song Dong
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2602.15654: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.15654&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[452] SubQuad: Near-Quadratic-Free Structure Inference with Distribution-Balanced Objectives in Adaptive Receptor framework
Rong Fu, Zijian Zhang, Kun Liu, Jiekai Wu, Xianda Li, Simon Fong
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.17330: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.17330&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[453] Curriculum Learning for Efficient Chain-of-Thought Distillation via Structure-Aware Masking and GRPO
Bowen Yu, Maolin Wang, Sheng Zhang, Binhao Wang, Yi Wen, Jingtong Gao, Bowen Liu, Zimo Zhao, Wanyu Wang, Xiangyu Zhao
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2602.17686: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.17686&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[454] Give Users the Wheel: Towards Promptable Recommendation Paradigm
Fuyuan Lyu, Chenglin Luo, Qiyuan Zhang, Yupeng Hou, Haolun Wu, Xing Tang, Xue Liu, Jin L.C. Guo, Xiuqiang He
Main category: cs.AI
TL;DR: Unable to analyze paper 2602.18929 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract is unavailableMethod: Cannot determine method as abstract is unavailable
Result: Cannot determine results as abstract is unavailable
Conclusion: Cannot draw conclusions as paper content is inaccessible
Abstract: Failed to fetch summary for 2602.18929: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.18929&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[455] On Imbalanced Regression with Hoeffding Trees
Pantia-Marina Alchirch, Dimitrios I. Diochnos
Main category: cs.AI
TL;DR: Paper analysis unavailable due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Unable to determine motivation as paper content could not be retrievedMethod: Unable to determine method as paper content could not be retrieved
Result: Unable to determine results as paper content could not be retrieved
Conclusion: Unable to draw conclusions as paper content could not be retrieved
Abstract: Failed to fetch summary for 2602.22101: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22101&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[456] Zatom-1: A Multimodal Flow Foundation Model for 3D Molecules and Materials
Alex Morehead, Miruna Cretu, Antonia Panescu, Rishabh Anand, Maurice Weiler, Tynan Perez, Samuel Blau, Steven Farrell, Wahid Bhimji, Anubhav Jain, Hrushikesh Sahasrabuddhe, Pietro Lio, Tommi Jaakkola, Rafael Gomez-Bombarelli, Rex Ying, N. Benjamin Erichson, Michael W. Mahoney
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.22251: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22251&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[457] MatRIS: Toward Reliable and Efficient Pretrained Machine Learning Interatomic Potentials
Yuanchang Zhou, Siyu Hu, Xiangyu Zhang, Hongyu Wang, Guangming Tan, Weile Jia
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2603.02002 suggests it’s from March 2023, but without the abstract content, I cannot analyze its relevance to multimodal LLMs with audio/vision focus.
Details
Motivation: Cannot determine motivation without access to the paper abstract.Method: Cannot determine method without access to the paper abstract.
Result: Cannot determine results without access to the paper abstract.
Conclusion: Cannot draw conclusions without access to the paper abstract.
Abstract: Failed to fetch summary for 2603.02002: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.02002&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[458] Interpretable Multimodal Gesture Recognition for Drone and Mobile Robot Teleoperation via Log-Likelihood Ratio Fusion
Seungyeol Baek, Jaspreet Singh, Lala Shakti Swarup Ray, Hymalai Bello, Paul Lukowicz, Sungho Suh
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2602.23694 appears to be from February 2024, but no content is available for analysis.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot draw conclusions without access to the paper content.
Abstract: Failed to fetch summary for 2602.23694: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23694&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[459] Real Money, Fake Models: Deceptive Model Claims in Shadow APIs
Yage Zhang, Yukun Jiang, Zeyuan Chen, Michael Backes, Xinyue Shen, Yang Zhang
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2603.01919: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01919&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[460] AOI: Turning Failed Trajectories into Training Signals for Autonomous Cloud Diagnosis
Pei Yang, Wanyi Chen, Asuka Yuxi Zheng, Xueqian Li, Xiang Li, Haoqin Tu, Jie Xiao, Yifan Pang, Dongdong Zhang, Fuqiang Li, Alfred Long, Bill Shi, Lynn Ai, Eric Yang
Main category: cs.AI
TL;DR: Paper ID 2603.03378 could not be fetched due to HTTP 429 error (rate limiting), preventing analysis of its content and relevance assessment.
Details
Motivation: Unable to determine motivation due to failed data retrieval from arXiv API.Method: No method information available as the paper content could not be accessed.
Result: No results available due to technical limitations in accessing the paper.
Conclusion: Cannot draw conclusions about an inaccessible paper; technical issues prevent proper analysis.
Abstract: Failed to fetch summary for 2603.03378: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.03378&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[461] RADAR: Learning to Route with Asymmetry-aware DistAnce Representations
Hang Yi, Ziwei Huang, Yining Ma, Zhiguang Cao
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation due to failed API requestMethod: Cannot determine method due to failed API request
Result: Cannot determine results due to failed API request
Conclusion: Cannot draw conclusions due to failed API request
Abstract: Failed to fetch summary for 2603.03388: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.03388&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[462] Zero-Knowledge Proof (ZKP) Authentication for Offline CBDC Payment System Using IoT Devices
Santanu Mondal, T. Chithralekha
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2603.03804: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.03804&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[463] Measuring AI R&D Automation
Alan Chan, Ranay Padarath, Joe Kwon, Hilary Greaves, Markus Anderljung
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2603.03992: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.03992&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.SD
[464] When Denoising Hinders: Revisiting Zero-Shot ASR with SAM-Audio and Whisper
Akif Islam, Raufun Nahar, Md. Ekramul Hamid
Main category: cs.SD
TL;DR: SAM-Audio speech enhancement degrades zero-shot ASR performance despite improving audio quality, revealing a mismatch between human perception and machine recognition.
Details
Motivation: To test the common assumption that improving perceptual audio quality should directly benefit ASR accuracy, particularly for modern zero-shot ASR systems like Whisper.Method: Systematic empirical study using SAM-Audio as preprocessing for Whisper across multiple model variants and two noisy datasets (Bengali YouTube corpus and English noisy dataset), with objective PSNR analysis and utterance-level error analysis.
Result: SAM-Audio preprocessing consistently degrades ASR performance (increases WER and CER) despite substantial signal-level quality improvements, with errors worsening as Whisper model size increases.
Conclusion: Perceptually cleaner audio is not necessarily better for machine recognition, highlighting risks of blindly applying state-of-the-art denoising in zero-shot ASR pipelines.
Abstract: Recent advances in automatic speech recognition (ASR) and speech enhancement have led to a widespread assumption that improving perceptual audio quality should directly benefit recognition accuracy. In this work, we rigorously examine whether this assumption holds for modern zero-shot ASR systems. We present a systematic empirical study on the impact of Segment Anything Model Audio by Meta AI, a recent foundation-scale speech enhancement model proposed by Meta, when used as a preprocessing step for zero-shot transcription with Whisper. Experiments are conducted across multiple Whisper model variants and two linguistically distinct noisy speech datasets: a real-world Bengali YouTube corpus and a publicly available English noisy dataset. Contrary to common intuition, our results show that SAM-Audio preprocessing consistently degrades ASR performance, increasing both Word Error Rate (WER) and Character Error Rate (CER) compared to raw noisy speech, despite substantial improvements in signal-level quality. Objective Peak Signal-to-Noise Ratio analysis on the English dataset confirms that SAM-Audio produces acoustically cleaner signals, yet this improvement fails to translate into recognition gains. Therefore, we conducted a detailed utterance-level analysis to understand this counterintuitive result. We found that the recognition degradation is a systematic issue affecting the majority of the audio, not just isolated outliers, and that the errors worsen as the Whisper model size increases. These findings expose a fundamental mismatch: audio that is perceptually cleaner to human listeners is not necessarily robust for machine recognition. This highlights the risk of blindly applying state-of-the-art denoising as a preprocessing step in zero-shot ASR pipelines.
[465] WhisperAlign: Word-Boundary-Aware ASR and WhisperX-Anchored Pyannote Diarization for Long-Form Bengali Speech
Aurchi Chowdhury, Rubaiyat -E-Zaman, Sk. Ashrafuzzaman Nafees
Main category: cs.SD
TL;DR: Fine-tuned Whisper-based ASR with timestamped chunking and domain-adapted Pyannote diarization for Bengali long-form multi-speaker audio, reducing WER and DER in low-resource settings.
Details
Motivation: Address challenges in Bengali long-form speech recognition and speaker diarization, including voice activity detection, overlapping speech, and context preservation in low-resource settings.Method: Used whisper-timestamped for intelligent audio chunking to feed precise segments into fine-tuned acoustic model; integrated pyannote.audio and WhisperX pipeline with domain-specific fine-tuning of Pyannote segmentation model on Bengali conversational data.
Result: Significantly reduced Word Error Rate (WER) and Diarization Error Rate (DER) through timestamped chunking for ASR and targeted segmentation fine-tuning for diarization.
Conclusion: Intelligent timestamped chunking and domain-specific segmentation fine-tuning effectively improve long-form speech recognition and diarization performance in low-resource Bengali audio processing.
Abstract: This paper presents our solution for the DL Sprint 4.0, addressing the dual challenges of Bengali Long-Form Speech Recognition (Task 1) and Speaker Diarization (Task 2). Processing long-form, multi-speaker Bengali audio introduces significant hurdles in voice activity detection, overlapping speech, and context preservation. To solve the long-form transcription challenge, we implemented a robust audio chunking strategy utilizing whisper-timestamped, allowing us to feed precise, context-aware segments into our fine-tuned acoustic model for high-accuracy transcription. For the diarization task, we developed an integrated pipeline leveraging pyannote.audio and WhisperX. A key contribution of our approach is the domain-specific fine-tuning of the Pyannote segmentation model on the competition dataset. This adaptation allowed the model to better capture the nuances of Bengali conversational dynamics and accurately resolve complex, overlapping speaker boundaries. Our methodology demonstrates that applying intelligent timestamped chunking to ASR and targeted segmentation fine-tuning to diarization significantly drives down Word Error Rate (WER) and Diarization Error Rate (DER), in low-resource settings.
[466] Focus Then Listen: Exploring Plug-and-Play Audio Enhancer for Noise-Robust Large Audio Language Models
Han Yin, Yang Xiao, Younghoo Kwon, Ting Dang, Jung-Woo Choi
Main category: cs.SD
TL;DR: FTL is a plug-and-play audio enhancer that improves noise robustness of Large Audio Language Models by separating speech/non-speech, routing based on instructions, and generating task-adaptive enhanced signals.
Details
Motivation: Existing Large Audio Language Models degrade significantly in real-world noisy conditions where speech and non-speech sounds interfere. While noise-aware fine-tuning can help, it requires task-specific noisy data and expensive retraining, limiting scalability.Method: Proposes Focus-Then-Listen (FTL): 1) Separates input waveform into speech and non-speech components, 2) Uses a modality router to predict target audio modality based on user instruction, 3) Applies a modality-aware fusion block to generate task-adaptive enhanced signal for improved downstream perception and reasoning.
Result: Experiments across multiple LALMs and tasks show that FTL improves performance across different noise levels without requiring fine-tuning on the LALMs themselves.
Conclusion: FTL provides an effective plug-and-play solution for improving noise robustness in audio language models without the need for expensive retraining or task-specific noisy data.
Abstract: Large audio language models (LALMs) are a class of foundation models for audio understanding. Existing LALMs tend to degrade significantly in real-world noisy acoustic conditions where speech and non-speech sounds interfere. While noise-aware fine-tuning can improve robustness, it requires task-specific noisy data and expensive retraining, limiting scalability. To address this issue, we propose Focus-Then-Listen (FTL), a plug-and-play audio enhancer that improves LALMs’ noise robustness. Specifically, FTL first separates the input waveform into speech and non-speech, and a modality router is applied to predict the target audio modality (e.g., speech) based on the user’s instruction. Finally, a modality-aware fusion block generates a task-adaptive enhanced signal for improved downstream perception and reasoning. Experiments across multiple LALMs and tasks show that FTL improves performance across different noise levels without fine-tuning on LALMs.
[467] The First Environmental Sound Deepfake Detection Challenge: Benchmarking Robustness, Evaluation, and Insights
Han Yin, Yang Xiao, Rohan Kumar Das, Jisheng Bai, Ting Dang
Main category: cs.SD
TL;DR: The paper introduces the first Environmental Sound Deepfake Detection (ESDD) challenge to address the underexplored problem of detecting fake environmental sounds, presenting task formulation, dataset, evaluation, and analysis of top-performing systems.
Details
Motivation: As audio generation technology advances, highly realistic environmental soundscapes can be misused to create deceptive content (fake alarms, gunshots, crowd sounds), raising public safety concerns. While speech/singing deepfake detection has been studied, environmental sound deepfake detection remains underexplored.Method: Organized the first ESDD challenge with 97 registered teams and 1,748 submissions. Presented comprehensive task formulation, dataset construction, evaluation protocols, baseline systems, and analyzed common architectural choices and training strategies among top-performing systems.
Result: The challenge successfully attracted significant participation (97 teams, 1,748 submissions). The paper provides key insights from challenge results, analyzes effective approaches used by top performers, and identifies promising research directions for ESDD.
Conclusion: The ESDD challenge establishes a foundation for environmental sound deepfake detection research, highlighting the importance of this emerging field for public safety and trust. The paper outlines future research directions and open problems to guide subsequent studies.
Abstract: Recent progress in audio generation has made it increasingly easy to create highly realistic environmental soundscapes, which can be misused to produce deceptive content, such as fake alarms, gunshots, and crowd sounds, raising concerns for public safety and trust. While deepfake detection for speech and singing voice has been extensively studied, environmental sound deepfake detection (ESDD) remains underexplored. To advance ESDD, the first edition of the ESDD challenge was launched, attracting 97 registered teams and receiving 1,748 valid submissions. This paper presents the task formulation, dataset construction, evaluation protocols, baseline systems, and key insights from the challenge results. Furthermore, we analyze common architectural choices and training strategies among top-performing systems. Finally, we discuss potential future research directions for ESDD, outlining key opportunities and open problems to guide subsequent studies in this field.
[468] Training Dynamics-Aware Multi-Factor Curriculum Learning for Target Speaker Extraction
Yun Liu, Xuechen Liu, Xiaoxiao Miao, Junichi Yamagishi
Main category: cs.SD
TL;DR: A novel curriculum learning approach for target speaker extraction that combines multi-factor scheduling with data-driven visualization to improve performance in challenging real-world scenarios.
Details
Motivation: Real-world target speaker extraction performance degrades due to complex interactions between factors like SNR, speaker count, and overlap ratios. Previous curriculum learning approaches address factors separately with predefined difficulty assumptions that don't align with actual model learning behavior.Method: Two-stage approach: 1) Multi-factor curriculum learning that jointly schedules SNR thresholds, speaker counts, overlap ratios, and synthetic/real proportions for progressive learning; 2) TSE-Datamap visualization framework that tracks confidence and variability across training epochs to identify easy-to-learn, ambiguous, and hard-to-learn data regions for data-driven curriculum design.
Result: The methods improve extraction results over random sampling, with particularly strong gains in challenging multi-speaker scenarios. The visualization reveals three characteristic data regions that guide curriculum design.
Conclusion: Data-driven curriculum learning grounded in observed training dynamics outperforms predefined difficulty assumptions, especially for complex multi-speaker extraction tasks.
Abstract: Target speaker extraction (TSE) aims to isolate a specific speaker’s voice from multi-speaker mixtures. Despite strong benchmark results, real-world performance often degrades due to different interacting factors. Previous curriculum learning approaches for TSE typically address these factors separately, failing to capture their complex interactions and relying on predefined difficulty factors that may not align with actual model learning behavior. To address this challenge, we first propose a multi-factor curriculum learning strategy that jointly schedules SNR thresholds, speaker counts, overlap ratios, and synthetic/real proportions, enabling progressive learning from simple to complex scenarios. However, determining optimal scheduling without predefined assumptions remains challenging. We therefore introduce TSE-Datamap, a visualization framework that grounds curriculum design in observed training dynamics by tracking confidence and variability across training epochs. Our analysis reveals three characteristic data regions: (i) easy-to-learn examples where models consistently perform well, (ii) ambiguous examples where models oscillate between alternative predictions, and (iii) hard-to-learn examples where models persistently struggle. Guided by these data-driven insights, our methods improve extraction results over random sampling, with particularly strong gains in challenging multi-speaker scenarios.
[469] TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling
Hao-Hui Xie, Ho-Lam Chung, Yi-Cheng Lin, Ke-Han Lu, Wenze Ren, Xie Chen, Hung-yi Lee
Main category: cs.SD
TL;DR: TW-Sound580K: A Taiwanese audio-text instruction dataset created via Verify-Generate-Critique protocol, used to train Tai-LALM model that improves dialectal prosody understanding in audio-language models.
Details
Motivation: Large Audio-Language Models struggle with localized dialectal prosody due to scarcity of specialized corpora, particularly for regional dialects like Taiwanese.Method: Developed TW-Sound580K dataset using Verify-Generate-Critique protocol with Dual-ASR validation, then trained Tai-LALM by fine-tuning DeSTA 2.5-Audio backbone with dynamic Dual-ASR Arbitration strategy for transcription selection.
Result: Tai-LALM achieves 49.1% accuracy on TAU Benchmark, representing 6.5% absolute improvement over zero-shot baseline (42.6% with ASR text conditioning).
Conclusion: Integrating regional corpora with rigorous curation and dynamic arbitration significantly enhances LALM performance on localized speech understanding.
Abstract: Large Audio-Language Models (LALMs) typically struggle with localized dialectal prosody due to the scarcity of specialized corpora. We present TW-Sound580K, a Taiwanese audio-text instruction dataset developed through a Verify-Generate-Critique (VGC) protocol. This pipeline leverages Dual-ASR validation to filter 522K raw clips, subsequently expanding them into 580,000 high-fidelity instruction pairs using a teacher model. The dataset’s utility is demonstrated through Tai-LALM, which fine-tunes a DeSTA 2.5-Audio-initialized backbone and incorporates a dynamic Dual-ASR Arbitration strategy to optimize transcription selection during inference. On the TAU Benchmark, Tai-LALM reaches 49.1% accuracy, marking a 6.5% absolute improvement over the zero-shot baseline (42.6% with ASR text conditioning). This confirms that integrating regional corpora with rigorous curation and dynamic arbitration significantly enhances LALM performance on localized speech.
[470] Boosting ASR Robustness via Test-Time Reinforcement Learning with Audio-Text Semantic Rewards
Linghan Fang, Tianxin Xie, Li Liu
Main category: cs.SD
TL;DR: ASR-TRA: A test-time reinforcement adaptation framework for ASR that uses temperature-controlled stochastic decoding and audio-text semantic alignment rewards to improve robustness to noisy environments and accents without ground-truth labels.
Details
Motivation: Current ASR systems like Whisper show high accuracy but remain sensitive to real-world distribution shifts (noise, accents). Existing test-time adaptation methods rely on pseudo-labeling or entropy minimization, which can reinforce high-confidence errors through confirmation bias.Method: Proposes ASR-TRA: a test-time reinforcement adaptation framework with causal intervention inspiration. Uses learnable decoder prompts and temperature-controlled stochastic decoding to generate diverse transcription candidates. Scores candidates with a reward model measuring audio-text semantic alignment, then updates model and prompt parameters via reinforcement learning.
Result: Comprehensive experiments on LibriSpeech with synthetic noise and L2 Arctic accented English datasets show higher accuracy with lower latency than existing TTA baselines. Ablation studies confirm effectiveness of combining audio and language-based rewards, demonstrating enhanced stability and interpretability.
Conclusion: ASR-TRA provides a practical and robust solution for deploying ASR systems in challenging real-world conditions by overcoming confirmation bias in traditional TTA methods through reinforcement learning with semantic alignment rewards.
Abstract: Recently, Automatic Speech Recognition (ASR) systems (e.g., Whisper) have achieved remarkable accuracy improvements but remain highly sensitive to real-world unseen data (data with large distribution shifts), including noisy environments and diverse accents. To address this issue, test-time adaptation (TTA) has shown great potential in improving the model adaptability at inference time without ground-truth labels, and existing TTA methods often rely on pseudo-labeling or entropy minimization. However, by treating model confidence as a learning signal, these methods may reinforce high-confidence errors, leading to confirmation bias that undermines adaptation. To overcome these limitations, we present ASR-TRA, a novel Test-time Reinforcement Adaptation framework inspired by causal intervention. More precisely, our method introduces a learnable decoder prompt and utilizes temperature-controlled stochastic decoding to generate diverse transcription candidates. These are scored by a reward model that measures audio-text semantic alignment, and the resulting feedback is used to update both model and prompt parameters via reinforcement learning. Comprehensive experiments on LibriSpeech with synthetic noise and L2 Arctic accented English datasets demonstrate that our method achieves higher accuracy while maintaining lower latency than existing TTA baselines. Ablation studies further confirm the effectiveness of combining audio and language-based rewards, highlighting our method’s enhanced stability and interpretability. Overall, our approach provides a practical and robust solution for deploying ASR systems in challenging real-world conditions.
[471] SLICE: Speech Enhancement via Layer-wise Injection of Conditioning Embeddings
Seokhoon Moon, Kyudan Jung, Jaegul Choo
Main category: cs.SD
TL;DR: Diffusion-based speech enhancement method that injects degradation conditioning into timestep embeddings for better handling of compound corruptions (noise, reverberation, distortion).
Details
Motivation: Real-world speech often suffers from multiple simultaneous degradations, but current diffusion-based enhancement methods struggle with compound corruptions. Existing noise-aware approaches that only inject conditioning at the input layer can degrade performance below unconditioned models.Method: Proposes injecting degradation conditioning derived from a pretrained encoder with multi-task heads (for noise type, reverberation, and distortion) into the timestep embedding, allowing conditioning to propagate through all residual blocks without architectural changes.
Result: In controlled experiments, input-level conditioning performs worse than no encoder at all on compound degradations, while the proposed layer-wise injection achieves the best results. The method also generalizes to diverse real-world recordings.
Conclusion: Injecting degradation conditioning into timestep embeddings rather than just at the input layer significantly improves diffusion-based speech enhancement for compound corruptions and generalizes well to real-world scenarios.
Abstract: Real-world speech is often corrupted by multiple degradations simultaneously, including additive noise, reverberation, and nonlinear distortion. Diffusion-based enhancement methods perform well on single degradations but struggle with compound corruptions. Prior noise-aware approaches inject conditioning at the input layer only, which can degrade performance below that of an unconditioned model. To address this, we propose injecting degradation conditioning, derived from a pretrained encoder with multi-task heads for noise type, reverberation, and distortion, into the timestep embedding so that it propagates through all residual blocks without architectural changes. In controlled experiments where only the injection method varies, input-level conditioning performs worse than no encoder at all on compound degradations, while layer-wise injection achieves the best results. The method also generalizes to diverse real-world recordings.
[472] Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection
Junchuan Zhao, Minh Duc Vu, Ye Wang
Main category: cs.SD
TL;DR: MSpoof-TTS: A training-free inference framework that improves zero-shot speech synthesis quality using multi-resolution spoof guidance to detect and correct token-level artifacts in neural codec language models.
Details
Motivation: Neural codec language models for speech synthesis suffer from token-level artifacts and distributional drift during inference, degrading perceptual realism. Existing solutions like preference optimization or retraining are costly, so the authors seek a training-free approach to improve zero-shot synthesis quality.Method: Proposes MSpoof-TTS with a Multi-Resolution Token-based Spoof Detection framework that evaluates codec sequences at different temporal granularities to detect locally inconsistent patterns. Uses hierarchical decoding with spoof detectors to progressively prune low-quality candidates and re-rank hypotheses, enhancing robustness without modifying model parameters.
Result: Experiments validate the framework’s effectiveness for robust and high-quality codec-based speech generation, improving perceptual realism without requiring model retraining or parameter updates.
Conclusion: MSpoof-TTS provides a practical training-free solution to enhance speech synthesis quality by addressing token-level artifacts through multi-resolution spoof guidance, making it suitable for real-world deployment of neural codec language models.
Abstract: Neural codec language models enable high-quality discrete speech synthesis, yet their inference remains vulnerable to token-level artifacts and distributional drift that degrade perceptual realism. Rather than relying on preference optimization or retraining, we propose MSpoof-TTS, a training-free inference framework that improves zero-shot synthesis through multi-resolution spoof guidance. We introduce a Multi-Resolution Token-based Spoof Detection framework that evaluates codec sequences at different temporal granularities to detect locally inconsistent or unnatural patterns. We then integrate the spoof detectors into a hierarchical decoding strategy, progressively pruning low-quality candidates and re-ranking hypotheses. This discriminator-guided generation enhances robustness without modifying model parameters. Experiments validate the effectiveness of our framework for robust and high-quality codec-based speech generation.
[473] Latent-Mark: An Audio Watermark Robust to Neural Resynthesis
Yen-Shan Chen, Shih-Yu Lai, Ying-Jung Tsou, Yi-Cheng Lin, Bing-Yu Chen, Yun-Nung Chen, Hung-Yi Lee, Shang-Tse Chen
Main category: cs.SD
TL;DR: Latent-Mark: A zero-bit audio watermarking framework that survives neural audio codec compression by embedding watermarks in codec-invariant latent spaces, with cross-codec optimization for robustness across different codecs.
Details
Motivation: Existing audio watermarking methods are robust against traditional DSP attacks but vulnerable to neural resynthesis from modern neural audio codecs, which discard imperceptible waveform variations used in prior methods.Method: Proposes Latent-Mark framework that embeds watermarks within codec’s invariant latent space by optimizing audio waveforms to induce detectable directional shifts in encoded latent representations. Uses Cross-Codec Optimization across multiple surrogate codecs to target shared latent invariants and prevent overfitting to single codec quantization rules.
Result: Achieves robust zero-shot transferability to unseen neural codecs, state-of-the-art resilience against traditional DSP attacks while preserving perceptual imperceptibility.
Conclusion: Latent-Mark addresses vulnerability to neural resynthesis and inspires future universal watermarking frameworks for maintaining integrity across complex generative distortions.
Abstract: While existing audio watermarking techniques have achieved strong robustness against traditional digital signal processing (DSP) attacks, they remain vulnerable to neural resynthesis. This occurs because modern neural audio codecs act as semantic filters and discard the imperceptible waveform variations used in prior watermarking methods. To address this limitation, we propose Latent-Mark, the first zero-bit audio watermarking framework designed to survive semantic compression. Our key insight is that robustness to the encode-decode process requires embedding the watermark within the codec’s invariant latent space. We achieve this by optimizing the audio waveform to induce a detectable directional shift in its encoded latent representation, while constraining perturbations to align with the natural audio manifold to ensure imperceptibility. To prevent overfitting to a single codec’s quantization rules, we introduce Cross-Codec Optimization, jointly optimizing the waveform across multiple surrogate codecs to target shared latent invariants. Extensive evaluations demonstrate robust zero-shot transferability to unseen neural codecs, achieving state-of-the-art resilience against traditional DSP attacks while preserving perceptual imperceptibility. Our work inspires future research into universal watermarking frameworks capable of maintaining integrity across increasingly complex and diverse generative distortions.
[474] Building Enterprise Realtime Voice Agents from Scratch: A Technical Tutorial
Jielin Qiu, Zixiang Chen, Liangwei Yang, Ming Zhu, Zhiwei Liu, Juntao Tan, Wenting Zhao, Rithesh Murthy, Roshan Ram, Akshara Prabhakar, Shelby Heinecke, Caiming Xiong, Silvio Savarese, Huan Wang
Main category: cs.SD
TL;DR: A tutorial on building enterprise-grade realtime voice agents using streaming cascaded pipeline (STT → LLM → TTS) instead of native speech-to-speech models, achieving sub-second latency.
Details
Motivation: Despite numerous open-source speech-to-speech models and voice agent frameworks, there's no comprehensive resource explaining the complete pipeline for building realtime voice agents with function calling capabilities from individual components.Method: Systematic investigation reveals native speech-to-speech models are too slow (~13s latency), so they adopt industry-standard cascaded streaming pipeline: streaming STT (Deepgram) → streaming LLM with function calling (vLLM-served) → streaming TTS (ElevenLabs).
Result: Achieved P50 time-to-first-audio of 947ms (best case 729ms) with cloud LLM APIs, and comparable latency with self-hosted vLLM on NVIDIA A10G GPU, significantly faster than native speech-to-speech models.
Conclusion: The key to realtime voice agents is streaming and pipelining across components rather than any single fast model; the tutorial provides complete working codebase for enterprise-grade implementation.
Abstract: We present a technical tutorial for building enterprise-grade realtime voice agents from first principles. While over 25 open-source speech-to-speech models and numerous voice agent frameworks exist, no single resource explains the complete pipeline from individual components to a working streaming voice agent with function calling capabilities. Through systematic investigation, we find that (1) native speech-to-speech models like Qwen2.5-Omni, while capable of high-quality audio generation, are too slow for realtime interaction ($\sim$13s time-to-first-audio); (2) the industry-standard approach uses a cascaded streaming pipeline: STT $\rightarrow$ LLM $\rightarrow$ TTS, where each component streams its output to the next; and (3) the key to ``realtime’’ is not any single fast model but rather \textit{streaming and pipelining} across components. We build a complete voice agent using Deepgram (streaming STT), vLLM-served LLMs with function calling (streaming text generation), and ElevenLabs (streaming TTS), achieving a measured P50 time-to-first-audio of 947ms (best case 729ms) with cloud LLM APIs, and comparable latency with self-hosted vLLM on NVIDIA A10G GPU. We release the full codebase as a tutorial with working, tested code for every component.
[475] Vevo2: A Unified and Controllable Framework for Speech and Singing Voice Generation
Xueyao Zhang, Junan Zhang, Yuancheng Wang, Chaoren Wang, Yuanzhe Chen, Dongya Jia, Zhuo Chen, Zhizheng Wu
Main category: cs.SD
TL;DR: Vevo2 is a unified framework for controllable speech and singing voice generation using audio tokenizers and multi-stage modeling for prosody, style, and timbre control.
Details
Motivation: Controllable human voice generation for expressive domains like singing remains challenging due to scarcity of annotated singing data and need for flexible controllability over prosody, melody, and style.Method: Introduces two audio tokenizers: (1) music-notation-free prosody tokenizer capturing prosody/melody from speech/singing/instruments, and (2) content-style tokenizer encoding linguistic content, prosody, and style with timbre disentanglement. Uses auto-regressive content-style modeling for text/prosody/style control and flow-matching acoustic modeling for timbre control, with explicit/implicit prosody learning strategies and multi-objective post-training.
Result: Unified modeling brings mutual benefits to both speech and singing voice generation, with effectiveness across synthesis, conversion, and editing tasks demonstrating strong generalization and versatility.
Conclusion: Vevo2 provides a versatile framework for controllable speech and singing voice generation with strong performance across multiple tasks, though audio samples need verification at provided URL.
Abstract: Controllable human voice generation, particularly for expressive domains like singing, remains a significant challenge. This paper introduces Vevo2, a unified framework for controllable speech and singing voice generation. To tackle issues like the scarcity of annotated singing data and to enable flexible controllability, Vevo2 introduces two audio tokenizers: (1) a unified music-notation-free prosody tokenizer that captures prosody and melody from speech, singing, and even instrumental sounds, and (2) a unified content-style tokenizer that encodes linguistic content, prosody, and style for both speech and singing, while enabling timbre disentanglement. Vevo2 consists of an auto-regressive (AR) content-style modeling stage, which aims to enable controllability over text, prosody, and style, as well as a flow-matching acoustic modeling stage that allows for timbre control. Particularly, during the speech-singing joint training of the AR model, we propose both explicit and implicit prosody learning strategies to bridge speech and singing voice. Moreover, to further enhance the Vevo2’s ability to follow text and prosody, we design a multi-objective post-training task that integrates both intelligibility and prosody similarity alignment. Experimental results show that the unified modeling in Vevo2 brings mutual benefits to both speech and singing voice generation. Additionally, Vevo2’s effectiveness across a wide range of synthesis, conversion, and editing tasks for both speech and singing further demonstrates its strong generalization ability and versatility. Audio samples are are available at https://versasinger.github.io/.
[476] TSPC: A Two-Stage Phoneme-Centric Architecture for code-switching Vietnamese-English Speech Recognition
Tran Nguyen Anh, Truong Dinh Dung, Vo Van Nam, Minh N. H. Nguyen
Main category: cs.SD
TL;DR: Proposes TSPC, a two-stage phoneme-centric model for Vietnamese-English code-switching ASR that uses extended Vietnamese phoneme set as intermediate representation, achieving state-of-the-art performance with reduced computational resources.
Details
Motivation: Code-switching presents significant challenges for ASR systems, especially for language pairs like Vietnamese-English with distinct phonological features and sound recognition ambiguity. Existing methods fail to capture subtle phonological shifts in CS scenarios.Method: Two-Stage Phoneme-Centric model (TSPC) uses extended Vietnamese phoneme set as intermediate representation for mixed-lingual modeling. The two-stage architecture enables phoneme adaptation and language conversion while maintaining efficiency under low computational-resource constraints.
Result: TSPC outperforms existing baselines including PhoWhisper-base, achieving significantly lower word error rate of 19.06% with reduced training resources. The phonetic-based architecture enhances ASR performance in complex CS Vietnamese-English scenarios.
Conclusion: The phoneme-centric approach with extended Vietnamese phoneme set as intermediate representation effectively addresses Vietnamese-English code-switching ASR challenges, offering improved performance with computational efficiency.
Abstract: Code-switching (CS) presents a significant challenge for general Auto-Speech Recognition (ASR) systems. Existing methods often fail to capture the sub tle phonological shifts inherent in CS scenarios. The challenge is particu larly difficult for language pairs like Vietnamese and English, where both distinct phonological features and the ambiguity arising from similar sound recognition are present. In this paper, we propose a novel architecture for Vietnamese-English CS ASR, a Two-Stage Phoneme-Centric model (TSPC). TSPC adopts a phoneme-centric approach based on an extended Vietnamese phoneme set as an intermediate representation for mixed-lingual modeling, while remaining efficient under low computational-resource constraints. Ex perimental results demonstrate that TSPC consistently outperforms exist ing baselines, including PhoWhisper-base, in Vietnamese-English CS ASR, achieving a significantly lower word error rate of 19.06% with reduced train ing resources. Furthermore, the phonetic-based two-stage architecture en ables phoneme adaptation and language conversion to enhance ASR perfor mance in complex CS Vietnamese-English ASR scenarios.
[477] SAM: A Mamba-2 State-Space Audio-Language Model
Taehan Lee, Jaehan Jung, Hyukjun Lee
Main category: cs.SD
TL;DR: SAM is a state-space audio-language model combining audio encoder with Mamba-2 backbone, achieving competitive performance with fewer parameters and providing systematic analysis of SSM-audio interactions.
Details
Motivation: To develop efficient audio-language models using state-space models (SSMs) as scalable backbones, addressing the parameter inefficiency of transformer-based models while maintaining strong performance on audio understanding tasks.Method: Integrates audio encoder with Mamba-2 backbone (SSM architecture), explores joint audio encoder finetuning, analyzes token representation characteristics, and incorporates instruction-following supervision for improved reasoning.
Result: SAM-2.7B achieves 21.1 mAP on AudioSet and 17.6 SPICE on AudioCaps, matching/surpassing larger 7B transformer models. Key findings: joint finetuning essential, compact audio tokens optimal, instruction-following boosts reasoning from 22.8 to 56.8 MMAU-Sound accuracy.
Conclusion: SSMs serve as strong, scalable backbones for audio-language models with practical design principles: joint encoder finetuning, compact token representations, and instruction-following supervision are crucial for optimal performance.
Abstract: We present SAM, a State-space Audio-language Model that integrates an audio encoder with a Mamba-2 backbone. SAM-2.7B achieves 21.1 mAP on AudioSet and 17.6 SPICE on AudioCaps, matching or surpassing larger 7B transformer-based models with fewer parameters. We further provide the first systematic, representation-level analysis of how SSMs interact with audio encoder outputs: (1) joint audio encoder finetuning is essential, supported by accuracy gains and observed adaptation of token representation rank and similarity across different SSM sizes; (2) despite linear scaling, SSMs benefit more from compact, information-rich audio token representations than from excessively long token sequences; and (3) incorporating instruction-following supervision substantially improves reasoning ability, boosting MMAU-Sound accuracy from 22.8 to 56.8. Through comprehensive experiments and analysis, we establish practical design principles for SSMs as strong, scalable backbones for audio-language models.
[478] Noise-to-Notes: Diffusion-based Generation and Refinement for Automatic Drum Transcription
Michael Yeung, Keisuke Toyama, Toya Teramoto, Shusuke Takahashi, Tamaki Kojima
Main category: cs.SD
TL;DR: N2N redefines automatic drum transcription as a conditional generative task using diffusion models to transform audio-conditioned noise into drum events with velocities, achieving state-of-the-art results with music foundation model features.
Details
Motivation: Traditional ADT approaches treat drum transcription as a discriminative task, but this work aims to leverage the advantages of generative diffusion models including flexible speed-accuracy trade-offs and strong inpainting capabilities for improved drum transcription.Method: Proposes Noise-to-Notes (N2N), a diffusion-based framework that transforms audio-conditioned Gaussian noise into drum events with velocities. Uses Annealed Pseudo-Huber loss for joint optimization of binary onset and continuous velocity values, and incorporates features from music foundation models to augment spectrogram features.
Result: N2N establishes new state-of-the-art performance across multiple ADT benchmarks. Including MFM features significantly improves robustness to out-of-domain drum audio.
Conclusion: Redefining ADT as a conditional generative task with diffusion modeling is effective, and incorporating high-level semantic features from music foundation models enhances performance and robustness in drum transcription.
Abstract: Automatic drum transcription (ADT) is traditionally formulated as a discriminative task to predict drum events from audio spectrograms. In this work, we redefine ADT as a conditional generative task and introduce Noise-to-Notes (N2N), a framework leveraging diffusion modeling to transform audio-conditioned Gaussian noise into drum events with associated velocities. This generative diffusion approach offers distinct advantages, including a flexible speed-accuracy trade-off and strong inpainting capabilities. However, the generation of binary onset and continuous velocity values presents a challenge for diffusion models, and to overcome this, we introduce an Annealed Pseudo-Huber loss to facilitate effective joint optimization. Finally, to augment low-level spectrogram features, we propose incorporating features extracted from music foundation models (MFMs), which capture high-level semantic information and enhance robustness to out-of-domain drum audio. Experimental results demonstrate that including MFM features significantly improves robustness and N2N establishes a new state-of-the-art performance across multiple ADT benchmarks.
[479] Schrödinger Bridge Mamba for One-Step Speech Enhancement
Jing Yang, Sirui Wang, Chao Wu, Lei Guo, Fan Fan
Main category: cs.SD
TL;DR: SBM integrates Schrödinger Bridge training with Mamba architecture for efficient speech enhancement, achieving state-of-the-art performance with single-step inference and competitive real-time factor.
Details
Motivation: To develop an efficient speech enhancement model that combines the benefits of Schrödinger Bridge training (which enables single-step inference) with the computational efficiency of the Mamba architecture for real-time streaming applications.Method: Proposes Schrödinger Bridge Mamba (SBM) that integrates the Schrödinger Bridge training paradigm with the Mamba architecture. The SB paradigm enables single-step inference while Mamba provides efficient sequence modeling. The model is evaluated on joint denoising and dereverberation tasks.
Result: SBM outperforms strong generative and discriminative methods on multiple metrics with only one step of inference while achieving competitive real-time factor for streaming feasibility. Ablation studies show SB paradigm consistently improves performance across architectures, and Mamba performs better under SB than MHSA and LSTM backbones.
Conclusion: The synergy between Mamba architecture and SB trajectory-based training provides a high-quality solution for real-world speech enhancement, offering efficient single-step inference with strong performance.
Abstract: We present Schrödinger Bridge Mamba (SBM), a novel model for efficient speech enhancement by integrating the Schrödinger Bridge (SB) training paradigm and the Mamba architecture. Experiments of joint denoising and dereverberation tasks demonstrate SBM outperforms strong generative and discriminative methods on multiple metrics with only one step of inference while achieving a competitive real-time factor for streaming feasibility. Ablation studies reveal that the SB paradigm consistently yields improved performance across diverse architectures over conventional mapping. Furthermore, Mamba exhibits a stronger performance under the SB paradigm compared to Multi-Head Self-Attention (MHSA) and Long Short-Term Memory (LSTM) backbones. These findings highlight the synergy between the Mamba architecture and the SB trajectory-based training, providing a high-quality solution for real-world speech enhancement. Demo page: https://sbmse.github.io
[480] Multi-Loss Learning for Speech Emotion Recognition with Energy-Adaptive Mixup and Frame-Level Attention
Cong Wang, Yizhong Geng, Yuhua Wen, Qifei Li, Yingming Gao, Ruimin Wang, Chunfeng Wang, Hao Li, Ya Li, Wei Chen
Main category: cs.SD
TL;DR: Proposes multi-loss learning framework with energy-adaptive mixup and frame-level attention for speech emotion recognition, achieving SOTA results on multiple datasets.
Details
Motivation: Speech emotion recognition is important for human-computer interaction but challenging due to emotional complexity and scarce annotated data. Need to address class imbalance and improve feature learning.Method: Multi-loss learning framework integrating energy-adaptive mixup (EAM) for SNR-based augmentation and frame-level attention module (FLAM) for feature extraction. Combines KL divergence, focal, center, and supervised contrastive losses.
Result: Achieves state-of-the-art performance on four SER datasets: IEMOCAP, MSP-IMPROV, RAVDESS, and SAVEE, demonstrating effectiveness and robustness.
Conclusion: The proposed MLL framework with EAM and FLAM effectively addresses SER challenges, improving performance through better augmentation, feature extraction, and multi-loss optimization.
Abstract: Speech emotion recognition (SER) is an important technology in human-computer interaction. However, achieving high performance is challenging due to emotional complexity and scarce annotated data. To tackle these challenges, we propose a multi-loss learning (MLL) framework integrating an energy-adaptive mixup (EAM) method and a frame-level attention module (FLAM). The EAM method leverages SNR-based augmentation to generate diverse speech samples capturing subtle emotional variations. FLAM enhances frame-level feature extraction for multi-frame emotional cues. Our MLL strategy combines Kullback-Leibler divergence, focal, center, and supervised contrastive loss to optimize learning, address class imbalance, and improve feature separability. We evaluate our method on four widely used SER datasets: IEMOCAP, MSP-IMPROV, RAVDESS, and SAVEE. The results demonstrate our method achieves state-of-the-art performance, suggesting its effectiveness and robustness.
[481] RA-QA: A Benchmarking System for Respiratory Audio Question Answering Under Real-World Heterogeneity
Gaia A. Bertolino, Yuwei Zhang, Tong Xia, Domenico Talia, Cecilia Mascolo
Main category: cs.SD
TL;DR: RA-QA benchmark for respiratory audio question answering with 9M QA pairs, standardized pipeline, and evaluation of multimodal models under real-world heterogeneity.
Details
Motivation: Need robust benchmarks for conversational multimodal AI tools in healthcare, especially for respiratory audio QA which is underexplored and lacks evaluation under realistic conditions with heterogeneity across modalities, devices, and question types.Method: Introduces Respiratory-Audio Question-Answering (RA-QA) benchmark with standardized data generation pipeline, comprehensive multimodal QA collection, and unified evaluation protocol. Harmonizes public respiratory audio datasets into 9 million format-diverse QA pairs covering diagnostic and contextual attributes.
Result: Benchmarks classical ML baselines alongside multimodal audio-language models, establishing reproducible reference points and showing how current approaches fail under heterogeneity conditions.
Conclusion: RA-QA provides a needed benchmark for respiratory audio QA that exposes failure modes under realistic heterogeneous conditions, enabling better evaluation of multimodal AI tools in healthcare applications.
Abstract: As conversational multimodal AI tools are increasingly adopted to process patient data for health assessment, robust benchmarks are needed to measure progress and expose failure modes under realistic conditions. Despite the importance of respiratory audio for mobile health screening, respiratory audio question answering remains underexplored, with existing studies evaluated narrowly and lacking real-world heterogeneity across modalities, devices, and question types. We hence introduce the Respiratory-Audio Question-Answering (RA-QA) benchmark, including a standardized data generation pipeline, a comprehensive multimodal QA collection, and a unified evaluation protocol. RA-QA harmonizes public RA datasets into a collection of 9 million format-diverse QA pairs covering diagnostic and contextual attributes. We benchmark classical ML baselines alongside multimodal audio-language models, establishing reproducible reference points and showing how current approaches fail under heterogeneity.
[482] MultiAPI Spoof: A Multi-API Dataset and Local-Attention Network for Speech Anti-spoofing Detection
Xueping Zhang, Zhenshan Zhang, Yechen Wang, Linxi Li, Liwei Jin, Ming Li
Main category: cs.SD
TL;DR: MultiAPI Spoof dataset with 230 hours of synthetic speech from 30 APIs, plus Nes2Net-LA model for improved audio anti-spoofing and API tracing
Details
Motivation: Existing speech anti-spoofing benchmarks use narrow sets of public models, creating a gap from real-world scenarios where commercial systems use diverse proprietary APIsMethod: Created MultiAPI Spoof dataset with synthetic speech from 30 APIs (commercial services, open-source models, online platforms). Proposed Nes2Net-LA, a local-attention enhanced variant of Nes2Net for better local context modeling and fine-grained spoofing feature extraction
Result: Nes2Net-LA achieves state-of-the-art performance with superior robustness, especially under diverse and unseen spoofing conditions. Dataset enables API tracing task for fine-grained attribution of spoofed audio to its generation source
Conclusion: MultiAPI Spoof dataset addresses the gap in existing benchmarks, and Nes2Net-LA provides effective anti-spoofing with API tracing capability for real-world scenarios
Abstract: Existing speech anti-spoofing benchmarks rely on a narrow set of public models, creating a substantial gap from real-world scenarios in which commercial systems employ diverse, often proprietary APIs. To address this issue, we introduce MultiAPI Spoof, a multi-API audio anti-spoofing dataset comprising about 230 hours of synthetic speech generated by 30 distinct APIs, including commercial services, open-source models, and online platforms. Furthermore, we propose Nes2Net-LA, a local-attention enhanced variant of Nes2Net that improves local context modeling and fine-grained spoofing feature extraction. Based on this dataset, we also define the API tracing task, enabling fine-grained attribution of spoofed audio to its generation source. Experiments show that Nes2Net-LA achieves state-of-the-art performance and offers superior robustness, particularly under diverse and unseen spoofing conditions. Code \footnote{https://github.com/XuepingZhang/MultiAPI-Spoof} and dataset \footnote{https://xuepingzhang.github.io/MultiAPI-Spoof-Dataset/} have been released.
[483] Fine-grained Soundscape Control for Augmented Hearing
Seunghyun Oh, Malek Itani, Aseem Gauri, Shyamnath Gollakota
Main category: cs.SD
TL;DR: Aurchestra enables fine-grained, real-time soundscape control on hearables by extracting and mixing multiple simultaneous sound sources independently, allowing users to customize per-class volumes like an audio engineer.
Details
Motivation: Current hearables offer only blunt sound controls (global noise suppression or single-target focus), but real-world acoustic scenes contain many simultaneous sources that users may want to adjust independently.Method: Two key components: (1) dynamic interface that surfaces only active sound classes, and (2) real-time, on-device multi-output extraction network that generates separate streams for each selected class, optimized for compute-limited platforms with 6 ms streaming audio chunks.
Result: System achieves robust performance for up to 5 overlapping target sounds, enables expressive per-class sound control, and shows substantial improvements in target-class enhancement and interference suppression across real-world indoor/outdoor scenarios.
Conclusion: The world need not be heard as a single undifferentiated stream; with Aurchestra, the soundscape becomes truly programmable through fine-grained, real-time control of multiple simultaneous sound sources.
Abstract: Hearables are becoming ubiquitous, yet their sound controls remain blunt: users can either enable global noise suppression or focus on a single target sound. Real-world acoustic scenes, however, contain many simultaneous sources that users may want to adjust independently. We introduce Aurchestra, the first system to provide fine-grained, real-time soundscape control on resource-constrained hearables. Our system has two key components: (1) a dynamic interface that surfaces only active sound classes and (2) a real-time, on-device multi-output extraction network that generates separate streams for each selected class, achieving robust performance for upto 5 overlapping target sounds, and letting users mix their environment by customizing per-class volumes, much like an audio engineer mixes tracks. We optimize the model architecture for multiple compute-limited platforms and demonstrate real-time performance on 6 ms streaming audio chunks. Across real-world environments in previously unseen indoor and outdoor scenarios, our system enables expressive per-class sound control and achieves substantial improvements in target-class enhancement and interference suppression. Our results show that the world need not be heard as a single, undifferentiated stream: with Aurchestra, the soundscape becomes truly programmable.
cs.LG
[484] Decorrelating the Future: Joint Frequency Domain Learning for Spatio-temporal Forecasting
Zepu Wang, Bowen Liao, Jeff, Ban
Main category: cs.LG
TL;DR: FreST Loss: A frequency-enhanced spatio-temporal training objective that uses Joint Fourier Transform to align predictions with ground truth in unified spectral domain, improving graph-structured signal forecasting by capturing complex spatio-temporal dependencies.
Details
Motivation: Standard forecasting models use point-wise objectives like MSE that fail to capture complex spatio-temporal dependencies in graph-structured signals. Existing frequency-domain approaches address temporal autocorrelation but overlook spatial and cross spatio-temporal interactions.Method: Proposes FreST Loss, a frequency-enhanced spatio-temporal training objective that extends supervision to the joint spatio-temporal spectrum using Joint Fourier Transform (JFT). This aligns model predictions with ground truth in a unified spectral domain to decorrelate complex dependencies across both space and time.
Result: Extensive experiments on six real-world datasets show FreST Loss is model-agnostic and consistently improves state-of-the-art baselines by better capturing holistic spatio-temporal dynamics. Theoretical analysis shows it reduces estimation bias associated with time-domain training objectives.
Conclusion: FreST Loss effectively addresses limitations of existing approaches by capturing both spatial and temporal dependencies through joint spectral supervision, leading to improved forecasting performance for graph-structured signals.
Abstract: Standard direct forecasting models typically rely on point-wise objectives such as Mean Squared Error, which fail to capture the complex spatio-temporal dependencies inherent in graph-structured signals. While recent frequency-domain approaches such as FreDF mitigate temporal autocorrelation, they often overlook spatial and cross spatio-temporal interactions. To address this limitation, we propose FreST Loss, a frequency-enhanced spatio-temporal training objective that extends supervision to the joint spatio-temporal spectrum. By leveraging the Joint Fourier Transform (JFT), FreST Loss aligns model predictions with ground truth in a unified spectral domain, effectively decorrelating complex dependencies across both space and time. Theoretical analysis shows that this formulation reduces estimation bias associated with time-domain training objectives. Extensive experiments on six real-world datasets demonstrate that FreST Loss is model-agnostic and consistently improves state-of-the-art baselines by better capturing holistic spatio-temporal dynamics.
[485] Machine Learning for Complex Systems Dynamics: Detecting Bifurcations in Dynamical Systems with Deep Neural Networks
Swadesh Pal, Roderick Melnik
Main category: cs.LG
TL;DR: EINNs use deep neural networks to detect critical transitions by learning parameter landscapes from equilibrium states, offering a computationally efficient alternative to traditional bifurcation analysis.
Details
Motivation: Critical transitions (tipping points) in complex systems are important but computationally expensive to detect using traditional forward simulations or bifurcation analysis. There's a need for more efficient methods to identify critical thresholds associated with regime shifts.Method: Proposes equilibrium-informed neural networks (EINNs) that reverse the typical approach: instead of fixing parameters and searching for solutions, EINNs take candidate equilibrium states as inputs and train a DNN to infer corresponding system parameters that satisfy equilibrium conditions.
Result: EINNs can effectively detect critical thresholds by analyzing learned parameter landscapes and observing abrupt changes in feasibility/continuity of equilibrium mappings. Demonstrated on nonlinear systems with saddle-node bifurcations and multi-stability, recovering parameter regions associated with impending transitions.
Conclusion: EINNs provide a flexible, computationally efficient alternative to traditional techniques for detecting critical transitions, offering new insights into early detection and structure of tipping points in high-dimensional nonlinear systems.
Abstract: Critical transitions are the abrupt shifts between qualitatively different states of a system, and they are crucial to understanding tipping points in complex dynamical systems across ecology, climate science, and biology. Detecting these shifts typically involves extensive forward simulations or bifurcation analyses, which are often computationally intensive and limited by parameter sampling. In this study, we propose a novel machine learning approach based on deep neural networks (DNNs) called equilibrium-informed neural networks (EINNs) to identify critical thresholds associated with catastrophic regime shifts. Rather than fixing parameters and searching for solutions, the EINN method reverses this process by using candidate equilibrium states as inputs and training a DNN to infer the corresponding system parameters that satisfy the equilibrium condition. By analyzing the learned parameter landscape and observing abrupt changes in the feasibility or continuity of equilibrium mappings, critical thresholds can be effectively detected. We demonstrate this capability on nonlinear systems exhibiting saddle-node bifurcations and multi-stability, showing that EINNs can recover the parameter regions associated with impending transitions. This method provides a flexible alternative to traditional techniques, offering new insights into the early detection and structure of critical shifts in high-dimensional and nonlinear systems.
[486] FedEMA-Distill: Exponential Moving Average Guided Knowledge Distillation for Robust Federated Learning
Hamza Reguieg, Mohamed El Kamili, Essaid Sabir
Main category: cs.LG
TL;DR: FedEMA-Distill: A federated learning method using server-side EMA and knowledge distillation from client logits on a proxy dataset for efficient, robust training with heterogeneous models and adversarial clients.
Details
Motivation: Federated learning suffers from performance degradation with heterogeneous non-IID client data and adversarial clients, causing client drift, slow convergence, and high communication overhead.Method: Server-side procedure combining exponential moving average (EMA) of global model with ensemble knowledge distillation from client-uploaded prediction logits on a small public proxy dataset. Clients run standard local training, upload only compressed logits, and can use different model architectures.
Result: Improves top-1 accuracy by up to +5% on CIFAR-10 and +6% on CIFAR-100, reaches target accuracy in 30-35% fewer communication rounds, reduces per-round client uplink payloads to 0.09-0.46 MB (10x less than full weights), and stabilizes training with up to 10-20% Byzantine clients.
Conclusion: Coupling temporal smoothing with logits-only aggregation provides communication-efficient, attack-resilient FL pipeline that is deployment-friendly and compatible with secure aggregation and differential privacy.
Abstract: Federated learning (FL) often degrades when clients hold heterogeneous non-Independent and Identically Distributed (non-IID) data and when some clients behave adversarially, leading to client drift, slow convergence, and high communication overhead. This paper proposes FedEMA-Distill, a server-side procedure that combines an exponential moving average (EMA) of the global model with ensemble knowledge distillation from client-uploaded prediction logits evaluated on a small public proxy dataset. Clients run standard local training, upload only compressed logits, and may use different model architectures, so no changes are required to client-side software while still supporting model heterogeneity across devices. Experiments on CIFAR-10, CIFAR-100, FEMNIST, and AG News under Dirichlet-0.1 label skew show that FedEMA-Distill improves top-1 accuracy by several percentage points (up to +5% on CIFAR-10 and +6% on CIFAR-100) over representative baselines, reaches a given target accuracy in 30-35% fewer communication rounds, and reduces per-round client uplink payloads to 0.09-0.46 MB, i.e., roughly an order of magnitude less than transmitting full model weights. Using coordinate-wise median or trimmed-mean aggregation of logits at the server further stabilizes training in the presence of up to 10-20% Byzantine clients and yields well-calibrated predictions under attack. These results indicate that coupling temporal smoothing with logits-only aggregation provides a communication-efficient and attack-resilient FL pipeline that is deployment-friendly and compatible with secure aggregation and differential privacy, since only aggregated or obfuscated model outputs are exchanged.
[487] Delta-Crosscoder: Robust Crosscoder Model Diffing in Narrow Fine-Tuning Regimes
Aly Kassem, Thomas Jiralerspong, Negar Rostamzadeh, Golnoosh Farnadi
Main category: cs.LG
TL;DR: Delta-Crosscoder improves model diffing by using sparsity and delta-based loss to better identify fine-tuning changes, especially for narrow fine-tuning scenarios.
Details
Motivation: Existing model diffing methods like Crosscoders struggle with narrow fine-tuning where behavioral changes are localized and asymmetric, requiring better techniques to isolate specific changes.Method: Combines BatchTopK sparsity with delta-based loss that prioritizes directions that change between models, plus implicit contrastive signal from paired activations on matched inputs.
Result: Outperforms SAE-based baselines while matching non-SAE-based methods across 10 model organisms including synthetic false facts, emergent misalignment, subliminal learning, and taboo word guessing.
Conclusion: Crosscoders remain powerful for model diffing, and Delta-Crosscoder reliably isolates latent directions causally responsible for fine-tuned behaviors, enabling effective mitigation.
Abstract: Model diffing methods aim to identify how fine-tuning changes a model’s internal representations. Crosscoders approach this by learning shared dictionaries of interpretable latent directions between base and fine-tuned models. However, existing formulations struggle with narrow fine-tuning, where behavioral changes are localized and asymmetric. We introduce Delta-Crosscoder, which combines BatchTopK sparsity with a delta-based loss prioritizing directions that change between models, plus an implicit contrastive signal from paired activations on matched inputs. Evaluated across 10 model organisms, including synthetic false facts, emergent misalignment, subliminal learning, and taboo word guessing (Gemma, LLaMA, Qwen; 1B-9B parameters), Delta-Crosscoder reliably isolates latent directions causally responsible for fine-tuned behaviors and enables effective mitigation, outperforming SAE-based baselines, while matching the Non-SAE-based. Our results demonstrate that crosscoders remain a powerful tool for model diffing.
[488] Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection
Hengshuai Yao, Guan Wang
Main category: cs.LG
TL;DR: Asymmetric attention reduces dimensionality for queries and keys while keeping values high-dimensional, achieving significant memory savings with minimal performance loss.
Details
Motivation: Standard transformer attention uses identical dimensions for queries, keys, and values, but these components serve different roles: queries and keys produce scalar attention weights (selection), while values carry rich semantic representations (value transfer). Selection is inherently lower-dimensional than value transfer.Method: Proposes asymmetric attention where queries and keys have lower dimensionality than values. Validates through seven experiments including positional selection tasks, content-based retrieval, language modeling on WikiText datasets, SVD compression of GPT-2, LLaMA model testing, and Mistral-7B compression with QK fine-tuning.
Result: Achieves 75% reduction in QK parameters with only 4.3% perplexity increase on language modeling. SVD compression followed by QK fine-tuning achieves 75% key cache savings with <2% residual quality cost. For 7B-parameter models serving 128K context, saves 25GB KV cache per user, enabling ~60% more concurrent users.
Conclusion: Asymmetric attention is more efficient than symmetric attention, as selection (queries/keys) requires fewer dimensions than value transfer. This enables significant memory savings for KV caches in large language models with minimal performance degradation.
Abstract: Standard transformer attention uses identical dimensionality for queries, keys, and values ($d_q = d_k = d_v = \dmodel$). Our insight is that these components serve fundamentally different roles, and this symmetry is unnecessary. Queries and keys produce scalar attention weights (\emph{selection}), while values carry rich semantic representations (\emph{value transfer}). We argue that selection is an inherently lower-dimensional operation than value transfer, requiring only $\BigO(\log N)$ dimensions to distinguish among $N$ relevant patterns. We validate this hypothesis across seven experiments: (1)positional selection tasks requiring just 1dimension per head, (2)~content-based retrieval requiring $\sim!\log_2 N$ dimensions, (3–4)~WikiText-2 and WikiText-103 language modeling where $\dselect = \dmodel/4$ incurs only 4.3% perplexity increase while reducing QK parameters by 75%, (5)~post-training SVD compression of GPT-2, revealing keys to be far more compressible than queries, with lightweight QK fine-tuning recovering nearly all quality loss, (6)~a 125M-parameter LLaMA model confirming identical degradation ratios across architectures, and (7)~Mistral-7B (7.2B parameters), where SVD compression followed by QK fine-tuning achieves 75% key cache savings at just 2.0% residual quality cost. For existing models, SVD compression followed by QK fine-tuning (3 epochs on a small fraction of pretraining data) achieves 75% key cache savings at $<$2% residual quality cost. For a 7B-parameter model serving 128K context, asymmetric attention saves 25,GB of KV cache per user, enabling approximately 60% more concurrent users on the same GPU.
[489] Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices
Yakov Pyotr Shkolnikov
Main category: cs.LG
TL;DR: KV cache persistence system for multi-agent LLMs on edge devices using 4-bit quantization to disk, enabling direct cache restoration and eliminating redundant prefill computation.
Details
Motivation: Multi-agent LLM systems on edge devices face memory constraints where device RAM cannot hold all agents' KV caches simultaneously, forcing constant eviction and reloading with expensive full model prefill computations (15.7 seconds per agent at 4K context).Method: Three-component system: 1) Block pool providing per-agent isolated Q4 KV caches in safetensors format, 2) BatchQuantizedKVCache for concurrent inference over multiple agents’ quantized caches, 3) Cross-phase context injection accumulating attention state across conversation phases without recomputation.
Result: Cache restoration reduces time-to-first-token by up to 136x across models (Gemma: 22-136x, DeepSeek: 11-76x, Llama: 24-111x). Q4 quantization fits 4x more agent contexts into fixed memory. Perplexity impact: -0.7% for Gemma, +2.8% for Llama, +3.0% for DeepSeek.
Conclusion: The system enables efficient multi-agent LLM deployment on edge devices by persisting quantized KV caches to disk, dramatically reducing latency while maintaining acceptable model quality through 4-bit quantization.
Abstract: Multi-agent LLM systems on edge devices face a memory management problem: device RAM is too small to hold every agent’s KV cache simultaneously. On Apple M4 Pro with 10.2 GB of cache budget, only 3 agents fit at 8K context in FP16. A 10-agent workflow must constantly evict and reload caches. Without persistence, every eviction forces a full re-prefill through the model – 15.7 seconds per agent at 4K context. We address this by persisting each agent’s KV cache to disk in 4-bit quantized format and reloading it directly into the attention layer, eliminating redundant O(n) prefill computation via direct cache restoration. The system comprises three components: a block pool providing per-agent isolated Q4 KV caches in safetensors format, a BatchQuantizedKVCache for concurrent inference over multiple agents’ quantized caches, and cross-phase context injection that accumulates attention state across conversation phases without re-computation. Evaluated on three architectures (Gemma 3 12B, dense GQA, 48 layers; DeepSeek-Coder-V2-Lite 16B, MoE MLA, 27 layers; Llama 3.1 8B, dense GQA, 32 layers), cache restoration reduces time-to-first-token by up to 136x (Gemma: 22–136x at 4K–32K; DeepSeek: 11–76x at 4K–32K; Llama: 24–111x at 4K–16K; 3–10x at 1K). Q4 quantization fits 4x more agent contexts into fixed device memory than FP16. Perplexity measured with actual Q4 KV caches shows -0.7% for Gemma, +2.8% for Llama, and +3.0% for DeepSeek. Open-source at https://github.com/yshk-mxim/agent-memory
[490] Flowers: A Warp Drive for Neural PDE Solvers
Till Muser, Alexandra Spitzer, Matti Lassas, Maarten V. de Hoop, Ivan Dokmanić
Main category: cs.LG
TL;DR: Flowers is a novel neural architecture for learning PDE solution operators using multihead warps instead of Fourier multipliers, attention, or convolutions, achieving state-of-the-art performance on PDE benchmarks with linear computational cost.
Details
Motivation: The paper aims to develop efficient neural operators for solving PDEs that avoid traditional components like Fourier multipliers, dot-product attention, and convolutional mixing, which can be computationally expensive or limited in capturing certain physical phenomena.Method: Flowers uses multihead warps where each head predicts a displacement field and warps mixed input features. Displacements are predicted pointwise without spatial aggregation, and nonlocality enters only through sparse sampling at source coordinates (one per head). The architecture stacks warps in multiscale residual blocks to implement adaptive, global interactions at linear cost.
Result: Flowers achieves excellent performance on 2D and 3D time-dependent PDE benchmarks, particularly for flows and waves. A compact 17M-parameter model consistently outperforms Fourier, convolution, and attention-based baselines of similar size, while a 150M-parameter variant improves over recent transformer-based foundation models with more parameters, data, and training compute.
Conclusion: Flowers presents a novel, efficient neural operator architecture that leverages multihead warps for PDE solving, offering linear computational cost while achieving state-of-the-art performance across diverse PDE benchmarks, with strong theoretical motivation from physics principles.
Abstract: We introduce Flowers, a neural architecture for learning PDE solution operators built entirely from multihead warps. Aside from pointwise channel mixing and a multiscale scaffold, Flowers use no Fourier multipliers, no dot-product attention, and no convolutional mixing. Each head predicts a displacement field and warps the mixed input features. Motivated by physics and computational efficiency, displacements are predicted pointwise, without any spatial aggregation, and nonlocality enters \emph{only} through sparse sampling at source coordinates, \emph{one} per head. Stacking warps in multiscale residual blocks yields Flowers, which implement adaptive, global interactions at linear cost. We theoretically motivate this design through three complementary lenses: flow maps for conservation laws, waves in inhomogeneous media, and a kinetic-theoretic continuum limit. Flowers achieve excellent performance on a broad suite of 2D and 3D time-dependent PDE benchmarks, particularly flows and waves. A compact 17M-parameter model consistently outperforms Fourier, convolution, and attention-based baselines of similar size, while a 150M-parameter variant improves over recent transformer-based foundation models with much more parameters, data, and training compute.
[491] Uncertainty-Calibrated Spatiotemporal Field Diffusion with Sparse Supervision
Kevin Valencia, Xihaier Luo, Shinjae Yoo, David Keetae Park
Main category: cs.LG
TL;DR: SOLID: A mask-conditioned diffusion framework for spatiotemporal forecasting and reconstruction from sparse sensor observations only, without requiring dense training data.
Details
Motivation: Physical fields are typically observed at sparse, time-varying sensor locations, making forecasting and reconstruction ill-posed and uncertainty-critical. Existing methods often train on dense reanalysis or simulations but only test under sparsity, requiring pre-imputation or dense training data.Method: SOLID uses a mask-conditioned diffusion framework that learns spatiotemporal dynamics from sparse observations alone. It conditions each denoising step on measured values and their locations, and introduces a dual-masking objective that emphasizes learning in unobserved void regions while upweighting overlap pixels where inputs and targets provide reliable anchors.
Result: Achieves up to an order-of-magnitude improvement in probabilistic error and yields calibrated uncertainty maps (ρ > 0.7) under severe sparsity. The strict sparse-conditioning pathway enables posterior sampling of full fields consistent with measurements.
Conclusion: SOLID demonstrates that training end-to-end with sparse supervision only is feasible and effective for spatiotemporal forecasting and reconstruction, providing calibrated uncertainty quantification under severe observation sparsity.
Abstract: Physical fields are typically observed only at sparse, time-varying sensor locations, making forecasting and reconstruction ill-posed and uncertainty-critical. We present SOLID, a mask-conditioned diffusion framework that learns spatiotemporal dynamics from sparse observations alone: training and evaluation use only observed target locations, requiring no dense fields and no pre-imputation. Unlike prior work that trains on dense reanalysis or simulations and only tests under sparsity, SOLID is trained end-to-end with sparse supervision only. SOLID conditions each denoising step on the measured values and their locations, and introduces a dual-masking objective that (i) emphasizes learning in unobserved void regions while (ii) upweights overlap pixels where inputs and targets provide the most reliable anchors. This strict sparse-conditioning pathway enables posterior sampling of full fields consistent with the measurements, achieving up to an order-of-magnitude improvement in probabilistic error and yielding calibrated uncertainty maps (\r{ho} > 0.7) under severe sparsity.
[492] ZorBA: Zeroth-order Federated Fine-tuning of LLMs with Heterogeneous Block Activation
Chuiyang Meng, Ming Tang, Vincent W. S. Wong
Main category: cs.LG
TL;DR: ZorBA: A federated fine-tuning framework for LLMs using zeroth-order optimization and heterogeneous block activation to reduce VRAM usage and communication overhead.
Details
Motivation: Federated fine-tuning of large language models faces challenges: high VRAM usage due to local updates and significant communication overhead from frequent model exchanges.Method: Uses zeroth-order optimization (forward passes only, no gradient storage), heterogeneous block activation (different transformer block subsets per client), shared random seeds, and finite differences of gradients to reduce communication.
Result: Outperforms three federated fine-tuning baselines by up to 62.41% in VRAM usage reduction while maintaining low communication overhead.
Conclusion: ZorBA effectively addresses VRAM and communication challenges in federated LLM fine-tuning through zeroth-order optimization and intelligent block allocation.
Abstract: Federated fine-tuning of large language models (LLMs) enables collaborative tuning across distributed clients. However, due to the large size of LLMs, local updates in federated learning (FL) may incur substantial video random-access memory (VRAM) usage. Moreover, frequent model exchange may lead to significant communication overhead. To tackle these challenges, in this paper we propose ZorBA, a zeroth-order optimization-based federated fine-tuning framework with heterogeneous block activation. ZorBA leverages zeroth-order optimization to eliminate the storage of gradients at the clients by forward passes. ZorBA includes a heterogeneous block activation mechanism in which the central server allocates different subsets of transformer blocks to clients in order to accelerate the convergence rate and reduce the VRAM usage. Furthermore, ZorBA utilizes shared random seeds and the finite differences of gradients in order to reduce the communication overhead. We conduct theoretical analysis to characterize the effect of block activation decisions on the convergence rate and VRAM usage. To jointly enhance the convergence rate and reduce the VRAM usage, we formulate an optimization problem to optimize the block activation decisions. We propose an $ε$-constraint lexicographic algorithm to solve this problem. Experimental results show that ZorBA outperforms three federated fine-tuning baselines in VRAM usage by up to 62.41% and incurs a low communication overhead.
[493] ASFL: An Adaptive Model Splitting and Resource Allocation Framework for Split Federated Learning
Chuiyang Meng, Ming Tang, Vincent W. S. Wong
Main category: cs.LG
TL;DR: ASFL is an adaptive split federated learning framework for wireless networks that optimizes model splitting and resource allocation to reduce delay and energy consumption while maintaining learning performance.
Details
Motivation: Federated learning faces challenges with limited client computation resources causing high delay and energy consumption, especially in wireless network environments where resource constraints are critical.Method: Proposes ASFL framework that leverages central server computation, adaptive model splitting, and joint resource allocation optimization using theoretical convergence analysis and OOE-BCD algorithm.
Result: ASFL converges faster than baseline schemes and reduces total delay by up to 75% and energy consumption by up to 80% compared to five baseline approaches.
Conclusion: ASFL effectively addresses FL efficiency challenges in wireless networks through adaptive model splitting and optimized resource allocation, significantly improving both learning performance and resource efficiency.
Abstract: Federated learning (FL) enables multiple clients to collaboratively train a machine learning model without sharing their raw data. However, the limited computation resources of the clients may result in a high delay and energy consumption on training. In this paper, we propose an adaptive split federated learning (ASFL) framework over wireless networks. ASFL exploits the computation resources of the central server to train part of the model and enables adaptive model splitting as well as resource allocation during training. To optimize the learning performance (i.e., convergence rate) and efficiency (i.e., delay and energy consumption) of ASFL, we theoretically analyze the convergence rate and formulate a joint learning performance and resource allocation optimization problem. Solving this problem is challenging due to the long-term delay and energy consumption constraints as well as the coupling of the model splitting and resource allocation decisions. We propose an online optimization enhanced block coordinate descent (OOE-BCD) algorithm to solve the problem iteratively. Experimental results show that when compared with five baseline schemes, our proposed ASFL framework converges faster and reduces the total delay and energy consumption by up to 75% and 80%, respectively.
[494] WavSLM: Single-Stream Speech Language Modeling via WavLM Distillation
Luca Della Libera, Cem Subakan, Mirco Ravanelli
Main category: cs.LG
TL;DR: WavSLM is a speech language model that uses single-stream autoregressive training on quantized WavLM representations, jointly modeling semantic and acoustic information without text supervision.
Details
Motivation: Extending the successful autoregressive training paradigm from text to speech is challenging due to entangled semantic and acoustic information. Existing speech LMs rely on text supervision, hierarchical tokens, or complex architectures, departing from the simple single-stream generative pretraining that works well for text.Method: Quantize and distill self-supervised WavLM representations into a single codebook, then train with autoregressive next-chunk prediction objective. This creates a single token stream that jointly models semantic and acoustic information without text supervision.
Result: Achieves competitive performance on consistency benchmarks and speech generation while using fewer parameters, less training data, and supporting streaming inference.
Conclusion: WavSLM demonstrates that simple autoregressive training can be effectively extended to speech, jointly modeling semantic and acoustic information in a single stream without text supervision.
Abstract: Large language models show that simple autoregressive training can yield scalable and coherent generation, but extending this paradigm to speech remains challenging due to the entanglement of semantic and acoustic information. Most existing speech language models rely on text supervision, hierarchical token streams, or complex hybrid architectures, departing from the single-stream generative pretraining paradigm that has proven effective in text. In this work, we introduce WavSLM, a speech language model trained by quantizing and distilling self-supervised WavLM representations into a single codebook and optimizing an autoregressive next-chunk prediction objective. WavSLM jointly models semantic and acoustic information within a single token stream without text supervision or text pretraining. Despite its simplicity, it achieves competitive performance on consistency benchmarks and speech generation while using fewer parameters, less training data, and supporting streaming inference. Demo samples are available at https://lucadellalib.github.io/wavslm-web/.
[495] An Explainable Ensemble Framework for Alzheimer’s Disease Prediction Using Structured Clinical and Cognitive Data
Nishan Mitra
Main category: cs.LG
TL;DR: An explainable ensemble learning framework for Alzheimer’s disease classification using clinical, lifestyle, and metabolic features with ensemble algorithms and deep learning, achieving superior performance with ensemble methods over deep learning.
Details
Motivation: Early and accurate detection of Alzheimer's disease remains challenging due to its subtle onset and progressive nature, requiring reliable and transparent diagnostic approaches.Method: Explainable ensemble learning framework with rigorous preprocessing, advanced feature engineering, SMOTE-Tomek hybrid class balancing, and optimized modeling using five ensemble algorithms (Random Forest, XGBoost, LightGBM, CatBoost, Extra Trees) alongside a deep artificial neural network, with stratified validation to prevent leakage.
Result: Ensemble methods achieved superior performance over deep learning, with XGBoost, Random Forest, and Soft Voting showing the strongest accuracy, sensitivity, and F1-score profiles. SHAP and feature importance analysis identified MMSE, Functional Assessment Age, and engineered interaction features as most influential.
Conclusion: The proposed framework provides a reliable and transparent approach to Alzheimer’s disease prediction with strong potential for clinical decision support applications.
Abstract: Early and accurate detection of Alzheimer’s disease (AD) remains a major challenge in medical diagnosis due to its subtle onset and progressive nature. This research introduces an explainable ensemble learning Framework designed to classify individuals as Alzheimer’s or Non-Alzheimer’s using structured clinical, lifestyle, metabolic, and lifestyle features. The workflow incorporates rigorous preprocessing, advanced feature engineering, SMOTE-Tomek hybrid class balancing, and optimized modeling using five ensemble algorithms-Random Forest, XGBoost, LightGBM, CatBoost, and Extra Trees-alongside a deep artificial neural network. Model selection was performed using stratified validation to prevent leakage, and the best-performing model was evaluated on a fully unseen test set. Ensemble methods achieved superior performance over deep learning, with XGBoost, Random Forest, and Soft Voting showing the strongest accuracy, sensitivity, and F1-score profiles. Explainability techniques, including SHAP and feature importance analysis, highlighted MMSE, Functional Assessment Age, and several engineered interaction features as the most influential determinants. The results demonstrate that the proposed framework provides a reliable and transparent approach to Alzheimer’s disease prediction, offering strong potential for clinical decision support applications.
[496] Competitive Multi-Operator Reinforcement Learning for Joint Pricing and Fleet Rebalancing in AMoD Systems
Emil Kragh Toft, Carolin Schmidt, Daniele Gammelli, Filipe Rodrigues
Main category: cs.LG
TL;DR: Multi-operator reinforcement learning framework for competitive AMoD markets with pricing and fleet rebalancing policies, integrating discrete choice theory for passenger allocation
Details
Motivation: Realistic AMoD markets will be competitive with multiple operators, but existing reinforcement learning work fails to capture competitive market dynamics and strategic interactions between operatorsMethod: Multi-operator reinforcement learning framework where two operators simultaneously learn pricing and fleet rebalancing policies, integrating discrete choice theory to enable passenger allocation and demand competition to emerge endogenously from utility-maximizing decisions
Result: Experiments with real-world data show competition fundamentally alters learned behaviors, leading to lower prices and distinct fleet positioning patterns compared to monopolistic settings; learning-based approaches are robust to competition stochasticity
Conclusion: Competitive AMoD markets require multi-agent learning approaches; reinforcement learning can handle competitive dynamics and converge to effective policies despite partially observed competitor strategies
Abstract: Autonomous Mobility-on-Demand (AMoD) systems promise to revolutionize urban transportation by providing affordable on-demand services to meet growing travel demand. However, realistic AMoD markets will be competitive, with multiple operators competing for passengers through strategic pricing and fleet deployment. While reinforcement learning has shown promise in optimizing single-operator AMoD control, existing work fails to capture competitive market dynamics. We investigate the impact of competition on policy learning by introducing a multi-operator reinforcement learning framework where two operators simultaneously learn pricing and fleet rebalancing policies. By integrating discrete choice theory, we enable passenger allocation and demand competition to emerge endogenously from utility-maximizing decisions. Experiments using real-world data from multiple cities demonstrate that competition fundamentally alters learned behaviors, leading to lower prices and distinct fleet positioning patterns compared to monopolistic settings. Notably, we demonstrate that learning-based approaches are robust to the additional stochasticity of competition, with competitive agents successfully converging to effective policies while accounting for partially unobserved competitor strategies.
[497] On Emergences of Non-Classical Statistical Characteristics in Classical Neural Networks
Hanyu Zhao, Yang Wu, Yuexian Hou
Main category: cs.LG
TL;DR: NCnet: A classical neural architecture that exhibits quantum-like non-classical statistical behaviors, measured by CHSH inequality S statistic, arising from gradient competitions in multi-task learning.
Details
Motivation: Inspired by quantum mechanics concepts like measurement incompatibility and Bell inequalities, the authors aim to create a classical neural network that can stably exhibit non-classical statistical behaviors to provide novel insights into deep network training dynamics and internal interactions.Method: Proposed Non-Classical Network (NCnet), a simple classical neural architecture with multi-task heads sharing hidden layers. The non-classicality emerges from gradient competitions between neurons shared across tasks, measured using the S statistic from CHSH inequality under interpretable experimental setups.
Result: NCnet exhibits non-classical statistical behaviors where S statistic approaches classical upper-bound 2 in low-resource regimes, temporarily exceeds 2 near critical model scale, then asymptotically decays to fluctuate around 2. Non-classical correlations emerge without explicit communication links, and S positively correlates with generalization when model capacity is insufficient.
Conclusion: Non-classical statistics provide a novel perspective for understanding internal interactions and training dynamics in deep networks, with the regime where S first approaches 2 often corresponding to good generalization performance.
Abstract: Inspired by measurement incompatibility and Bell-family inequalities in quantum mechanics, we propose the Non-Classical Network (NCnet), a simple classical neural architecture that stably exhibits non-classical statistical behaviors under typical and interpretable experimental setups. We find non-classicality, measured by the $S$ statistic of CHSH inequality, arises from gradient competitions of hidden-layer neurons shared by multi-tasks. Remarkably, even without physical links supporting explicit communication, one task head can implicitly sense the training task of other task heads via local loss oscillations, leading to non-local correlations in their training outcomes. Specifically, in the low-resource regime, the value of $S$ increases gradually with increasing resources and approaches toward its classical upper-bound 2, which implies that underfitting is alleviated with resources increase. As the model nears the critical scale required for adequate performance, $S$ may temporarily exceed 2. As resources continue to grow, $S$ then asymptotically decays down to and fluctuates around 2. Empirically, when model capacity is insufficient, $S$ is positively correlated with generalization performance, and the regime where $S$ first approaches $2$ often corresponding to good generalization. Overall, our results suggest that non-classical statistics can provide a novel perspective for understanding internal interactions and training dynamics of deep networks.
[498] Overtone: Cyclic Patch Modulation for Clean, Efficient, and Flexible Physics Emulators
Payel Mukhopadhyay, Michael McCabe, Ruben Ohana, Miles Cranmer
Main category: cs.LG
TL;DR: Overtone introduces dynamic patch size control for transformer-based PDE surrogates to mitigate harmonic error accumulation and enable compute-adaptive inference.
Details
Motivation: Transformer-based PDE surrogates suffer from two main issues: fixed patch sizes cause systematic error accumulation at harmonic frequencies, and computational costs remain inflexible regardless of problem complexity or available resources.Method: Introduces Overtone with two architecture-agnostic modules: CSM (using dynamic stride modulation) and CKM (using dynamic kernel resizing) that enable dynamic patch size control during autoregressive rollouts. The cyclic modulation of patch sizes distributes errors across the frequency spectrum.
Result: Achieves up to 40% lower long rollout error in variance-normalised RMSE (VRMSE) compared to conventional static-patch surrogates. One Overtone model matches or exceeds fixed-patch baselines across inference compute budgets when trained under fixed total training budget.
Conclusion: Overtone provides a unified solution for harmonic mitigation and compute-adaptive deployment in transformer-based PDE surrogates through dynamic patch size control, offering flexible accuracy-speed tradeoffs based on computational constraints.
Abstract: Transformer-based PDE surrogates achieve remarkable performance but face two key challenges: fixed patch sizes cause systematic error accumulation at harmonic frequencies, and computational costs remain inflexible regardless of problem complexity or available resources. We introduce Overtone, a unified solution through dynamic patch size control at inference. Overtone’s key insight is that cyclically modulating patch sizes during autoregressive rollouts distributes errors across the frequency spectrum, mitigating the systematic harmonic artifact accumulation that plague fixed-patch models. We implement this through two architecture-agnostic modules–CSM (using dynamic stride modulation) and CKM (using dynamic kernel resizing)–that together provide both harmonic mitigation and compute-adaptive deployment. This flexible tokenization lets users trade accuracy for speed dynamically based on computational constraints, and the cyclic rollout strategy yields up to 40% lower long rollout error in variance-normalised RMSE (VRMSE) compared to conventional, static-patch surrogates. Across challenging 2D and 3D PDE benchmarks, one Overtone model matches or exceeds fixed-patch baselines across inference compute budgets, when trained under a fixed total training budget setting.
[499] Learning Unified Distance Metric for Heterogeneous Attribute Data Clustering
Yiqun Zhang, Mingjie Zhao, Yizhou Chen, Yang Lu, Yiu-ming Cheung
Main category: cs.LG
TL;DR: HARR is a novel clustering method for mixed numerical and categorical data that learns unified distance metrics by projecting attributes into multiple learnable spaces, integrating metric learning with clustering in a parameter-free, convergence-guaranteed framework.
Details
Motivation: Real-world datasets often contain both numerical and categorical attributes, but existing clustering methods struggle to effectively handle this heterogeneity. Current approaches either encode attributes into one type or use unified metrics without revealing inherent connections between different attribute types.Method: Proposes Heterogeneous Attribute Reconstruction and Representation (HARR) learning paradigm that: 1) transforms heterogeneous attributes into homogeneous status for distance metric learning, 2) projects each attribute’s values into unified learnable multiple spaces for finer representation, 3) integrates metric learning with clustering to automatically adapt to different clustering tasks.
Result: Extensive experiments demonstrate HARR’s superiority in terms of accuracy and efficiency. The method is parameter-free, convergence-guaranteed, and can effectively self-adapt to different numbers of clusters (k).
Conclusion: HARR provides an effective solution for mixed data clustering by learning unified distance metrics through attribute projection into multiple spaces, outperforming existing methods while being parameter-free and convergence-guaranteed.
Abstract: Datasets composed of numerical and categorical attributes (also called mixed data hereinafter) are common in real clustering tasks. Differing from numerical attributes that indicate tendencies between two concepts (e.g., high and low temperature) with their values in well-defined Euclidean distance space, categorical attribute values are different concepts (e.g., different occupations) embedded in an implicit space. Simultaneously exploiting these two very different types of information is an unavoidable but challenging problem, and most advanced attempts either encode the heterogeneous numerical and categorical attributes into one type, or define a unified metric for them for mixed data clustering, leaving their inherent connection unrevealed. This paper, therefore, studies the connection among any-type of attributes and proposes a novel Heterogeneous Attribute Reconstruction and Representation (HARR) learning paradigm accordingly for cluster analysis. The paradigm transforms heterogeneous attributes into a homogeneous status for distance metric learning, and integrates the learning with clustering to automatically adapt the metric to different clustering tasks. Differing from most existing works that directly adopt defined distance metrics or learn attribute weights to search clusters in a subspace. We propose to project the values of each attribute into unified learnable multiple spaces to more finely represent and learn the distance metric for categorical data. HARR is parameter-free, convergence-guaranteed, and can more effectively self-adapt to different sought number of clusters $k$. Extensive experiments illustrate its superiority in terms of accuracy and efficiency.
[500] VSPrefill: Vertical-Slash Sparse Attention with Lightweight Indexing for Long-Context Prefilling
Chen Guanzhong
Main category: cs.LG
TL;DR: VSPrefill: A lightweight training method using vertical-slash attention patterns to enable efficient long-context inference in LLMs with linear complexity, preserving accuracy while achieving 4.95x speedup at 128k context.
Details
Motivation: The quadratic complexity of self-attention during prefill phase limits long-context inference in LLMs. Existing sparse attention methods face trade-offs between context adaptivity, sampling overhead, and fine-tuning costs.Method: Proposes VSPrefill using vertical-slash structural patterns in attention distributions. A compact VSIndexer module predicts context-aware importance scores for vertical columns and slash diagonals from key-value representations augmented with RoPE, constructing sparse masks with linear complexity without modifying backbone parameters.
Result: Evaluated on Qwen3-4B-Instruct and LLaMA-3.1-8B-Instruct across LongBench and RULER benchmarks, VSPrefill preserves 98.35% of full attention accuracy while delivering 4.95x average speedup at context length of 128k.
Conclusion: Establishes a new Pareto frontier in the trade-off between accuracy and efficiency for long-context inference in large language models.
Abstract: The quadratic complexity of self-attention during the prefill phase impedes long-context inference in large language models. Existing sparse attention methods face a trade-off among context adaptivity, sampling overhead, and fine-tuning costs. We propose VSPrefill, a mechanism requiring lightweight training that uses the vertical-slash structural pattern in attention distributions. Our compact VSIndexer module predicts context-aware importance scores for vertical columns and slash diagonals from key-value representations augmented with RoPE. This approach constructs sparse masks with linear complexity without modifying the backbone parameters. During inference, an adaptive cumulative-threshold strategy allocates sparsity budgets per layer, while a fused kernel executes attention with on-the-fly index merging. Evaluated on Qwen3-4B-Instruct and LLaMA-3.1-8B-Instruct across the LongBench and RULER benchmarks, VSPrefill preserves 98.35% of the full attention accuracy while delivering a 4.95x average speedup at a context length of 128k. These results establish a new Pareto frontier in the trade-off between accuracy and efficiency.
[501] MAD-SmaAt-GNet: A Multimodal Advection-Guided Neural Network for Precipitation Nowcasting
Samuel van Wonderen, Siamak Mehrkanoon
Main category: cs.LG
TL;DR: MAD-SmaAt-GNet extends SmaAt-UNet for precipitation nowcasting by adding multimodal weather variable inputs and physics-based advection guidance, achieving 8.9% MSE reduction for 4-hour forecasts.
Details
Motivation: Traditional numerical weather prediction is computationally expensive and doesn't fully leverage available weather data. Deep learning models offer efficient alternatives, but existing CNN-based approaches like SmaAt-UNet could be improved by incorporating multiple weather variables and physical constraints.Method: Extends SmaAt-UNet with: (1) additional encoder for multimodal weather variables, (2) physics-based advection component for physically consistent predictions. Combines deep learning with physical modeling principles.
Result: 8.9% reduction in mean squared error compared to baseline SmaAt-UNet for four-step precipitation forecasting up to four hours ahead. Multimodal inputs benefit short lead times, advection component helps both short and long horizons.
Conclusion: Combining multimodal learning with physics-based constraints improves precipitation nowcasting. The approach demonstrates synergy between data-driven and physics-informed methods for weather forecasting.
Abstract: Precipitation nowcasting (short-term forecasting) is still often performed using numerical solvers for physical equations, which are computationally expensive and make limited use of the large volumes of available weather data. Deep learning models have shown strong potential for precipitation nowcasting, offering both accuracy and computational efficiency. Among these models, convolutional neural networks (CNNs) are particularly effective for image-to-image prediction tasks. The SmaAt-UNet is a lightweight CNN based architecture that has demonstrated strong performance for precipitation nowcasting. This paper introduces the Multimodal Advection-Guided Small Attention GNet (MAD-SmaAt-GNet), which extends the core SmaAt-UNet by (i) incorporating an additional encoder to learn from multiple weather variables and (ii) integrating a physics-based advection component to ensure physically consistent predictions. We show that each extension individually improves rainfall forecasts and that their combination yields further gains. MAD-SmaAt-GNet reduces the mean squared error (MSE) by 8.9% compared with the baseline SmaAt-UNet for four-step precipitation forecasting up to four hours ahead. Additionally, experiments indicate that multimodal inputs are particularly beneficial for short lead times, while the advection-based component enhances performance across both short and long forecasting horizons.
[502] Understanding the Dynamics of Demonstration Conflict in In-Context Learning
Difan Jiao, Di Wang, Lijie Hu
Main category: cs.LG
TL;DR: Models struggle with conflicting demonstrations in in-context learning, showing systematic misleading behavior due to two-phase processing of correct vs. incorrect rules in different attention heads.
Details
Motivation: In-context learning is vulnerable to noisy or conflicting demonstrations, but it's unclear how models internally process such conflicts. The paper aims to understand the mechanisms behind models' systematic misleading behavior when exposed to corrupted rule demonstrations.Method: Study demonstration-dependent tasks requiring rule inference, analyze performance degradation from corrupted demonstrations, use linear probes and logit lens analysis to examine internal representations, identify specific attention heads responsible for different phases of processing, and validate through targeted ablation experiments.
Result: Models show substantial performance degradation from single corrupted demonstrations. They encode both correct and incorrect rules in intermediate layers but develop prediction confidence only in late layers. Two types of attention heads identified: Vulnerability Heads (early-to-middle layers) with positional bias and high corruption sensitivity, and Susceptible Heads (late layers) that reduce support for correct predictions. Masking a small number of these heads improves performance by over 10%.
Conclusion: Large language models have a two-phase computational structure for processing conflicting evidence in in-context learning, with specific attention heads responsible for vulnerability to corruption. This reveals systematic reasoning failures that can be mitigated through targeted interventions.
Abstract: In-context learning enables large language models to perform novel tasks through few-shot demonstrations. However, demonstrations per se can naturally contain noise and conflicting examples, making this capability vulnerable. To understand how models process such conflicts, we study demonstration-dependent tasks requiring models to infer underlying patterns, a process we characterize as rule inference. We find that models suffer substantial performance degradation from a single demonstration with corrupted rule. This systematic misleading behavior motivates our investigation of how models process conflicting evidence internally. Using linear probes and logit lens analysis, we discover that under corruption models encode both correct and incorrect rules in intermediate layers but develop prediction confidence only in late layers, revealing a two-phase computational structure. We then identify attention heads for each phase underlying the reasoning failures: Vulnerability Heads in early-to-middle layers exhibit positional attention bias with high sensitivity to corruption, while Susceptible Heads in late layers significantly reduce support for correct predictions when exposed to the corrupted evidence. Targeted ablation validates our findings, with masking a small number of identified heads improving performance by over 10%.
[503] Towards Explainable Deep Learning for Ship Trajectory Prediction in Inland Waterways
Tom Legel, Dirk Söffker, Roland Schätzle, Kathrin Donandt
Main category: cs.LG
TL;DR: LSTM-based ship trajectory prediction model with attention-based fusion using learned ship domain parameters for interpretability in inland waterways
Details
Motivation: Need for accurate ship trajectory predictions in crowded inland waterways, with growing concern about explainability of deep learning models that may obscure inaccurate logic and undermine confidence in reliabilityMethod: LSTM-based vessel trajectory prediction model incorporating trained ship domain parameters that provide insight into attention-based fusion of interacting vessels’ hidden states, focusing on interpretability in complex inland waterway encounters
Result: Achieved final displacement error of around 40 meters in 5-minute prediction horizon, comparable to similar studies; ship-to-ship attention improved accuracy but learned ship domain values didn’t align with expected causal relationships
Conclusion: Model demonstrates explanatory capabilities through intrinsically interpretable design, but accuracy improvements aren’t fully driven by causal relationships; future work includes counterfactual analysis and more sophisticated attention mechanisms
Abstract: Accurate predictions of ship trajectories in crowded environments are essential to ensure safety in inland waterways traffic. Recent advances in deep learning promise increased accuracy even for complex scenarios. While the challenge of ship-to-ship awareness is being addressed with growing success, the explainability of these models is often overlooked, potentially obscuring an inaccurate logic and undermining the confidence in their reliability. This study examines an LSTM-based vessel trajectory prediction model by incorporating trained ship domain parameters that provide insight into the attention-based fusion of the interacting vessels’ hidden states. This approach has previously been explored in the field of maritime shipping, yet the variety and complexity of encounters in inland waterways allow for a more profound analysis of the model’s interpretability. The prediction performance of the proposed model variants are evaluated using standard displacement error statistics. Additionally, the plausibility of the generated ship domain values is analyzed. With an final displacement error of around 40 meters in a 5-minute prediction horizon, the model performs comparably to similar studies. Though the ship-to-ship attention architecture enhances prediction accuracy, the weights assigned to vessels in encounters using the learnt ship domain values deviate from the expectation. The observed accuracy improvements are thus not entirely driven by a causal relationship between a predicted trajectory and the trajectories of nearby ships. This finding underscores the model’s explanatory capabilities through its intrinsically interpretable design. Future work will focus on utilizing the architecture for counterfactual analysis and on the incorporation of more sophisticated attention mechanisms.
[504] Activity Recognition from Smart Insole Sensor Data Using a Circular Dilated CNN
Yanhua Zhao
Main category: cs.LG
TL;DR: CDCNN model for activity classification using smart insole sensor data achieves 86.42% accuracy on 4-class task, comparable to XGBoost at 87.83%
Details
Motivation: Smart insoles with pressure sensors, accelerometers, and gyroscopes provide non-intrusive gait and posture monitoring, requiring effective classification methods for embedded deployment and real-time inferenceMethod: Circular dilated convolutional neural network (CDCNN) processes multi-modal time-series data from smart insoles (160-frame windows, 24 channels: 18 pressure, 3 accelerometer, 3 gyroscope axes)
Result: Achieved 86.42% test accuracy in subject-independent evaluation on four-class task (Standing, Walking, Sitting, Tandem), comparable to XGBoost at 87.83%; inertial sensors found to contribute substantially to discrimination
Conclusion: CDCNN approach is suitable for embedded deployment and real-time inference, with inertial sensors playing important role in activity discrimination
Abstract: Smart insoles equipped with pressure sensors, accelerometers, and gyroscopes offer a non-intrusive means of monitoring human gait and posture. We present an activity classification system based on a circular dilated convolutional neural network (CDCNN) that processes multi-modal time-series data from such insoles. The model operates on 160-frame windows with 24 channels (18 pressure, 3 accelerometer, 3 gyroscope axes), achieving 86.42% test accuracy in a subject-independent evaluation on a four-class task (Standing, Walking, Sitting, Tandem), compared with 87.83% for an extreme gradient-boosted tree (XGBoost) model trained on flattened data. Permutation feature importance reveals that inertial sensors (accelerometer and gyroscope) contribute substantially to discrimination. The approach is suitable for embedded deployment and real-time inference.
[505] Standing on the Shoulders of Giants: Rethinking EEG Foundation Model Pretraining via Multi-Teacher Distillation
Chenqi Li, Yu Liu, Shuo Zhang, Timothy Denison, Tingting Zhu
Main category: cs.LG
TL;DR: MTDP framework uses multi-teacher distillation from vision/time-series models to pretrain EEG foundation models, outperforming self-supervised methods with less data.
Details
Motivation: EEG data is expensive to collect and has low signal-to-noise ratio, making self-supervised masked reconstruction challenging for scaling EEG foundation models. The paper explores leveraging well-established foundation models from other modalities to bootstrap EEG pretraining.Method: Two-stage multi-teacher distillation: 1) Learnable gating network fuses representations from diverse teachers (e.g., DINOv3, Chronos) via masked latent denoising objective; 2) Distills fused representation into EEG foundation model.
Result: MTDP-based EEG foundation model outperforms self-supervised counterparts across 9 downstream tasks and 12 datasets while requiring only 25% of pretraining data.
Conclusion: Multi-teacher distillation from established vision/time-series models effectively bootstraps EEG foundation model training, addressing data scarcity and noise challenges in EEG domain.
Abstract: Pretraining for electroencephalogram (EEG) foundation models has predominantly relied on self-supervised masked reconstruction, a paradigm largely adapted from and inspired by the success of vision and language foundation models. However, unlike images and text, EEG datasets are notoriously expensive to collect and characterized by low signal-to-noise ratio. These challenges introduce difficulties in scaling the EEG foundation models and capturing the underlying neural semantics through reconstruction. In this work, we ask the question: can we stand on the shoulders of well-established foundation models from well-represented modalities to bootstrap the pretraining of EEG foundation models? We first demonstrate that mainstream foundation models, such as those from vision and time series, transfer surprisingly well to EEG domain. To this end, we propose the Multi-Teacher Distillation Pretraining (MTDP) framework for pretraining EEG foundation models via a two-stage multi-teacher distillation. In the first stage, we introduce a learnable gating network to fuse representations from diverse teachers (e.g., DINOv3 and Chronos) via a masked latent denoising objective. In the second stage, we distill the fused representation into an EEG foundation model. Extensive evaluations across 9 downstream tasks and 12 datasets demonstrate that our MTDP-based EEG foundation model outperforms its self-supervised counterparts while requiring only 25% of the pretraining data.
[506] Augmenting representations with scientific papers
Nicolò Oreste Pinciroli Vago, Rocco Di Tella, Carolina Cuesta-Lázaro, Michael J. Smith, Cecilia Garraffo, Rafael Martínez-Galarza
Main category: cs.LG
TL;DR: A contrastive learning framework that aligns X-ray spectra with scientific literature to create shared multimodal representations, improving physical variable estimation by 16-18% and enabling better source interpretation.
Details
Motivation: Astronomers have vast multimodal data (images, spectra, time series) and literature, but these sources are rarely systematically integrated. There's a need to connect observational data with domain knowledge from scientific texts to accelerate interpretation of rare or poorly understood astrophysical sources.Method: Proposes a contrastive learning framework that aligns X-ray spectra with domain knowledge extracted from scientific literature. Uses a Mixture of Experts (MoE) strategy that leverages both unimodal and shared representations. The pipeline creates shared multimodal latent spaces that encode physically significant information.
Result: Achieves 20% Recall@1% when retrieving texts from spectra. Improves estimation of 20 physical variables by 16-18% over unimodal spectral baselines. The shared latent space effectively encodes physically significant information and identifies high-priority targets including a candidate pulsating ULX (PULX) and gravitational lens system.
Conclusion: Meaningful alignment between spectra and scientific texts is possible and can accelerate interpretation of astrophysical sources. The MoE strategy yields superior performance, and the framework can be extended to other scientific domains where aligning observational data with literature is valuable.
Abstract: Astronomers have acquired vast repositories of multimodal data, including images, spectra, and time series, complemented by decades of literature that analyzes astrophysical sources. Still, these data sources are rarely systematically integrated. This work introduces a contrastive learning framework designed to align X-ray spectra with domain knowledge extracted from scientific literature, facilitating the development of shared multimodal representations. Establishing this connection is inherently complex, as scientific texts encompass a broader and more diverse physical context than spectra. We propose a contrastive pipeline that achieves a 20% Recall@1% when retrieving texts from spectra, proving that a meaningful alignment between these modalities is not only possible but capable of accelerating the interpretation of rare or poorly understood sources. Furthermore, the resulting shared latent space effectively encodes physically significant information. By fusing spectral and textual data, we improve the estimation of 20 physical variables by 16-18% over unimodal spectral baselines. Our results indicate that a Mixture of Experts (MoE) strategy, which leverages both unimodal and shared representations, yields superior performance. Finally, outlier analysis within the multimodal latent space identifies high-priority targets for follow-up investigation, including a candidate pulsating ULX (PULX) and a gravitational lens system. Importantly, this framework can be extended to other scientific domains where aligning observational data with existing literature is possible.
[507] Invariant Causal Routing for Governing Social Norms in Online Market Economies
Xiangning Yu, Qirui Mi, Xiao Xue, Haoxuan Li, Yiwei Shi, Xiaowei Liu, Mengyue Yang
Main category: cs.LG
TL;DR: Proposes Invariant Causal Routing (ICR), a causal governance framework for understanding and steering emergent social norms in online market economies using invariant causal discovery and counterfactual reasoning.
Details
Motivation: Social norms like fair exposure, sustained participation, and balanced reinvestment are critical for long-term stability in online market economies, but understanding their causal mechanisms and designing effective interventions is challenging due to complex micro-level interactions aggregating into macro-level regularities.Method: Invariant Causal Routing (ICR) integrates counterfactual reasoning with invariant causal discovery to identify policy-norm relations stable across heterogeneous environments, separating genuine causal effects from spurious correlations and constructing interpretable, auditable policy rules.
Result: In heterogeneous agent simulations calibrated with real data, ICR yields more stable norms, smaller generalization gaps, and more concise rules than correlation or coverage baselines.
Conclusion: Causal invariance offers a principled and interpretable foundation for governance in complex economic systems, enabling effective policy interventions that remain robust under distribution shift.
Abstract: Social norms are stable behavioral patterns that emerge endogenously within economic systems through repeated interactions among agents. In online market economies, such norms – like fair exposure, sustained participation, and balanced reinvestment – are critical for long-term stability. We aim to understand the causal mechanisms driving these emergent norms and to design principled interventions that can steer them toward desired outcomes. This is challenging because norms arise from countless micro-level interactions that aggregate into macro-level regularities, making causal attribution and policy transferability difficult. To address this, we propose \textbf{Invariant Causal Routing (ICR)}, a causal governance framework that identifies policy-norm relations stable across heterogeneous environments. ICR integrates counterfactual reasoning with invariant causal discovery to separate genuine causal effects from spurious correlations and to construct interpretable, auditable policy rules that remain effective under distribution shift. In heterogeneous agent simulations calibrated with real data, ICR yields more stable norms, smaller generalization gaps, and more concise rules than correlation or coverage baselines, demonstrating that causal invariance offers a principled and interpretable foundation for governance.
[508] Neural Network-Based Parameter Estimation of a Labour Market Agent-Based Model
M Lopes Alves, Joel Dyer, Doyne Farmer, Michael Wooldridge, Anisoara Calinescu
Main category: cs.LG
TL;DR: A study applying neural network-based simulation-based inference (SBI) for parameter estimation in large-scale agent-based models, tested on a labor market ABM with synthetic and real U.S. data.
Details
Motivation: Agent-based models face challenges in parameter estimation due to computational constraints when exploring large parameter spaces, limiting their use as decision-support tools despite widespread adoption across fields.Method: Uses a state-of-the-art simulation-based inference framework with neural networks for parameter estimation. Applied to an established labor market ABM based on job transition networks, initiated with synthetic datasets and real U.S. labor market data. Compares effectiveness of traditional summary statistics with those learned by an embedded neural network.
Result: The neural network-based approach successfully recovers original parameters when evaluating posterior distributions across various dataset scales and improves efficiency compared to traditional Bayesian methods.
Conclusion: Neural network-based simulation-based inference provides an effective and efficient solution for parameter estimation in large-scale agent-based models, overcoming computational limitations of traditional methods.
Abstract: Agent-based modelling (ABM) is a widespread approach to simulate complex systems. Advancements in computational processing and storage have facilitated the adoption of ABMs across many fields; however, ABMs face challenges that limit their use as decision-support tools. A significant issue is parameter estimation in large-scale ABMs, particularly due to computational constraints on exploring the parameter space. This study evaluates a state-of-the-art simulation-based inference (SBI) framework that uses neural networks (NN) for parameter estimation. This framework is applied to an established labour market ABM based on job transition networks. The ABM is initiated with synthetic datasets and the real U.S. labour market. Next, we compare the effectiveness of summary statistics derived from a list of statistical measures with that learned by an embedded NN. The results demonstrate that the NN-based approach recovers the original parameters when evaluating posterior distributions across various dataset scales and improves efficiency compared to traditional Bayesian methods.
[509] An LLM-Guided Query-Aware Inference System for GNN Models on Large Knowledge Graphs
Waleed Afandi, Hussein Abdallah, Ashraf Aboulnaga, Essam Mansour
Main category: cs.LG
TL;DR: KG-WISE: A task-driven inference paradigm for GNNs on large knowledge graphs that uses LLMs to generate query templates for extracting semantically relevant subgraphs, enabling query-aware model instantiation with partial loading of fine-grained components.
Details
Motivation: Existing GNN acceleration methods (pruning, quantization, knowledge distillation) create smaller models but don't adapt to individual query structures/semantics, leading to excessive data loading and redundant computation on large knowledge graphs.Method: Decomposes trained GNN models into fine-grained components that can be partially loaded based on queried subgraph structure. Uses LLMs to generate reusable query templates that extract semantically relevant subgraphs for each task.
Result: Achieves up to 28x faster inference and 98% lower memory usage than state-of-the-art systems while maintaining or improving accuracy across both commercial and open-weight LLMs on six large KGs with up to 42M nodes and 166M edges.
Conclusion: KG-WISE enables efficient, query-aware GNN inference on large knowledge graphs by combining LLM-generated semantic query templates with fine-grained model decomposition and partial loading.
Abstract: Efficient inference for graph neural networks (GNNs) on large knowledge graphs (KGs) is essential for many real-world applications. GNN inference queries are computationally expensive and vary in complexity, as each involves a different number of target nodes linked to subgraphs of diverse densities and structures. Existing acceleration methods, such as pruning, quantization, and knowledge distillation, instantiate smaller models but do not adapt them to the structure or semantics of individual queries. They also store models as monolithic files that must be fully loaded, and miss the opportunity to retrieve only the neighboring nodes and corresponding model components that are semantically relevant to the target nodes. These limitations lead to excessive data loading and redundant computation on large KGs. This paper presents KG-WISE, a task-driven inference paradigm for large KGs. KG-WISE decomposes trained GNN models into fine-grained components that can be partially loaded based on the structure of the queried subgraph. It employs large language models (LLMs) to generate reusable query templates that extract semantically relevant subgraphs for each task, enabling query-aware and compact model instantiation. We evaluate KG-WISE on six large KGs with up to 42 million nodes and 166 million edges. KG-WISE achieves up to 28x faster inference and 98% lower memory usage than state-of-the-art systems while maintaining or improving accuracy across both commercial and open-weight LLMs.
[510] Oracle-efficient Hybrid Learning with Constrained Adversaries
Princewill Okoroafor, Robert Kleinberg, Michael P. Kim
Main category: cs.LG
TL;DR: Efficient algorithm for hybrid online learning with statistical optimality using structured adversarial constraints and ERM oracle
Details
Motivation: Bridge the gap between statistically-optimal but computationally intractable algorithms and computationally-efficient but statistically-suboptimal algorithms in hybrid online learningMethod: New learning algorithm with ERM oracle, using structured adversarial constraints where labels come from fixed function class R, with Frank-Wolfe reduction using truncated entropy regularizer and hybrid martingale tail bounds
Result: Algorithm achieves regret scaling with Rademacher complexity of derived class from H and R, provides oracle-efficient algorithm for computing equilibria in structured stochastic zero-sum games
Conclusion: Significant step toward simultaneous statistical optimality and computational efficiency in hybrid learning through structured adversarial constraints
Abstract: The Hybrid Online Learning Problem, where features are drawn i.i.d. from an unknown distribution but labels are generated adversarially, is a well-motivated setting positioned between statistical and fully-adversarial online learning. Prior work has presented a dichotomy: algorithms that are statistically-optimal, but computationally intractable (Wu et al., 2023), and algorithms that are computationally-efficient (given an ERM oracle), but statistically-suboptimal (Wu et al., 2024). This paper takes a significant step towards achieving statistical optimality and computational efficiency simultaneously in the Hybrid Learning setting. To do so, we consider a structured setting, where the Adversary is constrained to pick labels from an expressive, but fixed, class of functions $R$. Our main result is a new learning algorithm, which runs efficiently given an ERM oracle and obtains regret scaling with the Rademacher complexity of a class derived from the Learner’s hypothesis class $H$ and the Adversary’s label class $R$. As a key corollary, we give an oracle-efficient algorithm for computing equilibria in stochastic zero-sum games when action sets may be high-dimensional but the payoff function exhibits a type of low-dimensional structure. Technically, we develop a number of tools for the design and analysis of our learning algorithm, including a novel Frank-Wolfe reduction with “truncated entropy regularizer” and a new tail bound for sums of “hybrid” martingale difference sequences.
[511] Latent Particle World Models: Self-supervised Object-centric Stochastic Dynamics Modeling
Tal Daniel, Carl Qi, Dan Haramati, Amir Zadeh, Chuan Li, Aviv Tamar, Deepak Pathak, David Held
Main category: cs.LG
TL;DR: LPWM is a self-supervised object-centric world model that discovers keypoints, bounding boxes, and object masks from video data without supervision, enabling rich scene decomposition and supporting decision-making applications.
Details
Motivation: To create a scalable world model that can autonomously discover object representations from real-world video data and be applicable to decision-making tasks, addressing the need for unsupervised scene understanding in complex environments.Method: Uses an end-to-end trained architecture that discovers keypoints, bounding boxes, and object masks directly from videos. Features a novel latent action module for modeling stochastic particle dynamics and supports flexible conditioning on actions, language, and image goals.
Result: Achieves state-of-the-art results on diverse real-world and synthetic datasets for stochastic video modeling. Successfully demonstrates applicability to decision-making tasks including goal-conditioned imitation learning.
Conclusion: LPWM provides an effective self-supervised approach for object-centric world modeling that scales to real-world multi-object datasets and bridges the gap between video understanding and decision-making applications.
Abstract: We introduce Latent Particle World Model (LPWM), a self-supervised object-centric world model scaled to real-world multi-object datasets and applicable in decision-making. LPWM autonomously discovers keypoints, bounding boxes, and object masks directly from video data, enabling it to learn rich scene decompositions without supervision. Our architecture is trained end-to-end purely from videos and supports flexible conditioning on actions, language, and image goals. LPWM models stochastic particle dynamics via a novel latent action module and achieves state-of-the-art results on diverse real-world and synthetic datasets. Beyond stochastic video modeling, LPWM is readily applicable to decision-making, including goal-conditioned imitation learning, as we demonstrate in the paper. Code, data, pre-trained models and video rollouts are available: https://taldatech.github.io/lpwm-web
[512] Why Do Neural Networks Forget: A Study of Collapse in Continual Learning
Yunqin Zhu, Jun Jin
Main category: cs.LG
TL;DR: Study investigates correlation between catastrophic forgetting and structural collapse in continual learning using effective rank measurements across different architectures and training strategies.
Details
Motivation: Most continual learning approaches are evaluated through task accuracy alone, ignoring internal model structure. Recent research suggests structural collapse leads to loss of plasticity and forgetting, as networks lose ability to expand feature space for new tasks.Method: Measured weight and activation effective rank (eRank) to investigate correlation between forgetting and structural collapse. Evaluated four architectures (MLP, ConvGRU, ResNet-18, Bi-ConvGRU) on split MNIST and Split CIFAR-100 benchmarks using SGD, Learning-without-Forgetting (LwF), and Experience Replay (ER) strategies.
Result: Results demonstrate that forgetting and structural collapse are strongly related. Different continual learning strategies help models preserve both capacity and performance with varying efficiency.
Conclusion: Structural collapse measurement through effective rank provides important insights into forgetting mechanisms in continual learning, beyond just task accuracy metrics.
Abstract: Catastrophic forgetting is a major problem in continual learning, and lots of approaches arise to reduce it. However, most of them are evaluated through task accuracy, which ignores the internal model structure. Recent research suggests that structural collapse leads to loss of plasticity, as evidenced by changes in effective rank (eRank). This indicates a link to forgetting, since the networks lose the ability to expand their feature space to learn new tasks, which forces the network to overwrite existing representations. Therefore, in this study, we investigate the correlation between forgetting and collapse through the measurement of both weight and activation eRank. To be more specific, we evaluated four architectures, including MLP, ConvGRU, ResNet-18, and Bi-ConvGRU, in the split MNIST and Split CIFAR-100 benchmarks. Those models are trained through the SGD, Learning-without-Forgetting (LwF), and Experience Replay (ER) strategies separately. The results demonstrate that forgetting and collapse are strongly related, and different continual learning strategies help models preserve both capacity and performance in different efficiency.
[513] A Late-Fusion Multimodal AI Framework for Privacy-Preserving Deduplication in National Healthcare Data Environments
Mohammed Omer Shakeel Ahmed
Main category: cs.LG
TL;DR: A privacy-preserving multimodal AI framework for duplicate detection in CRM/healthcare using semantic embeddings, behavioral patterns, and device metadata without sensitive PII.
Details
Motivation: Traditional deduplication methods rely on sensitive PII (names, emails, SSNs), making them ineffective under strict privacy regulations like GDPR and HIPAA. There's a need for privacy-compliant entity resolution solutions.Method: Multimodal framework using: 1) semantic embeddings from textual fields via DistilBERT, 2) behavioral patterns from login timestamps, 3) device metadata via categorical embeddings. Combined using late fusion and clustered with DBSCAN (unsupervised density-based algorithm).
Result: The framework demonstrated good performance with a good F1-score on synthetic CRM dataset, effectively identifying duplicates despite data variations and noise, outperforming traditional string-matching baseline.
Conclusion: Offers a privacy-compliant solution for entity resolution that supports secure digital infrastructure, enhances public health analytics reliability, and promotes ethical AI adoption, suitable for national health data modernization efforts.
Abstract: Duplicate records pose significant challenges in customer relationship management (CRM)and healthcare, often leading to inaccuracies in analytics, impaired user experiences, and compliance risks. Traditional deduplication methods rely heavily on direct identifiers such as names, emails, or Social Security Numbers (SSNs), making them ineffective under strict privacy regulations like GDPR and HIPAA, where such personally identifiable information (PII) is restricted or masked. In this research, I propose a novel, scalable, multimodal AI framework for detecting duplicates without depending on sensitive information. This system leverages three distinct modalities: semantic embeddings derived from textual fields (names, cities) using pre-trained DistilBERT models, behavioral patterns extracted from user login timestamps, and device metadata encoded through categorical embeddings. These heterogeneous modalities are combined using a late fusion approach and clustered via DBSCAN, an unsupervised density-based algorithm. This proposed model is evaluated against a traditional string-matching baseline on a synthetic CRM dataset specifically designed to reflect privacy-preserving constraints. The multimodal framework demonstrated good performance, achieving a good F1-score by effectively identifying duplicates despite variations and noise inherent in the data. This approach offers a privacy-compliant solution to entity resolution and supports secure digital infrastructure, enhances the reliability of public health analytics, and promotes ethical AI adoption across government and enterprise settings. It is well-suited for integration into national health data modernization efforts, aligning with broader goals of privacy-first innovation.
[514] PDE foundation model-accelerated inverse estimation of system parameters in inertial confinement fusion
Mahindra Rautela, Alexander Scheinker, Bradley Love, Diane Oyen, Nathan DeBardeleben, Earl Lawrence, Ayan Biswas
Main category: cs.LG
TL;DR: Fine-tuning PDE foundation models for inverse problems in inertial confinement fusion, achieving accurate hyperspectral image reconstruction and parameter estimation from multi-modal observations.
Details
Motivation: Most PDE foundation model evaluations focus on forward problems (autoregressive rollout prediction), but there's a need to study inverse problems where system parameters must be estimated from observations. The work specifically addresses inverse problems in inertial confinement fusion (ICF) using multi-modal, snapshot-style observations.Method: Using the open JAG benchmark with hyperspectral X-ray images and scalar observables, the authors fine-tune a PDE foundation model and train a lightweight task-specific head to jointly reconstruct hyperspectral images and regress system parameters. They conduct data-scaling experiments (5%-100% of training set) and compare fine-tuning from pretrained MORPH weights versus training from scratch.
Result: The fine-tuned model achieves accurate hyperspectral reconstruction (test MSE 1.2e-3) and strong parameter-estimation performance (up to R^2=0.995). Data-scaling shows consistent improvements in both reconstruction and regression losses with increasing data, with largest marginal gains in low-data regime. Fine-tuning from pretrained weights outperforms training from scratch, demonstrating foundation-model initialization improves sample efficiency.
Conclusion: Foundation-model initialization improves sample efficiency for data-limited inverse problems in ICF, enabling accurate parameter estimation and hyperspectral reconstruction from multi-modal observations through fine-tuning with task-specific heads.
Abstract: PDE foundation models are typically pretrained on large, diverse corpora of PDE datasets and can be adapted to new settings with limited task-specific data. However, most downstream evaluations focus on forward problems, such as autoregressive rollout prediction. In this work, we study an inverse problem in inertial confinement fusion (ICF): estimating system parameters (inputs) from multi-modal, snapshot-style observations (outputs). Using the open JAG benchmark, which provides hyperspectral X-ray images and scalar observables per simulation, we finetune the PDE foundation model and train a lightweight task-specific head to jointly reconstruct hyperspectral images and regress system parameters. The fine-tuned model achieves accurate hyperspectral reconstruction (test MSE 1.2e-3) and strong parameter-estimation performance (up to R^2=0.995). Data-scaling experiments (5%-100% of the training set) show consistent improvements in both reconstruction and regression losses as the amount of training data increases, with the largest marginal gains in the low-data regime. Finally, finetuning from pretrained MORPH weights outperforms training the same architecture from scratch, demonstrating that foundation-model initialization improves sample efficiency for data-limited inverse problems in ICF.
[515] K-Means as a Radial Basis function Network: a Variational and Gradient-based Equivalence
Felipe de Jesus Felix Arredondo, Alejandro Ucan-Puc, Carlos Astengo Noguez
Main category: cs.LG
TL;DR: Paper establishes theoretical equivalence between K-Means clustering and differentiable RBF neural networks, enabling end-to-end differentiable clustering in deep learning architectures.
Details
Motivation: Bridge the gap between discrete clustering algorithms (K-Means) and continuous optimization in deep learning, enabling joint optimization of representations and clusters within neural networks.Method: Reparameterize K-Means objective, embed distortion functional into smooth weighted loss, prove Γ-convergence to K-Means as temperature parameter vanishes, use Entmax-1.5 for numerical stability in low-temperature regime.
Result: Demonstrated theoretical equivalence: gradient updates of RBF centers recover exact K-Means centroid updates, identical training trajectories in limit, monotone collapse of soft RBF centroids toward K-Means fixed points.
Conclusion: Established rigorous variational equivalence enabling K-Means to be embedded directly into deep learning architectures for end-to-end differentiable clustering, bridging discrete partitioning and continuous optimization.
Abstract: This work establishes a rigorous variational and gradient-based equivalence between the classical K-Means algorithm and differentiable Radial Basis Function (RBF) neural networks with smooth responsibilities. By reparameterizing the K-Means objective and embedding its distortion functional into a smooth weighted loss, we prove that the RBF objective $Γ$-converges to the K-Means solution as the temperature parameter $σ$ vanishes. We further demonstrate that the gradient-based updates of the RBF centers recover the exact K-Means centroid update rule and induce identical training trajectories in the limit. To address the numerical instability of the Softmax transformation in the low-temperature regime, we propose the integration of Entmax-1.5, which ensures stable polynomial convergence while preserving the underlying Voronoi partition structure. These results bridge the conceptual gap between discrete partitioning and continuous optimization, enabling K-Means to be embedded directly into deep learning architectures for the joint optimization of representations and clusters. Empirical validation across diverse synthetic geometries confirms a monotone collapse of soft RBF centroids toward K-Means fixed points, providing a unified framework for end-to-end differentiable clustering.
[516] When Sensors Fail: Temporal Sequence Models for Robust PPO under Sensor Drift
Kevin Vogt-Lowell, Theodoros Tsiligkaridis, Rodney Lafuente-Mercado, Surabhi Ghatti, Shanghua Gao, Marinka Zitnik, Daniela Rus
Main category: cs.LG
TL;DR: Transformer-augmented PPO policies maintain robustness under sensor failures by using temporal sequence reasoning to infer missing information from history, outperforming MLP, RNN, and SSM baselines.
Details
Motivation: Real-world RL systems face distributional drift from sensor failures causing partial observability, but most policy architectures assume fully observed states. Need robust policies that can handle temporally persistent sensor failures.Method: Augment PPO with temporal sequence models (Transformers and State Space Models) to enable policies to infer missing information from history. Analyze robustness under stochastic sensor failure process with theoretical bounds on reward degradation.
Result: Transformer-based sequence policies substantially outperform MLP, RNN, and SSM baselines on MuJoCo benchmarks with severe sensor dropout, maintaining high returns even with large fractions of sensors unavailable.
Conclusion: Temporal sequence reasoning provides principled and practical mechanism for reliable operation under observation drift caused by sensor unreliability, with Transformers showing best robustness.
Abstract: Real-world reinforcement learning systems must operate under distributional drift in their observation streams, yet most policy architectures implicitly assume fully observed and noise-free states. We study robustness of Proximal Policy Optimization (PPO) under temporally persistent sensor failures that induce partial observability and representation shift. To respond to this drift, we augment PPO with temporal sequence models, including Transformers and State Space Models (SSMs), to enable policies to infer missing information from history and maintain performance. Under a stochastic sensor failure process, we prove a high-probability bound on infinite-horizon reward degradation that quantifies how robustness depends on policy smoothness and failure persistence. Empirically, on MuJoCo continuous-control benchmarks with severe sensor dropout, we show Transformer-based sequence policies substantially outperform MLP, RNN, and SSM baselines in robustness, maintaining high returns even when large fractions of sensors are unavailable. These results demonstrate that temporal sequence reasoning provides a principled and practical mechanism for reliable operation under observation drift caused by sensor unreliability.
[517] Neuro-Symbolic Financial Reasoning via Deterministic Fact Ledgers and Adversarial Low-Latency Hallucination Detector
Pedram Agand
Main category: cs.LG
TL;DR: VeNRA introduces a verifiable numerical reasoning agent for financial domains using deterministic variable retrieval via Universal Fact Ledger and Double-Lock Grounding, with a 3B-parameter Sentinel SLM for forensic audit of Python execution traces.
Details
Motivation: Standard RAG architectures fail in high-stakes financial domains due to LLM arithmetic incompetence and semantic conflation in dense vector retrieval, where even 99% accuracy yields 0% operational trust in deterministic domains.Method: VeNRA shifts from probabilistic text retrieval to deterministic variable retrieval via Universal Fact Ledger with Double-Lock Grounding. Includes 3B-parameter Sentinel SLM trained with Adversarial Simulation (programmatically sabotaging golden records) and optimized with single-pass classification and novel Micro-Chunking loss algorithm to address Loss Dilution in Reverse-Chain-of-Thought training.
Result: The paper presents a system achieving zero-hallucination financial reasoning through verifiable numerical reasoning, with forensic auditing capabilities for Python execution traces under strict latency constraints.
Conclusion: VeNRA enables trustworthy financial reasoning by addressing fundamental limitations of standard RAG through deterministic variable retrieval, forensic auditing, and novel training techniques for high-stakes applications.
Abstract: Standard Retrieval-Augmented Generation (RAG) architectures fail in high-stakes financial domains due to two fundamental limitations: the inherent arithmetic incompetence of Large Language Models (LLMs) and the distributional semantic conflation of dense vector retrieval (e.g., mapping Net Income'' to Net Sales’’ due to contextual proximity). In deterministic domains, a 99% accuracy rate yields 0% operational trust. To achieve zero-hallucination financial reasoning, we introduce the Verifiable Numerical Reasoning Agent (VeNRA). VeNRA shifts the RAG paradigm from retrieving probabilistic text to retrieving deterministic variables via a strictly typed Universal Fact Ledger (UFL), mathematically bounded by a novel Double-Lock Grounding algorithm. Recognizing that upstream parsing anomalies inevitably occur, we introduce the VeNRA Sentinel: a 3-billion parameter SLM trained to forensically audit Python execution traces with only one token test budget. To train this model, we avoid traditional generative hallucination datasets in favor of Adversarial Simulation, programmatically sabotaging golden financial records to simulate production-level ``Ecological Errors’’ (e.g., Logic Code Lies and Numeric Neighbor Traps). Finally, to optimize the Sentinel under strict latency budgets, we utilize a single-pass classification paradigm with optional post thinking for debug. We identify the phenomenon of Loss Dilution in Reverse-Chain-of-Thought training and present a novel, OOM-safe Micro-Chunking loss algorithm to stabilize gradients under extreme differential penalization.
[518] Direct Estimation of Tree Volume and Aboveground Biomass Using Deep Regression with Synthetic Lidar Data
Habib Pourdelan, Zhengkang Xiang, Hugh Stewart, Cam Nicholson, Martin Tomko, Kourosh Khoshelham
Main category: cs.LG
TL;DR: Deep learning approach using synthetic point cloud data to directly estimate forest biomass from lidar, outperforming traditional allometric methods.
Details
Motivation: Traditional forest biomass estimation relies on indirect allometric models with limited accuracy due to measurement uncertainties and approximations that don't fully account for tree variability. A more direct, accurate approach is needed for climate change monitoring.Method: Created synthetic 3D forest plots with ground truth volume, converted to point clouds using lidar simulator, trained deep regression networks (PointNet, PointNet++, DGCNN, PointConv) on synthetic data, then applied to real lidar data for volume and biomass estimation.
Result: Deep networks achieved 1.69-8.11% MAPE on synthetic data. On real data, direct approach showed 2-20% discrepancies vs field measurements, while indirect methods showed 27-85% underestimation.
Conclusion: Integrating synthetic data with deep learning enables efficient, scalable, and accurate forest carbon estimation at plot level, outperforming traditional indirect approaches.
Abstract: Accurate estimation of forest biomass is crucial for monitoring carbon sequestration and informing climate change mitigation strategies. Existing methods often rely on allometric models, which estimate individual tree biomass by relating it to measurable biophysical parameters, e.g., trunk diameter and height. This indirect approach is limited in accuracy due to measurement uncertainties and the inherently approximate nature of allometric equations, which may not fully account for the variability in tree characteristics and forest conditions. This study proposes a direct approach that leverages synthetic point cloud data to train a deep regression network, which is then applied to real point clouds for plot-level wood volume and aboveground biomass (AGB) estimation. We created synthetic 3D forest plots with ground truth volume, which were then converted into point cloud data using a lidar simulator. These point clouds were subsequently used to train deep regression networks based on PointNet, PointNet++, DGCNN, and PointConv. When applied to synthetic data, the deep regression networks achieved mean absolute percentage error (MAPE) values ranging from 1.69% to 8.11%. The trained networks were then applied to real lidar data to estimate volume and AGB. When compared against field measurements, our direct approach showed discrepancies of 2% to 20%. In contrast, indirect approaches based on individual tree segmentation followed by allometric conversion, as well as FullCAM, exhibited substantially large underestimation, with discrepancies ranging from 27% to 85%. Our results highlight the potential of integrating synthetic data with deep learning for efficient and scalable forest carbon estimation at plot level.
[519] Engineering Regression Without Real-Data Training: Domain Adaptation for Tabular Foundation Models Using Multi-Dataset Embeddings
Lyle Regenwetter, Rosen Yu, Cyril Picard, Faez Ahmed
Main category: cs.LG
TL;DR: TREDBench introduces 83 real-world tabular regression datasets with engineering labels, reveals synthetic-real domain gaps, and proposes embedding-guided synthetic data curation to adapt foundation models for engineering domains without real data.
Details
Motivation: Engineering applications have been limited by bespoke models and small tabular datasets, and existing tabular foundation models use synthetic data that doesn't reflect engineering data statistics, limiting transfer to engineering regression tasks.Method: Created TREDBench dataset collection, used TabPFN 2.5’s dataset-level embeddings to analyze domain structure, proposed embedding-guided synthetic data curation to identify “engineering-like” synthetic datasets, and performed continued pre-training using only selected synthetic tasks.
Result: Synthetic-only adaptation improved predictive accuracy and data efficiency across 35 engineering regression datasets, outperforming TabPFN 2.5 on 29/35 datasets and AutoGluon on 27/35, with mean multiplicative data-efficiency gains of 1.75x and 4.44x respectively.
Conclusion: Principled synthetic data curation can convert procedural generators into domain-relevant “data engines,” enabling foundation models to improve in data-sparse scientific and industrial domains where real data collection is the bottleneck.
Abstract: Predictive modeling in engineering applications has long been dominated by bespoke models and small, siloed tabular datasets, limiting the applicability of large-scale learning approaches. Despite recent progress in tabular foundation models, the resulting synthetic training distributions used for pre-training may not reflect the statistical structure of engineering data, limiting transfer to engineering regression. We introduce TREDBench, a curated collection of 83 real-world tabular regression datasets with expert engineering/non-engineering labels, and use TabPFN 2.5’s dataset-level embedding to study domain structure in a common representation space. We find that engineering datasets are partially distinguishable from non-engineering datasets, while standard procedurally generated datasets are highly distinguishable from engineering datasets, revealing a substantial synthetic-real domain gap. To bridge this gap without training on real engineering samples, we propose an embedding-guided synthetic data curation method: we generate and identify “engineering-like” synthetic datasets, and perform continued pre-training of TabPFN 2.5 using only the selected synthetic tasks. Across 35 engineering regression datasets, this synthetic-only adaptation improves predictive accuracy and data efficiency, outperforming TabPFN 2.5 on 29/35 datasets and AutoGluon on 27/35, with mean multiplicative data-efficiency gains of 1.75x and 4.44x, respectively. More broadly, our results indicate that principled synthetic data curation can convert procedural generators into domain-relevant “data engines,” enabling foundation models to improve in data-sparse scientific and industrial domains where real data collection is the primary bottleneck.
[520] Implicit Bias and Loss of Plasticity in Matrix Completion: Depth Promotes Low-Rankness
Baekrok Shin, Chulhee Yun
Main category: cs.LG
TL;DR: Deep matrix factorization for matrix completion shows that network depth intensifies implicit low-rank bias through coupled dynamics, with depth ≥3 networks converging to rank-1 solutions under certain conditions, unlike shallow networks.
Details
Motivation: To understand how network depth influences training dynamics in deep matrix factorization for matrix completion, particularly the implicit low-rank bias observed in deeper networks that prior shallow models don't fully explain.Method: Analyze matrix completion via deep matrix factorization (deep linear neural networks) under gradient flow with block-diagonal observations. Study coupled dynamics as a key mechanism for low-rank bias and examine how it intensifies with depth.
Result: 1) Networks of depth ≥3 exhibit coupling unless initialized diagonally; 2) Convergence to rank-1 occurs if and only if dynamics is coupled; 3) Deep models avoid plasticity loss due to low-rank bias, while depth-2 networks pre-trained under decoupled dynamics fail to converge to low-rank even with additional data.
Conclusion: Depth in matrix factorization intensifies implicit low-rank bias through coupled dynamics, with deeper networks converging to rank-1 solutions and avoiding plasticity loss, providing theoretical insights into training dynamics of deep linear networks.
Abstract: We study matrix completion via deep matrix factorization (a.k.a. deep linear neural networks) as a simplified testbed to examine how network depth influences training dynamics. Despite the simplicity and importance of the problem, prior theory largely focuses on shallow (depth-2) models and does not fully explain the implicit low-rank bias observed in deeper networks. We identify coupled dynamics as a key mechanism behind this bias and show that it intensifies with increasing depth. Focusing on gradient flow under block-diagonal observations, we prove: (a) networks of depth $\geq 3$ exhibit coupling unless initialized diagonally, and (b) convergence to rank-1 occurs if and only if the dynamics is coupled – resolving an open question by Menon (2024) for a family of initializations. We also revisit the loss of plasticity phenomenon in matrix completion (Kleinman et al., 2024), where pre-training on few observations and resuming with more degrades performance. We show that deep models avoid plasticity loss due to their low-rank bias, whereas depth-2 networks pre-trained under decoupled dynamics fail to converge to low-rank, even when resumed training (with additional data) satisfies the coupling condition – shedding light on the mechanism behind this phenomenon.
[521] Probabilistic Dreaming for World Models
Gavin Wong
Main category: cs.LG
TL;DR: Improved Dreamer model using probabilistic methods for parallel latent state exploration and maintaining distinct future hypotheses, achieving better performance and lower variance in multi-agent environments.
Details
Motivation: To enhance the Dreamer model's ability to learn world models more robustly and sample-efficiently by addressing limitations in exploring latent states and handling mutually exclusive futures while maintaining gradient properties.Method: Introduces probabilistic innovations to Dreamer that enable parallel exploration of many latent states and maintenance of distinct hypotheses for mutually exclusive futures while preserving continuous latent gradient properties.
Result: Outperforms standard Dreamer with 4.5% score improvement and 28% lower variance in episode returns on the MPE SimpleTag domain.
Conclusion: Probabilistic enhancements to Dreamer improve world model learning, though scaling optimal hyperparameters with environmental complexity and capturing epistemic uncertainty remain future challenges.
Abstract: “Dreaming” enables agents to learn from imagined experiences, enabling more robust and sample-efficient learning of world models. In this work, we consider innovations to the state-of-the-art Dreamer model using probabilistic methods that enable: (1) the parallel exploration of many latent states; and (2) maintaining distinct hypotheses for mutually exclusive futures while retaining the desirable gradient properties of continuous latents. Evaluating on the MPE SimpleTag domain, our method outperforms standard Dreamer with a 4.5% score improvement and 28% lower variance in episode returns. We also discuss limitations and directions for future work, including how optimal hyperparameters (e.g. particle count K) scale with environmental complexity, and methods to capture epistemic uncertainty in world models.
[522] Count Bridges enable Modeling and Deconvolving Transcriptomic Data
Nic Fishman, Gokul Gowri, Tanush Kumar, Jiaqi Lu, Valentin de Bortoli, Jonathan S. Gootenberg, Omar Abudayyeh
Main category: cs.LG
TL;DR: Count Bridges: A stochastic bridge process for integer-valued data that enables exact, tractable generative modeling and deconvolution of aggregated biological measurements like RNA-seq data.
Details
Motivation: Biological assays often produce integer-valued counts (like RNA sequencing) that are aggregated over multiple cells, making it difficult to model at single-cell resolution. Existing generative frameworks don't handle integer-valued data well or provide systematic deconvolution methods.Method: Introduces Count Bridges, a stochastic bridge process on integers with closed-form conditionals for efficient training/sampling. Extends framework with Expectation-Maximization approach to train directly from aggregated measurements by treating unit-level counts as latent variables.
Result: Achieves state-of-the-art performance on integer distribution matching benchmarks against flow matching baselines. Successfully applied to single-cell gene expression modeling at nucleotide resolution and deconvolving bulk RNA-seq and spatial transcriptomic spots.
Conclusion: Provides principled foundation for generative modeling and deconvolution of biological count data across scales and modalities, addressing key challenges in analyzing aggregated biological measurements.
Abstract: Many modern biological assays, including RNA sequencing, yield integer-valued counts that reflect the number of molecules detected. These measurements are often not at the desired resolution: while the unit of interest is typically a single cell, many measurement technologies produce counts aggregated over sets of cells. Although recent generative frameworks such as diffusion and flow matching have been extended to non-Euclidean and discrete settings, it remains unclear how best to model integer-valued data or how to systematically deconvolve aggregated observations. We introduce Count Bridges, a stochastic bridge process on the integers that provides an exact, tractable analogue of diffusion-style models for count data, with closed-form conditionals for efficient training and sampling. We extend this framework to enable direct training from aggregated measurements via an Expectation-Maximization-style approach that treats unit-level counts as latent variables. We demonstrate state-of-the-art performance on integer distribution matching benchmarks, comparing against flow matching and discrete flow matching baselines across various metrics. We then apply Count Bridges to two large-scale problems in biology: modeling single-cell gene expression data at the nucleotide resolution, with applications to deconvolving bulk RNA-seq, and resolving multicellular spatial transcriptomic spots into single-cell count profiles. Our methods offer a principled foundation for generative modeling and deconvolution of biological count data across scales and modalities.
[523] When Priors Backfire: On the Vulnerability of Unlearnable Examples to Pretraining
Zhihao Li, Gezheng Xu, Jiale Cai, Ruiyi Fang, Di Wu, Qicheng Lao, Charles Ling, Boyu Wang
Main category: cs.LG
TL;DR: BAIT introduces a bi-level optimization method to create unlearnable examples that remain effective even when training starts from pretrained models, overcoming the vulnerability of existing UE methods to pretraining priors.
Details
Motivation: Current Unlearnable Examples (UEs) fail when training starts from pretrained models because pretraining priors provide rich semantic representations that allow models to bypass the injected perturbations and learn genuine features, nullifying data protection.Method: BAIT uses bi-level optimization: inner level associates perturbed samples with real labels to simulate standard data-label alignment, while outer level enforces mislabel-perturbation binding that maps samples to designated incorrect targets, overriding semantic guidance from pretraining priors.
Result: Extensive experiments on standard benchmarks with multiple pretrained backbones show BAIT effectively mitigates pretraining prior influence and maintains data unlearnability where previous UE methods fail.
Conclusion: BAIT successfully addresses the fundamental vulnerability of UEs to pretraining priors through its novel bi-level optimization approach, enabling effective data protection even when training starts from pretrained models.
Abstract: Unlearnable Examples (UEs) serve as a data protection strategy that generates imperceptible perturbations to mislead models into learning spurious correlations instead of underlying semantics. In this paper, we uncover a fundamental vulnerability of UEs that emerges when learning starts from a pretrained model. Crucially, our empirical analysis shows that even when data are protected by carefully crafted perturbations, pretraining priors still furnish rich semantic representations that allow the model to circumvent the shortcuts introduced by UEs and capture genuine features, thereby nullifying unlearnability. To address this, we propose BAIT (Binding Artificial perturbations to Incorrect Targets), a novel bi-level optimization formulation. Specifically, the inner level aims at associating the perturbed samples with real labels to simulate standard data-label alignment, while the outer level actively disrupts this alignment by enforcing a mislabel-perturbation binding that maps samples to designated incorrect targets. This mechanism effectively overrides the semantic guidance of priors, forcing the model to rely on the injected perturbations and consequently preventing the acquisition of true semantics. Extensive experiments on standard benchmarks and multiple pretrained backbones demonstrate that BAIT effectively mitigates the influence of pretraining priors and maintains data unlearnability.
[524] Distribution-Conditioned Transport
Nic Fishman, Gokul Gowri, Paolo L. B. Fischer, Marinka Zitnik, Omar Abudayyeh, Jonathan Gootenberg
Main category: cs.LG
TL;DR: Distribution-Conditioned Transport (DCT) framework enables transport models to generalize to unseen source/target distributions by conditioning on learned embeddings, with applications in biological data analysis.
Details
Motivation: Scientific applications increasingly require transport models that can generalize to source and target distributions unseen during training, moving beyond traditional transport models that only work on specific distribution pairs seen during training.Method: DCT conditions transport maps on learned embeddings of source and target distributions, enabling generalization to unseen distribution pairs. It’s agnostic to underlying transport mechanisms (flow matching, Wasserstein, MMD) and supports semi-supervised learning by leveraging distributions observed at only one condition.
Result: Demonstrated practical performance benefits on synthetic benchmarks and four biological applications: batch effect transfer in single-cell genomics, perturbation prediction from mass cytometry data, learning clonal transcriptional dynamics in hematopoiesis, and modeling T-cell receptor sequence evolution.
Conclusion: DCT provides a flexible framework for distribution-conditioned transport that enables generalization to unseen distribution pairs and supports semi-supervised learning, with demonstrated effectiveness across multiple biological applications.
Abstract: Learning a transport model that maps a source distribution to a target distribution is a canonical problem in machine learning, but scientific applications increasingly require models that can generalize to source and target distributions unseen during training. We introduce distribution-conditioned transport (DCT), a framework that conditions transport maps on learned embeddings of source and target distributions, enabling generalization to unseen distribution pairs. DCT also allows semi-supervised learning for distributional forecasting problems: because it learns from arbitrary distribution pairs, it can leverage distributions observed at only one condition to improve transport prediction. DCT is agnostic to the underlying transport mechanism, supporting models ranging from flow matching to distributional divergence-based models (e.g. Wasserstein, MMD). We demonstrate the practical performance benefits of DCT on synthetic benchmarks and four applications in biology: batch effect transfer in single-cell genomics, perturbation prediction from mass cytometry data, learning clonal transcriptional dynamics in hematopoiesis, and modeling T-cell receptor sequence evolution.
[525] KindSleep: Knowledge-Informed Diagnosis of Obstructive Sleep Apnea from Oximetry
Micky C Nnamdi, Wenqi Shi, Cheng Wan, J. Ben Tamo, Benjamin M Smith, Chad A Purnell, May D Wang
Main category: cs.LG
TL;DR: KindSleep: A deep learning framework that integrates clinical knowledge with single-channel oximetry signals and clinical data for precise obstructive sleep apnea diagnosis, achieving state-of-the-art performance on large datasets.
Details
Motivation: Traditional OSA diagnosis via polysomnography is resource-intensive and limits widespread access, creating a critical need for accurate and efficient alternatives for this common sleep disorder affecting nearly one billion people globally.Method: KindSleep first learns clinically interpretable concepts (desaturation indices, respiratory disturbance events) directly from raw oximetry signals, then fuses these AI-derived concepts with multimodal clinical data to estimate the Apnea-Hypopnea Index (AHI).
Result: Evaluated on three large independent datasets (total n=9,815), KindSleep demonstrates excellent performance in estimating AHI scores (R²=0.917, ICC=0.957) and consistently outperforms existing approaches in classifying OSA severity, achieving weighted F1-scores from 0.827 to 0.941 across diverse populations.
Conclusion: KindSleep provides a transparent and trustworthy diagnostic tool for sleep medicine by grounding predictions in clinically meaningful concepts, offering an efficient alternative to resource-intensive polysomnography.
Abstract: Obstructive sleep apnea (OSA) is a sleep disorder that affects nearly one billion people globally and significantly elevates cardiovascular risk. Traditional diagnosis through polysomnography is resource-intensive and limits widespread access, creating a critical need for accurate and efficient alternatives. In this paper, we introduce KindSleep, a deep learning framework that integrates clinical knowledge with single-channel patient-specific oximetry signals and clinical data for precise OSA diagnosis. KindSleep first learns to identify clinically interpretable concepts, such as desaturation indices and respiratory disturbance events, directly from raw oximetry signals. It then fuses these AI-derived concepts with multimodal clinical data to estimate the Apnea-Hypopnea Index (AHI). We evaluate KindSleep on three large, independent datasets from the National Sleep Research Resource (SHHS, CFS, MrOS; total n = 9,815). KindSleep demonstrates excellent performance in estimating AHI scores (R2 = 0.917, ICC = 0.957) and consistently outperforms existing approaches in classifying OSA severity, achieving weighted F1-scores from 0.827 to 0.941 across diverse populations. By grounding its predictions in a layer of clinically meaningful concepts, KindSleep provides a more transparent and trustworthy diagnostic tool for sleep medicine practices.
[526] ConTSG-Bench: A Unified Benchmark for Conditional Time Series Generation
Shaocheng Lan, Shuqi Gu, Zhangzhi Xiong, Kan Ren
Main category: cs.LG
TL;DR: ConTSG-Bench: A comprehensive benchmarking framework for conditional time series generation with diverse conditioning modalities and systematic evaluation metrics.
Details
Motivation: The field of conditional time series generation lacks standardized benchmarking frameworks despite its importance for addressing data scarcity and enabling causal analysis in real-world applications.Method: Introduces ConTSG-Bench, which includes a large-scale, well-aligned dataset spanning diverse conditioning modalities and semantic abstraction levels, enabling systematic evaluation of generation methods with comprehensive metrics for fidelity and condition adherence.
Result: The benchmark reveals traits and limitations of current approaches, highlighting critical challenges in precise structural controllability and downstream task utility under complex conditions.
Conclusion: ConTSG-Bench provides a standardized framework for evaluating conditional time series generation, identifying key research directions for improving controllability and utility in complex conditional settings.
Abstract: Conditional time series generation plays a critical role in addressing data scarcity and enabling causal analysis in real-world applications. Despite its increasing importance, the field lacks a standardized and systematic benchmarking framework for evaluating generative models across diverse conditions. To address this gap, we introduce the Conditional Time Series Generation Benchmark (ConTSG-Bench). ConTSG-Bench comprises a large-scale, well-aligned dataset spanning diverse conditioning modalities and levels of semantic abstraction, first enabling systematic evaluation of representative generation methods across these dimensions with a comprehensive suite of metrics for generation fidelity and condition adherence. Both the quantitative benchmarking and in-depth analyses of conditional generation behaviors have revealed the traits and limitations of the current approaches, highlighting critical challenges and promising research directions, particularly with respect to precise structural controllability and downstream task utility under complex conditions.
[527] Distributional Reinforcement Learning with Information Bottleneck for Uncertainty-Aware DRAM Equalization
Muhammad Usama, Dong Eui Chang
Main category: cs.LG
TL;DR: A distributional risk-sensitive RL framework for equalizer parameter optimization using Information Bottleneck compression and CVaR optimization with worst-case guarantees and uncertainty quantification.
Details
Motivation: Existing equalizer optimization methods are computationally expensive (eye diagram evaluation), optimize expected rather than worst-case performance, and lack uncertainty quantification for deployment decisions in high-speed memory systems.Method: Distributional risk-sensitive RL integrating Information Bottleneck latent representations with Conditional Value-at-Risk optimization, using rate-distortion optimal signal compression, Monte Carlo dropout for epistemic uncertainty, quantile regression for worst-case optimization, and PAC-Bayesian regularization for generalization bounds.
Result: Achieved 51x speedup over eye diagrams, mean improvements of 37.1% (4-tap) and 41.5% (8-tap) with worst-case guarantees of 33.8% and 38.2%, representing 80.7% and 89.1% improvements over Q-learning baselines, and 62.5% high-reliability classification eliminating manual validation.
Conclusion: The framework provides a practical solution for production-scale equalizer optimization with certified worst-case guarantees, addressing computational efficiency, worst-case performance, and uncertainty quantification challenges.
Abstract: Equalizer parameter optimization is critical for signal integrity in high-speed memory systems operating at multi-gigabit data rates. However, existing methods suffer from computationally expensive eye diagram evaluation, optimization of expected rather than worst-case performance, and absence of uncertainty quantification for deployment decisions. In this paper, we propose a distributional risk-sensitive reinforcement learning framework integrating Information Bottleneck latent representations with Conditional Value-at-Risk optimization. We introduce rate-distortion optimal signal compression achieving 51 times speedup over eye diagrams while quantifying epistemic uncertainty through Monte Carlo dropout. Distributional reinforcement learning with quantile regression enables explicit worst-case optimization, while PAC-Bayesian regularization certifies generalization bounds. Experimental validation on 2.4 million waveforms from eight memory units demonstrated mean improvements of 37.1% and 41.5% for 4-tap and 8-tap equalizer configurations with worst-case guarantees of 33.8% and 38.2%, representing 80.7% and 89.1% improvements over Q-learning baselines. The framework achieved 62.5% high-reliability classification eliminating manual validation for most configurations. These results suggest the proposed framework provides a practical solution for production-scale equalizer optimization with certified worst-case guarantees.
[528] Distributional Equivalence in Linear Non-Gaussian Latent-Variable Cyclic Causal Models: Characterization and Learning
Haoyue Dai, Immanuel Albrecht, Peter Spirtes, Kun Zhang
Main category: cs.LG
TL;DR: Establishes first equivalence characterization for linear non-Gaussian models with arbitrary latent variables and cycles, enabling structural-assumption-free causal discovery
Details
Motivation: Existing causal discovery methods with latent variables rely on strong structural assumptions; there's a need for general, assumption-free approaches, but the core obstacle is lack of equivalence characterization - without knowing what can be identified, one cannot design methods for how to identify itMethod: Establishes graphical criterion for distributional equivalence of graphs with arbitrary latent structure and cycles in linear non-Gaussian models; introduces new tool called edge rank constraints; provides procedure to traverse equivalence class and algorithm to recover models from data up to such equivalence
Result: First equivalence characterization with latent variables in any parametric setting without structural assumptions, and consequently the first structural-assumption-free discovery method; code and interactive demo available
Conclusion: Closes the gap in causal discovery with latent variables by providing the necessary equivalence characterization for linear non-Gaussian models, enabling assumption-free methods for the first time
Abstract: Causal discovery with latent variables is a fundamental task. Yet most existing methods rely on strong structural assumptions, such as enforcing specific indicator patterns for latents or restricting how they can interact with others. We argue that a core obstacle to a general, structural-assumption-free approach is the lack of an equivalence characterization: without knowing what can be identified, one generally cannot design methods for how to identify it. In this work, we aim to close this gap for linear non-Gaussian models. We establish the graphical criterion for when two graphs with arbitrary latent structure and cycles are distributionally equivalent, that is, they induce the same observed distribution set. Key to our approach is a new tool, edge rank constraints, which fills a missing piece in the toolbox for latent-variable causal discovery in even broader settings. We further provide a procedure to traverse the whole equivalence class and develop an algorithm to recover models from data up to such equivalence. To our knowledge, this is the first equivalence characterization with latent variables in any parametric setting without structural assumptions, and hence the first structural-assumption-free discovery method. Code and an interactive demo are available at https://equiv.cc.
[529] Diffusion Policy through Conditional Proximal Policy Optimization
Ben Liu, Shunpeng Yang, Hua Chen
Main category: cs.LG
TL;DR: A novel method for training diffusion policies in on-policy reinforcement learning that efficiently computes action log-likelihood using only simple Gaussian probability evaluation, enabling multimodal behaviors and entropy regularization.
Details
Motivation: Diffusion policies show strong potential for modeling multimodal behaviors in RL but face challenges in computing action log-likelihood efficiently for on-policy learning. Existing methods require evaluating the entire denoising process, which is computationally expensive and memory-intensive.Method: Proposes aligning policy iteration with the diffusion process to enable training diffusion policies in on-policy RL using only simple Gaussian probability evaluation, avoiding the need to compute log-likelihood through the entire denoising chain.
Result: The method produces multimodal policy behaviors and achieves superior performance on benchmark tasks in IsaacLab and MuJoCo Playground environments, while naturally handling entropy regularization.
Conclusion: The proposed approach provides an efficient way to train diffusion policies in on-policy RL settings, overcoming computational challenges of previous methods while enabling multimodal behavior modeling and entropy regularization.
Abstract: Reinforcement learning (RL) has been extensively employed in a wide range of decision-making problems, such as games and robotics. Recently, diffusion policies have shown strong potential in modeling multi-modal behaviors, enabling more diverse and flexible action generation compared to the conventional Gaussian policy. Despite various attempts to combine RL with diffusion, a key challenge is the difficulty of computing action log-likelihood under the diffusion model. This greatly hinders the direct application of diffusion policies in on-policy reinforcement learning. Most existing methods calculate or approximate the log-likelihood through the entire denoising process in the diffusion model, which can be memory- and computationally inefficient. To overcome this challenge, we propose a novel and efficient method to train a diffusion policy in an on-policy setting that requires only evaluating a simple Gaussian probability. This is achieved by aligning the policy iteration with the diffusion process, which is a distinct paradigm compared to previous work. Moreover, our formulation can naturally handle entropy regularization, which is often difficult to incorporate into diffusion policies. Experiments demonstrate that the proposed method produces multimodal policy behaviors and achieves superior performance on a variety of benchmark tasks in both IsaacLab and MuJoCo Playground.
[530] Multilevel Training for Kolmogorov Arnold Networks
Ben S. Southworth, Jonas A. Actor, Graham Harper, Eric C. Cyr
Main category: cs.LG
TL;DR: Multilevel training algorithm for Kolmogorov-Arnold Networks (KANs) that exploits their structured spline basis functions to achieve orders of magnitude training speedup compared to conventional methods.
Details
Motivation: Neural network training is computationally expensive due to lack of structure in common architectures. KANs provide more structure through spline basis functions, enabling development of efficient multilevel training algorithms.Method: Established equivalence between KANs with spline basis and multichannel MLPs with power ReLU activations via linear basis change. Developed multilevel training approach using uniform refinement of spline knots with analytic geometric interpolation operators between models.
Result: Multilevel training achieves orders of magnitude improvement in accuracy over conventional methods for training comparable KANs or MLPs, particularly effective for physics informed neural networks.
Conclusion: Principled neural network design (like KANs) creates exploitable structure enabling multilevel algorithms that dramatically improve training performance, demonstrating value of structured architectures.
Abstract: Algorithmic speedup of training common neural architectures is made difficult by the lack of structure guaranteed by the function compositions inherent to such networks. In contrast to multilayer perceptrons (MLPs), Kolmogorov-Arnold networks (KANs) provide more structure by expanding learned activations in a specified basis. This paper exploits this structure to develop practical algorithms and theoretical insights, yielding training speedup via multilevel training for KANs. To do so, we first establish an equivalence between KANs with spline basis functions and multichannel MLPs with power ReLU activations through a linear change of basis. We then analyze how this change of basis affects the geometry of gradient-based optimization with respect to spline knots. The KANs change-of-basis motivates a multilevel training approach, where we train a sequence of KANs naturally defined through a uniform refinement of spline knots with analytic geometric interpolation operators between models. The interpolation scheme enables a ``properly nested hierarchy’’ of architectures, ensuring that interpolation to a fine model preserves the progress made on coarse models, while the compact support of spline basis functions ensures complementary optimization on subsequent levels. Numerical experiments demonstrate that our multilevel training approach can achieve orders of magnitude improvement in accuracy over conventional methods to train comparable KANs or MLPs, particularly for physics informed neural networks. Finally, this work demonstrates how principled design of neural networks can lead to exploitable structure, and in this case, multilevel algorithms that can dramatically improve training performance.
[531] Missingness Bias Calibration in Feature Attribution Explanations
Shailesh Sridhar, Anton Xue, Eric Wong
Main category: cs.LG
TL;DR: MCal is a lightweight post-hoc method that corrects missingness bias in feature importance scores by fine-tuning a linear head on frozen model outputs, outperforming heavyweight approaches across medical vision, language, and tabular domains.
Details
Motivation: Existing explanation methods produce unreliable feature importance scores due to missingness bias, which arises when models are probed with ablated, out-of-distribution inputs. Current solutions treat this as a deep representational flaw requiring expensive retraining or architectural changes.Method: MCal treats missingness bias as a superficial artifact of the model’s output space. It corrects this bias by fine-tuning a simple linear head on the outputs of a frozen base model, requiring no architectural modifications or retraining of the base model.
Result: The simple correction consistently reduces missingness bias and is competitive with or outperforms prior heavyweight approaches across diverse medical benchmarks spanning vision, language, and tabular domains.
Conclusion: Missingness bias can be effectively treated as a superficial output-space artifact rather than a deep representational flaw, enabling lightweight post-hoc correction that matches or exceeds performance of expensive retraining approaches.
Abstract: Popular explanation methods often produce unreliable feature importance scores due to missingness bias, a systematic distortion that arises when models are probed with ablated, out-of-distribution inputs. Existing solutions treat this as a deep representational flaw that requires expensive retraining or architectural modifications. In this work, we challenge this assumption and show that missingness bias can be effectively treated as a superficial artifact of the model’s output space. We introduce MCal, a lightweight post-hoc method that corrects this bias by fine-tuning a simple linear head on the outputs of a frozen base model. Surprisingly, we find this simple correction consistently reduces missingness bias and is competitive with, or even outperforms, prior heavyweight approaches across diverse medical benchmarks spanning vision, language, and tabular domains.
[532] Why Is RLHF Alignment Shallow? A Gradient Analysis
Robin Young
Main category: cs.LG
TL;DR: The paper proves that standard gradient-based safety alignment in LLMs is inherently shallow because gradients concentrate only on positions where harm is decided, leaving later positions with zero gradient signal.
Details
Motivation: To understand why safety alignment in large language models appears shallow in practice, and to provide theoretical explanations for empirical observations that KL divergence between aligned and base models concentrates on early tokens.Method: Uses martingale decomposition of sequence-level harm to derive exact characterization of alignment gradients, showing gradient at position t equals covariance between conditional expected harm and score function. Introduces harm information I_t to quantify each position’s influence on harm.
Result: Proves that positions beyond the harm horizon (where output’s harmfulness is already determined) receive zero gradient signal, explaining why standard alignment objectives cannot produce deep alignment. Derives recovery penalty objective that creates gradient signals at all positions.
Conclusion: Standard safety alignment is fundamentally shallow due to gradient concentration on harm decision points. The paper provides theoretical grounding for data augmentation techniques and proposes recovery penalty objectives for deeper alignment.
Abstract: Why is safety alignment in LLMs shallow? We prove that gradient-based alignment inherently concentrates on positions where harm is decided and vanishes beyond. Using a martingale decomposition of sequence-level harm, we derive an exact characterization of alignment gradients. The gradient at position $t$ equals the covariance between the conditional expected harm and the score function. This implies that positions beyond the harm horizon where the output’s harmfulness is already determined receive zero gradient signal during training. This explains empirical observations that KL divergence between aligned and base models concentrates on early tokens. Consequently, standard alignment objectives cannot produce deep alignment, regardless of optimization quality. We introduce the concept of harm information $I_t$, which quantifies each position’s influence on harm, and prove that equilibrium KL divergence tracks this quantity. Finally, we derive an objective based on recovery penalties that creates gradient signal at all positions, providing theoretical grounding for empirically successful data augmentation techniques.
[533] Differential Privacy in Two-Layer Networks: How DP-SGD Harms Fairness and Robustness
Ruichen Xu, Kexin Chen
Main category: cs.LG
TL;DR: Theoretical analysis shows DP-SGD in neural networks degrades performance, causes fairness issues, and reduces robustness due to noise injection affecting feature learning dynamics.
Details
Motivation: Differentially private learning empirically degrades model performance, introduces fairness issues, and reduces adversarial robustness, but theoretical understanding of these phenomena in modern neural networks is lacking.Method: Introduces a unified feature-centric framework to analyze DP-SGD dynamics in two-layer ReLU convolutional neural networks, establishing test loss bounds governed by feature-to-noise ratio (FNR).
Result: Noise required for privacy leads to suboptimal feature learning: 1) imbalanced FNRs cause disparate impact, 2) noise harms semantically long-tailed data more, 3) noise increases vulnerability to adversarial attacks, and 4) public pre-training + private fine-tuning doesn’t guarantee improvement under feature distribution shifts.
Conclusion: Theoretical framework explains how DP noise affects feature learning, causing performance degradation, fairness issues, and reduced robustness, with implications for privacy-preserving ML design.
Abstract: Differentially private learning is essential for training models on sensitive data, but empirical studies consistently show that it can degrade performance, introduce fairness issues like disparate impact, and reduce adversarial robustness. The theoretical underpinnings of these phenomena in modern, non-convex neural networks remain largely unexplored. This paper introduces a unified feature-centric framework to analyze the feature learning dynamics of differentially private stochastic gradient descent (DP-SGD) in two-layer ReLU convolutional neural networks. Our analysis establishes test loss bounds governed by a crucial metric: the feature-to-noise ratio (FNR). We demonstrate that the noise required for privacy leads to suboptimal feature learning, and specifically show that: 1) imbalanced FNRs across classes and subpopulations cause disparate impact; 2) even in the same class, noise has a greater negative impact on semantically long-tailed data; and 3) noise injection exacerbates vulnerability to adversarial attacks. Furthermore, our analysis reveals that the popular paradigm of public pre-training and private fine-tuning does not guarantee improvement, particularly under significant feature distribution shifts between datasets. Experiments on synthetic and real-world data corroborate our theoretical findings.
[534] FedAFD: Multimodal Federated Learning via Adversarial Fusion and Distillation
Min Tan, Junchao Ma, Yinfu Feng, Jiajun Ding, Wenwen Pan, Tingting Han, Qian Zheng, Zhenzhong Kuang, Zhou Yu
Main category: cs.LG
TL;DR: FedAFD is a multimodal federated learning framework that addresses modality/task discrepancies and model heterogeneity through client-side adversarial alignment and granularity-aware fusion, plus server-side similarity-guided ensemble distillation.
Details
Motivation: Existing multimodal federated learning methods struggle with personalized client performance, modality/task discrepancies, and model heterogeneity when clients have different data modalities and need to collaborate without sharing raw data.Method: Client-side: bi-level adversarial alignment to align local/global representations within/across modalities; granularity-aware fusion to integrate global knowledge into personalized features. Server-side: similarity-guided ensemble distillation that aggregates client representations on shared public data based on feature similarity and distills knowledge into global model.
Result: Extensive experiments under IID and non-IID settings show FedAFD achieves superior performance and efficiency for both client and server compared to existing methods.
Conclusion: FedAFD provides an effective unified framework for multimodal federated learning that handles modality/task discrepancies and model heterogeneity while improving both client personalization and global model performance.
Abstract: Multimodal Federated Learning (MFL) enables clients with heterogeneous data modalities to collaboratively train models without sharing raw data, offering a privacy-preserving framework that leverages complementary cross-modal information. However, existing methods often overlook personalized client performance and struggle with modality/task discrepancies, as well as model heterogeneity. To address these challenges, we propose FedAFD, a unified MFL framework that enhances client and server learning. On the client side, we introduce a bi-level adversarial alignment strategy to align local and global representations within and across modalities, mitigating modality and task gaps. We further design a granularity-aware fusion module to integrate global knowledge into the personalized features adaptively. On the server side, to handle model heterogeneity, we propose a similarity-guided ensemble distillation mechanism that aggregates client representations on shared public data based on feature similarity and distills the fused knowledge into the global model. Extensive experiments conducted under both IID and non-IID settings demonstrate that FedAFD achieves superior performance and efficiency for both the client and the server.
[535] U-Parking: Distributed UWB-Assisted Autonomous Parking System with Robust Localization and Intelligent Planning
Yiang Wu, Qiong Wu, Pingyi Fan, Kezhi Wang, Wen Chen, Guoqiang Mao, Khaled B. Letaief
Main category: cs.LG
TL;DR: U-Parking: A distributed UWB-assisted autonomous parking system using LLM-assisted planning with fusion localization and trajectory tracking for reliable indoor automated parking.
Details
Motivation: To enable reliable automated parking in challenging indoor environments where traditional GPS-based systems fail, by combining UWB technology with LLM capabilities for improved planning and execution.Method: Integrates Large Language Models (LLMs) for planning assistance with robust fusion localization (combining UWB with other sensors) and trajectory tracking in a distributed architecture using Ultra-Wideband technology.
Result: Validated through real-vehicle demonstrations showing reliable automated parking performance in challenging indoor environments where conventional systems would struggle.
Conclusion: U-Parking demonstrates that combining LLM-assisted planning with UWB-based localization enables effective autonomous parking in GPS-denied indoor environments, representing a practical application of multimodal AI in robotics.
Abstract: This demonstration presents U-Parking, a distributed Ultra-Wideband (UWB)-assisted autonomous parking system. By integrating Large Language Models (LLMs)-assisted planning with robust fusion localization and trajectory tracking, it enables reliable automated parking in challenging indoor environments, as validated through real-vehicle demonstrations.
[536] EVMbench: Evaluating AI Agents on Smart Contract Security
Justin Wang, Andreas Bigger, Xiaohai Xu, Justin W. Lin, Andy Applebaum, Tejal Patwardhan, Alpin Yukseloglu, Olivia Watkins
Main category: cs.LG
TL;DR: EVMbench is an evaluation framework for AI agents to detect, patch, and exploit smart contract vulnerabilities using 117 curated vulnerabilities from 40 repositories with programmatic grading in a local Ethereum environment.
Details
Motivation: Smart contracts manage large amounts of value on public blockchains, and vulnerabilities can lead to substantial losses. As AI agents become more capable at reading, writing, and running code, there's a need to evaluate how well they can navigate this landscape both for improving security and potentially increasing risk.Method: Created EVMbench evaluation framework with 117 curated vulnerabilities from 40 repositories. Uses programmatic grading based on tests and blockchain state under a local Ethereum execution environment. Evaluates frontier AI agents’ ability to detect, patch, and exploit vulnerabilities end-to-end against live blockchain instances.
Result: Frontier agents are capable of discovering and exploiting vulnerabilities end-to-end against live blockchain instances. The framework demonstrates AI agents’ current capabilities in smart contract security analysis.
Conclusion: AI agents already show significant capabilities in smart contract vulnerability analysis, with implications for both security improvement and potential risk. The released code, tasks, and tooling support continued measurement of these capabilities and future security work.
Abstract: Smart contracts on public blockchains now manage large amounts of value, and vulnerabilities in these systems can lead to substantial losses. As AI agents become more capable at reading, writing, and running code, it is natural to ask how well they can already navigate this landscape, both in ways that improve security and in ways that might increase risk. We introduce EVMbench, an evaluation that measures the ability of agents to detect, patch, and exploit smart contract vulnerabilities. EVMbench draws on 117 curated vulnerabilities from 40 repositories and, in the most realistic setting, uses programmatic grading based on tests and blockchain state under a local Ethereum execution environment. We evaluate a range of frontier agents and find that they are capable of discovering and exploiting vulnerabilities end-to-end against live blockchain instances. We release code, tasks, and tooling to support continued measurement of these capabilities and future work on security.
[537] BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning
Yuan Li, Bo Wang, Yufei Gao, Yuqian Yao, Xinyuan Wang, Zhangyue Yin, Xipeng Qiu
Main category: cs.LG
TL;DR: BandPO introduces a dynamic clipping mechanism for PPO that replaces fixed bounds with probability-aware intervals to prevent entropy collapse and better explore high-advantage strategies.
Details
Motivation: The paper identifies that fixed clipping bounds in PPO constrain upward updates of low-probability actions, disproportionately suppressing high-advantage tail strategies and causing rapid entropy collapse, which limits exploration in RL for LLMs.Method: Introduces Band-constrained Policy Optimization (BandPO) which replaces canonical clipping with “Band” - a theoretical operator that projects trust regions defined by f-divergences into dynamic, probability-aware clipping intervals. Formulates this as a convex optimization problem with closed-form solutions for specific divergences.
Result: Extensive experiments show BandPO consistently outperforms canonical clipping and Clip-Higher across diverse models and datasets, while robustly mitigating entropy collapse.
Conclusion: BandPO effectively resolves the exploration bottleneck in PPO by introducing dynamic, probability-aware clipping intervals that better handle low-probability, high-advantage actions while maintaining stability.
Abstract: Proximal constraints are fundamental to the stability of the Large Language Model reinforcement learning. While the canonical clipping mechanism in PPO serves as an efficient surrogate for trust regions, we identify a critical bottleneck: fixed bounds strictly constrain the upward update margin of low-probability actions, disproportionately suppressing high-advantage tail strategies and inducing rapid entropy collapse. To address this, we introduce Band-constrained Policy Optimization (BandPO). BandPO replaces canonical clipping with Band, a unified theoretical operator that projects trust regions defined by f-divergences into dynamic, probability-aware clipping intervals. Theoretical analysis confirms that Band effectively resolves this exploration bottleneck. We formulate this mapping as a convex optimization problem, guaranteeing a globally optimal numerical solution while deriving closed-form solutions for specific divergences. Extensive experiments across diverse models and datasets demonstrate that BandPO consistently outperforms canonical clipping and Clip-Higher, while robustly mitigating entropy collapse.
[538] Semantic Communication-Enhanced Split Federated Learning for Vehicular Networks: Architecture, Challenges, and Case Study
Lu Yu, Zheng Chang, Ying-Chang Liang
Main category: cs.LG
TL;DR: Semantic communication-enhanced split federated learning framework for vehicular edge intelligence that reduces communication overhead and enhances label privacy through task-relevant information compression and adaptive rate adjustment.
Details
Motivation: Traditional centralized learning in vehicular networks suffers from high communication overhead and privacy risks. Split federated learning helps but still faces communication bottlenecks from transmitting high-dimensional features and label privacy concerns.Method: Proposes SC-USFL framework with semantic communication module (SCM) using pre-trained encoding/decoding units to compress and transmit only task-relevant semantic information. Includes network status monitor (NSM) for adaptive compression rate adjustment based on wireless channel conditions.
Result: The framework demonstrates efficient balancing of communication load, privacy preservation, and learning performance in resource-constrained vehicular environments.
Conclusion: SC-USFL offers a promising approach for vehicular edge intelligence, with semantic communication enhancing split federated learning by reducing overhead while maintaining privacy and performance.
Abstract: Vehicular edge intelligence (VEI) is vital for future intelligent transportation systems. However, traditional centralized learning in dynamic vehicular networks faces significant communication overhead and privacy risks. Split federated learning (SFL) offers a distributed solution but is often hindered by substantial communication bottlenecks from transmitting high-dimensional intermediate features and can present label privacy concerns. Semantic communication offers a transformative approach to alleviate these communication challenges in SFL by focusing on transmitting only task-relevant information. This paper leverages the advantages of semantic communication in the design of SFL, and presents a case study the semantic communication-enhanced U-Shaped split federated learning (SC-USFL) framework that inherently enhances label privacy by localizing sensitive computations with reduced overhead. It features a dedicated semantic communication module (SCM), with pre-trained and parameter-frozen encoding/decoding units, to efficiently compress and transmit only the task-relevant semantic information over the critical uplink path from vehicular users to the edge server (ES). Furthermore, a network status monitor (NSM) module enables adaptive adjustment of the semantic compression rate in real-time response to fluctuating wireless channel conditions. The SC-USFL framework demonstrates a promising approach for efficiently balancing communication load, preserving privacy, and maintaining learning performance in resource-constrained vehicular environments. Finally, this paper highlights key open research directions to further advance the synergy between semantic communication and SFL in the vehicular network.
[539] $\nabla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space
Peihao Wang, Ruisi Cai, Zhen Wang, Hongyuan Mei, Qiang Liu, Pan Li, Zhangyang Wang
Main category: cs.LG
TL;DR: ∇-Reasoner: A differentiable optimization framework for LLM decoding that uses gradient signals to refine token logits during inference, improving reasoning performance while reducing model calls.
Details
Motivation: Existing inference-time scaling methods rely on inefficient discrete search or trial-and-error prompting. There's a need for more efficient first-order optimization approaches to improve LLM reasoning at test time.Method: Proposes ∇-Reasoner framework with Differentiable Textual Optimization (DTO) that performs gradient descent in token logit space using signals from LLM likelihood and reward models. Incorporates rejection sampling and acceleration techniques.
Result: Achieves over 20% accuracy improvement on challenging mathematical reasoning benchmark while reducing model calls by 10-40% compared to strong baselines.
Conclusion: Introduces paradigm shift from zeroth-order search to first-order optimization at test time, offering cost-effective path to amplify LLM reasoning capabilities.
Abstract: Scaling inference-time compute for Large Language Models (LLMs) has unlocked unprecedented reasoning capabilities. However, existing inference-time scaling methods typically rely on inefficient and suboptimal discrete search algorithms or trial-and-error prompting to improve the online policy. In this paper, we propose $\nabla$-Reasoner, an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop to refine the policy on the fly. Our core component, Differentiable Textual Optimization (DTO), leverages gradient signals from both the LLM’s likelihood and a reward model to refine textual representations. $\nabla$-Reasoner further incorporates rejection sampling and acceleration design to robustify and speed up decoding. Theoretically, we show that performing inference-time gradient descent in the sample space to maximize reward is dual to aligning an LLM policy via KL-regularized reinforcement learning. Empirically, $\nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark, while reducing number of model calls by approximately 10-40% compared to strong baselines. Overall, our work introduces a paradigm shift from zeroth-order search to first-order optimization at test time, offering a cost-effective path to amplify LLM reasoning.
[540] Uncertainty-aware Blood Glucose Prediction from Continuous Glucose Monitoring Data
Hai Siong Tan
Main category: cs.LG
TL;DR: Transformer models with evidential output heads provide best uncertainty-aware blood glucose prediction in Type 1 diabetes, achieving higher accuracy and better-calibrated uncertainty estimates.
Details
Motivation: To develop uncertainty-aware neural network models for blood glucose prediction and adverse glycemic event identification in Type 1 diabetes, integrating principled uncertainty quantification into real-time machine learning systems.Method: Investigated three families of sequence models (LSTM, GRU, Transformer) with uncertainty quantification enabled by either Monte Carlo dropout or evidential output layers compatible with Deep Evidential Regression, validated on the HUPA-UCM diabetes dataset.
Result: Transformer-based models with evidential output heads provided the most effective uncertainty-aware framework, achieving consistently higher predictive accuracies and better-calibrated uncertainty estimates whose magnitudes significantly correlated with prediction errors.
Conclusion: The study demonstrates the value of integrating principled uncertainty quantification into real-time machine-learning-based blood glucose prediction systems, with Transformer-evidential models showing superior performance.
Abstract: In this work, we investigate uncertainty-aware neural network models for blood glucose prediction and adverse glycemic event identification in Type 1 diabetes. We consider three families of sequence models based on LSTM, GRU, and Transformer architectures, with uncertainty quantification enabled by either Monte Carlo dropout or through evidential output layers compatible with Deep Evidential Regression. Using the HUPA-UCM diabetes dataset for validation, we find that Transformer-based models equipped with evidential output heads provide the most effective uncertainty-aware framework, achieving consistently higher predictive accuracies and better-calibrated uncertainty estimates whose magnitudes significantly correlate with prediction errors. We further evaluate the clinical risk of each model using the recently proposed Diabetes Technology Society error grid, with risk categories defined by international expert consensus. Our results demonstrate the value of integrating principled uncertainty quantification into real-time machine-learning-based blood glucose prediction systems.
[541] WaterSIC: information-theoretically (near) optimal linear layer quantization
Egor Lifar, Semyon Savkin, Or Ordentlich, Yury Polyanskiy
Main category: cs.LG
TL;DR: WaterSIC algorithm for quantizing dense linear layers achieves near-optimal compression with 0.255-bit gap to information-theoretic limit, outperforming GPTQ and setting new SOTA for 1-4 bit LLM quantization.
Details
Motivation: Current quantization methods like GPTQ have arbitrarily large gaps to information-theoretic limits, motivating development of more optimal quantization algorithms for compressing LLM weights while maintaining accuracy.Method: WaterSIC uses waterfilling-inspired rate allocation across weight matrix columns, analyzing compression length vs output discrepancy tradeoff information-theoretically and applying different quantization rates to different in-features.
Result: WaterSIC achieves within 0.255 bits of information-theoretic limit uniformly across all input activation covariance matrices, establishing new SOTA for Llama and Qwen family LLMs at 1-4 bit quantization rates.
Conclusion: WaterSIC provides near-optimal quantization for dense linear layers, significantly improving over existing methods and enabling more efficient LLM deployment through better compression-performance tradeoffs.
Abstract: This paper considers the problem of converting a given dense linear layer to low precision. The tradeoff between compressed length and output discrepancy is analyzed information theoretically (IT). It is shown that a popular GPTQ algorithm may have an arbitrarily large gap to the IT limit. To alleviate this problem, a novel algorithm, termed ‘‘WaterSIC’’, is proposed and is shown to be within a rate gap of 0.255 bits to the IT limit, uniformly over all possible covariance matrices of input activations. The key innovation of WaterSIC’s is to allocate different quantization rates to different columns (in-features) of the weight matrix, mimicking the classical IT solution known as ‘‘waterfilling’’. Applying WaterSIC to the Llama and Qwen family of LLMs establishes new state-of-the-art performance for all quantization rates from 1 to 4 bits.
[542] Mixture of Universal Experts: Scaling Virtual Width via Depth-Width Transformation
Yilong Chen, Naibin Gu, Junyuan Shang, Zhenyu Zhang, Yuchen Feng, Jiawei Sheng, Tingwen Liu, Shuohuan Wang, Yu Sun, Hua Wu, Haifeng Wang
Main category: cs.LG
TL;DR: MOUE introduces Virtual Width as a new scaling dimension for MoE architectures by reusing universal experts across layers, addressing routing explosion and load balancing challenges with specialized components.
Details
Motivation: Current Mixture-of-Experts (MoE) architectures are limited by physical dimensions of depth and width, restricting scalability despite decoupling model capacity from per-token computation. The authors aim to overcome these limitations by introducing a new scaling dimension.Method: Proposes Mixture of Universal Experts (MOUE) with three core components: 1) Staggered Rotational Topology for structured expert sharing across layers, 2) Universal Expert Load Balance for depth-aware exposure correction, and 3) Universal Router with lightweight trajectory state for coherent multi-step routing.
Result: MOUE consistently outperforms matched MoE baselines by up to 1.3% across scaling regimes, enables progressive conversion of existing MoE checkpoints with up to 4.2% gains, and reveals a new scaling dimension for MoE architectures.
Conclusion: MOUE successfully introduces Virtual Width as a novel scaling dimension for MoE architectures, overcoming limitations of traditional depth and width scaling through universal expert reuse and addressing associated routing and load balancing challenges.
Abstract: Mixture-of-Experts (MoE) decouples model capacity from per-token computation, yet their scalability remains limited by the physical dimensions of depth and width. To overcome this, we propose Mixture of Universal Experts (MOUE),a MoE generalization introducing a novel scaling dimension: Virtual Width. In general, MoUE aims to reuse a universal layer-agnostic expert pool across layers, converting depth into virtual width under a fixed per-token activation budget. However, two challenges remain: a routing path explosion from recursive expert reuse, and a mismatch between the exposure induced by reuse and the conventional load-balancing objectives. We address these with three core components: a Staggered Rotational Topology for structured expert sharing, a Universal Expert Load Balance for depth-aware exposure correction, and a Universal Router with lightweight trajectory state for coherent multi-step routing. Empirically, MoUE consistently outperforms matched MoE baselines by up to 1.3% across scaling regimes, enables progressive conversion of existing MoE checkpoints with up to 4.2% gains, and reveals a new scaling dimension for MoE architectures.
[543] Functionality-Oriented LLM Merging on the Fisher–Rao Manifold
Jiayu Wang, Zuojun Ye, Wenpeng Yin
Main category: cs.LG
TL;DR: Proposes Fisher-Rao manifold-based model merging using Karcher mean to address limitations of Euclidean parameter-space methods, preventing representation collapse and enabling principled multi-expert merging.
Details
Motivation: Current weight-space merging methods are parameter-space heuristics with three key limitations: 1) they operate on Euclidean coordinates rather than focusing on predictive behaviors, 2) they suffer from representation collapse when merging heterogeneous models, and 3) they don't extend cleanly to merging multiple experts.Method: Formulates model merging as computing a weighted Karcher mean on the Fisher-Rao manifold, which minimizes KL-based function distance between predictive distributions. Derives a practical fixed-point algorithm using a lightweight spherical proxy that preserves norms and generalizes to multi-expert merging.
Result: The method remains stable as the number and heterogeneity of merged models increase, consistently outperforming prior baselines across various benchmarks and collapse diagnostics.
Conclusion: Fisher-Rao manifold-based merging provides a principled geometric approach that addresses fundamental limitations of Euclidean methods, preventing representation collapse and enabling effective multi-expert model fusion.
Abstract: Weight-space merging aims to combine multiple fine-tuned LLMs into a single model without retraining, yet most existing approaches remain fundamentally parameter-space heuristics. This creates three practical limitations. First, linear averaging, task vectors, and related rules operate on Euclidean coordinates, even though the desired goal is to merge functionality, i.e., predictive behaviors across tasks. Second, when the source checkpoints are farther apart or more heterogeneous, Euclidean blends often trigger representation collapse, manifested as activation variance shrinkage and effective-rank degradation, which sharply degrades accuracy. Third, many geometry-inspired methods are most natural for two-model interpolation and do not extend cleanly to merging N>2 experts with a principled objective. We address these issues by formulating model merging as computing a weighted Karcher mean on the Fisher–Rao manifold, which is locally equivalent to minimizing a KL-based function distance between predictive distributions. We derive a practical fixed-point algorithm using a lightweight spherical proxy that preserves norms and generalizes directly to multi-expert merging. Across various benchmarks and collapse diagnostics, our method remains stable as the number and heterogeneity of merged models increase, consistently outperforming prior baselines.
[544] Lightweight and Scalable Transfer Learning Framework for Load Disaggregation
L. E. Garcia-Marrero, G. Petrone, E. Monmasson
Main category: cs.LG
TL;DR: RefQuery is a scalable multi-appliance NILM framework that uses compact appliance fingerprints to enable one shared model to serve many appliances without fixed output sets, achieving efficient edge deployment.
Details
Motivation: Cross-domain generalization in NILM remains challenging due to appliance variations across homes. Existing transfer learning methods lack flexibility for evolving real-world deployments, are unsuitable for edge devices, or scale poorly for real-time operation.Method: Proposes RefQuery framework that conditions disaggregation on compact appliance fingerprints, keeping a pretrained disaggregation network frozen and learning only per-appliance embeddings during lightweight backpropagation.
Result: Experiments on three public datasets show RefQuery delivers strong accuracy-efficiency trade-off against single-appliance and multi-appliance baselines, including modern Transformer-based methods.
Conclusion: RefQuery provides a practical path toward scalable, real-time NILM on resource-constrained edge devices by enabling flexible, efficient appliance disaggregation.
Abstract: Non-Intrusive Load Monitoring (NILM) aims to estimate appliance-level consumption from aggregate electrical signals recorded at a single measurement point. In recent years, the field has increasingly adopted deep learning approaches; however, cross-domain generalization remains a persistent challenge due to variations in appliance characteristics, usage patterns, and background loads across homes. Transfer learning provides a practical paradigm to adapt models with limited target data. However, existing methods often assume a fixed appliance set, lack flexibility for evolving real-world deployments, remain unsuitable for edge devices, or scale poorly for real-time operation. This paper proposes RefQuery, a scalable multi-appliance, multi-task NILM framework that conditions disaggregation on compact appliance fingerprints, allowing one shared model to serve many appliances without a fixed output set. RefQuery keeps a pretrained disaggregation network fully frozen and adapts to a target home by learning only a per-appliance embedding during a lightweight backpropagation stage. Experiments on three public datasets demonstrate that RefQuery delivers a strong accuracy-efficiency trade-off against single-appliance and multi-appliance baselines, including modern Transformer-based methods. These results support RefQuery as a practical path toward scalable, real-time NILM on resource-constrained edge devices.
[545] Non-Euclidean Gradient Descent Operates at the Edge of Stability
Rustem Islamov, Michael Crawshaw, Jeremy Cohen, Robert Gower
Main category: cs.LG
TL;DR: The paper extends the Edge of Stability phenomenon to non-Euclidean norms, providing a generalized sharpness measure that works across different optimizers including ℓ∞-descent, Block CD, and Spectral GD.
Details
Motivation: The Edge of Stability phenomenon has been widely observed in deep learning but lacks complete theoretical foundations, especially for non-Euclidean optimization methods. The authors aim to provide a unified framework that explains EoS across different optimization geometries.Method: The authors use Directional Smoothness to interpret EoS and extend it to non-Euclidean norms. They define generalized sharpness under arbitrary norms and apply this framework to various optimization methods including ℓ∞-descent, Block Coordinate Descent, Spectral GD, and Muon without momentum.
Result: Experiments on neural networks show that non-Euclidean gradient descent with the generalized sharpness measure also exhibits progressive sharpening followed by oscillations around or above the threshold 2/η, similar to the classical EoS phenomenon.
Conclusion: The framework provides a single, geometry-aware spectral measure that works across different optimizers, unifying the understanding of Edge of Stability phenomena in various optimization settings beyond Euclidean gradient descent.
Abstract: The Edge of Stability (EoS) is a phenomenon where the sharpness (largest eigenvalue) of the Hessian converges to $2/η$ during training with gradient descent (GD) with a step-size $η$. Despite (apparently) violating classical smoothness assumptions, EoS has been widely observed in deep learning, but its theoretical foundations remain incomplete. We provide an interpretation of EoS through the lens of Directional Smoothness Mishkin et al. [2024]. This interpretation naturally extends to non-Euclidean norms, which we use to define generalized sharpness under an arbitrary norm. Our generalized sharpness measure includes previously studied vanilla GD and preconditioned GD as special cases, as well as methods for which EoS has not been studied, such as $\ell_{\infty}$-descent, Block CD, Spectral GD, and Muon without momentum. Through experiments on neural networks, we show that non-Euclidean GD with our generalized sharpness also exhibits progressive sharpening followed by oscillations around or above the threshold $2/η$. Practically, our framework provides a single, geometry-aware spectral measure that works across optimizers.
[546] Poisoning the Inner Prediction Logic of Graph Neural Networks for Clean-Label Backdoor Attacks
Yuxiang Zhang, Bin Ma, Enyan Dai
Main category: cs.LG
TL;DR: BA-Logic: A clean-label graph backdoor attack method that poisons GNN prediction logic without modifying training labels, achieving high attack success rates in realistic scenarios.
Details
Motivation: Existing graph backdoor attacks require modifying training labels, which is impractical in real-world scenarios. Clean-label attacks (where labels remain unchanged) are understudied and existing methods fail in this setting because they don't effectively poison GNNs' internal prediction logic.Method: BA-Logic coordinates a poisoned node selector and a logic-poisoning trigger generator to poison the inner prediction logic of GNN models. It identifies vulnerable nodes and generates triggers that alter the model’s reasoning patterns without changing training labels.
Result: Extensive experiments on real-world datasets show BA-Logic significantly enhances attack success rates and outperforms state-of-the-art graph backdoor attack methods under clean-label settings.
Conclusion: The paper demonstrates that effective clean-label graph backdoor attacks are possible by poisoning GNNs’ internal prediction logic, addressing a realistic but previously understudied attack scenario.
Abstract: Graph Neural Networks (GNNs) have achieved remarkable results in various tasks. Recent studies reveal that graph backdoor attacks can poison the GNN model to predict test nodes with triggers attached as the target class. However, apart from injecting triggers to training nodes, these graph backdoor attacks generally require altering the labels of trigger-attached training nodes into the target class, which is impractical in real-world scenarios. In this work, we focus on the clean-label graph backdoor attack, a realistic but understudied topic where training labels are not modifiable. According to our preliminary analysis, existing graph backdoor attacks generally fail under the clean-label setting. Our further analysis identifies that the core failure of existing methods lies in their inability to poison the prediction logic of GNN models, leading to the triggers being deemed unimportant for prediction. Therefore, we study a novel problem of effective clean-label graph backdoor attacks by poisoning the inner prediction logic of GNN models. We propose BA-Logic to solve the problem by coordinating a poisoned node selector and a logic-poisoning trigger generator. Extensive experiments on real-world datasets demonstrate that our method effectively enhances the attack success rate and surpasses state-of-the-art graph backdoor attack competitors under clean-label settings. Our code is available at https://anonymous.4open.science/r/BA-Logic
[547] MCEL: Margin-Based Cross-Entropy Loss for Error-Tolerant Quantized Neural Networks
Mikail Yayla, Akash Kumar
Main category: cs.LG
TL;DR: Margin Cross-Entropy Loss (MCEL) improves neural network robustness to bit errors without error injection training, by promoting logit-level margin separation.
Details
Motivation: Current approaches for bit error tolerance in neural networks rely on error injection during training, which has computational overhead, degrades accuracy at high error rates, and doesn't scale well for larger architectures. There's a need for more efficient and scalable methods for reliable deployment on approximate computing platforms and error-prone memory technologies.Method: The paper establishes a connection between bit error tolerance and classification margins at the output layer. It proposes Margin Cross-Entropy Loss (MCEL), which explicitly promotes logit-level margin separation while preserving the optimization properties of standard cross-entropy loss. MCEL includes an interpretable margin parameter for principled robustness tuning.
Result: Extensive experiments across multiple datasets, diverse neural network architectures, and various quantization schemes show that MCEL substantially improves bit error tolerance, achieving up to 15% accuracy improvement for 1% error rate. The method is simple to implement and can be used as a drop-in replacement for standard cross-entropy loss.
Conclusion: MCEL provides a scalable and principled alternative to training-time bit flip injection, offering insights into neural network robustness origins and enabling more efficient deployment on approximate computing and memory systems without the computational overhead of error injection training.
Abstract: Robustness to bit errors is a key requirement for the reliable use of neural networks (NNs) on emerging approximate computing platforms and error-prone memory technologies. A common approach to achieve bit error tolerance in NNs is injecting bit flips during training according to a predefined error model. While effective in certain scenarios, training-time bit flip injection introduces substantial computational overhead, often degrades inference accuracy at high error rates, and scales poorly for larger NN architectures. These limitations make error injection an increasingly impractical solution for ensuring robustness on future approximate computing platforms and error-prone memory technologies. In this work, we investigate the mechanisms that enable NNs to tolerate bit errors without relying on error-aware training. We establish a direct connection between bit error tolerance and classification margins at the output layer. Building on this insight, we propose a novel loss function, the Margin Cross-Entropy Loss (MCEL), which explicitly promotes logit-level margin separation while preserving the favorable optimization properties of the standard cross-entropy loss. Furthermore, MCEL introduces an interpretable margin parameter that allows robustness to be tuned in a principled manner. Extensive experimental evaluations across multiple datasets of varying complexity, diverse NN architectures, and a range of quantization schemes demonstrate that MCEL substantially improves bit error tolerance, up to 15 % in accuracy for an error rate of 1 %. Our proposed MCEL method is simple to implement, efficient, and can be integrated as a drop-in replacement for standard CEL. It provides a scalable and principled alternative to training-time bit flip injection, offering new insights into the origins of NN robustness and enabling more efficient deployment on approximate computing and memory systems.
[548] Asymptotic Behavior of Multi–Task Learning: Implicit Regularization and Double Descent Effects
Ayed M. Alrashdi, Oussama Dhifallah, Houssem Sifaou
Main category: cs.LG
TL;DR: Theoretical analysis of multi-task learning showing combining related tasks is asymptotically equivalent to single-task learning with beneficial regularization terms that improve generalization and mitigate double descent.
Details
Motivation: To understand why multi-task learning improves generalization by leveraging common information between related tasks, and to provide precise asymptotic analysis of this phenomenon.Method: Provides precise asymptotic analysis of misspecified perceptron learning models in multi-task settings, showing equivalence to single-task formulations with additional regularization terms.
Result: Combining multiple tasks is asymptotically equivalent to traditional formulations with beneficial regularization that improves generalization and postpones/mitigates double descent phenomenon.
Conclusion: Multi-task learning provides regularization benefits that enhance generalization performance and help mitigate overfitting phenomena like double descent.
Abstract: Multi–task learning seeks to improve the generalization error by leveraging the common information shared by multiple related tasks. One challenge in multi–task learning is identifying formulations capable of uncovering the common information shared between different but related tasks. This paper provides a precise asymptotic analysis of a popular multi–task formulation associated with misspecified perceptron learning models. The main contribution of this paper is to precisely determine the reasons behind the benefits gained from combining multiple related tasks. Specifically, we show that combining multiple tasks is asymptotically equivalent to a traditional formulation with additional regularization terms that help improve the generalization performance. Another contribution is to empirically study the impact of combining tasks on the generalization error. In particular, we empirically show that the combination of multiple tasks postpones the double descent phenomenon and can mitigate it asymptotically.
[549] Deep Learning-Driven Friendly Jamming for Secure Multicarrier ISAC Under Channel Uncertainty
Bui Minh Tuan, Van-Dinh Nguyen, Diep N. Nguyen, Nguyen Linh Trung, Nguyen Van Huynh, Dinh Thai Hoang, Marwan Krunz, Eryk Dutkiewicz
Main category: cs.LG
TL;DR: Deep learning framework for physical-layer security in multicarrier ISAC systems using radar echo feedback for directional jamming without requiring eavesdropper information.
Details
Motivation: To address security challenges in ISAC systems under imperfect CSI and unknown eavesdropper locations, overcoming limitations of conventional friendly jamming approaches that require precise eavesdropper information.Method: Proposes radar-aware neural network that jointly optimizes beamforming and jamming using nonparametric FIM estimator based on f-divergence, with quantized tensor train-based encoder for model compression and non-overlapping secure scheme for dedicated communication sub-bands.
Result: Achieves significant improvements in secrecy rate, reduced BLER, strong robustness against CSI uncertainty and angular estimation errors, with model size reduction >100x and negligible performance loss.
Conclusion: The deep learning-driven friendly jamming framework effectively enhances physical-layer security in practical ISAC systems with impairments, demonstrating robustness and efficiency.
Abstract: Integrated sensing and communication (ISAC) systems promise efficient spectrum utilization by jointly supporting radar sensing and wireless communication. This paper presents a deep learning-driven framework for enhancing physical-layer security in multicarrier ISAC systems under imperfect channel state information (CSI) and in the presence of unknown eavesdropper (Eve) locations. Unlike conventional ISAC-based friendly jamming (FJ) approaches that require Eve’s CSI or precise angle-of-arrival (AoA) estimates, our method exploits radar echo feedback to guide directional jamming without explicit Eve’s information. To enhance robustness to radar sensing uncertainty, we propose a radar-aware neural network that jointly optimizes beamforming and jamming by integrating a novel nonparametric Fisher Information Matrix (FIM) estimator based on f-divergence. The jamming design satisfies the Cramer-Rao lower bound (CRLB) constraints even in the presence of noisy AoA. For efficient implementation, we introduce a quantized tensor train-based encoder that reduces the model size by more than 100 times with negligible performance loss. We also integrate a non-overlapping secure scheme into the proposed framework, in which specific sub-bands can be dedicated solely to communication. Extensive simulations demonstrate that the proposed solution achieves significant improvements in secrecy rate, reduced block error rate (BLER), and strong robustness against CSI uncertainty and angular estimation errors, underscoring the effectiveness of the proposed deep learning-driven friendly jamming framework under practical ISAC impairments.
[550] Reward-Conditioned Reinforcement Learning
Michal Nauman, Marek Cygan, Pieter Abbeel
Main category: cs.LG
TL;DR: RCRL trains a single RL agent to optimize multiple reward functions using reward conditioning and off-policy learning from shared replay data, enabling adaptation to different reward specifications.
Details
Motivation: Traditional RL agents are brittle to reward misspecification and cannot adapt to changing task preferences because they're trained on a single fixed reward function. This limits their robustness and flexibility.Method: Reward-Conditioned Reinforcement Learning (RCRL) conditions agents on reward parameterizations and learns multiple reward objectives from shared replay data entirely off-policy, enabling a single policy to represent reward-specific behaviors.
Result: RCRL improves performance under nominal reward parameterization and enables efficient adaptation to new parameterizations across single-task, multi-task, and vision-based benchmarks.
Conclusion: RCRL provides a scalable mechanism for learning robust, steerable policies without sacrificing the simplicity of single-task training, addressing reward misspecification and adaptation challenges.
Abstract: RL agents are typically trained under a single, fixed reward function, which makes them brittle to reward misspecification and limits their ability to adapt to changing task preferences. We introduce Reward-Conditioned Reinforcement Learning (RCRL), a framework that trains a single agent to optimize a family of reward specifications while collecting experience under only one nominal objective. RCRL conditions the agent on reward parameterizations and learns multiple reward objectives from a shared replay data entirely off-policy, enabling a single policy to represent reward-specific behaviors. Across single-task, multi-task, and vision-based benchmarks, we show that RCRL not only improves performance under the nominal reward parameterization, but also enables efficient adaptation to new parameterizations. Our results demonstrate that RCRL provides a scalable mechanism for learning robust, steerable policies without sacrificing the simplicity of single-task training.
[551] Synchronization-based clustering on the unit hypersphere
Zinaid Kapić, Aladin Crnkić, Goran Mauša
Main category: cs.LG
TL;DR: Novel clustering algorithm for unit hypersphere data using generalized Kuramoto model, achieving comparable or better accuracy than traditional methods.
Details
Motivation: Traditional clustering methods fail to account for the geometric structure of unit sphere data, which is important in applications like gene expression analysis, text classification, and image classification where data naturally lies on the unit hypersphere.Method: Develops a clustering algorithm based on the d-dimensional generalized Kuramoto model, which is specifically designed to handle data points on the unit sphere S^{d-1}.
Result: Demonstrated effectiveness on both synthetic and real-world datasets, showing that the method achieves similar or better clustering accuracy compared to traditional clustering methods.
Conclusion: The proposed Kuramoto model-based clustering algorithm provides an effective approach for clustering unit sphere data by properly accounting for its geometric structure.
Abstract: Clustering on the unit hypersphere is a fundamental problem in various fields, with applications ranging from gene expression analysis to text and image classification. Traditional clustering methods are not always suitable for unit sphere data, as they do not account for the geometric structure of the sphere. We introduce a novel algorithm for clustering data represented as points on the unit sphere $\mathbf{S}^{d-1}$. Our method is based on the $d$-dimensional generalized Kuramoto model. The effectiveness of the introduced method is demonstrated on synthetic and real-world datasets. Results are compared with some of the traditional clustering methods, showing that our method achieves similar or better results in terms of clustering accuracy.
[552] Aura: Universal Multi-dimensional Exogenous Integration for Aviation Time Series
Jiafeng Lin, Mengren Zheng, Simeng Ye, Yuxuan Wang, Huan Zhang, Yuhui Liu, Zhongyi Pei, Jianmin Wang
Main category: cs.LG
TL;DR: Aura is a universal framework for time series forecasting that integrates heterogeneous multimodal exogenous factors by explicitly organizing and encoding them based on their interaction modes with target time series.
Details
Motivation: Practical time series forecasting requires integrating diverse exogenous factors beyond numerical data, which are often multi-dimensional or multimodal with heterogeneous interactions that unimodal models struggle to capture.Method: Proposes Aura framework with tailored tripartite encoding mechanism to embed heterogeneous features into established time series models based on three distinct interaction modes identified in aviation maintenance scenarios.
Result: Extensive experiments on large-scale industrial dataset from China Southern Airlines (Boeing 777 and Airbus A320 fleets) show Aura achieves state-of-the-art performance across all baselines with superior adaptability.
Conclusion: Aura demonstrates potential as general-purpose enhancement for aviation safety and reliability by effectively integrating multimodal exogenous information into time series forecasting.
Abstract: Time series forecasting has witnessed an increasing demand across diverse industrial applications, where accurate predictions are pivotal for informed decision-making. Beyond numerical time series data, reliable forecasting in practical scenarios requires integrating diverse exogenous factors. Such exogenous information is often multi-dimensional or even multimodal, introducing heterogeneous interactions that unimodal time series models struggle to capture. In this paper, we delve into an aviation maintenance scenario and identify three distinct types of exogenous factors that influence temporal dynamics through distinct interaction modes. Based on this empirical insight, we propose Aura, a universal framework that explicitly organizes and encodes heterogeneous external information according to its interaction mode with the target time series. Specifically, Aura utilizes a tailored tripartite encoding mechanism to embed heterogeneous features into well-established time series models, ensuring seamless integration of non-sequential context. Extensive experiments on a large-scale, three-year industrial dataset from China Southern Airlines, covering the Boeing 777 and Airbus A320 fleets, demonstrate that Aura consistently achieves state-of-the-art performance across all baselines and exhibits superior adaptability. Our findings highlight Aura’s potential as a general-purpose enhancement for aviation safety and reliability.
[553] Axiomatic On-Manifold Shapley via Optimal Generative Flows
Cenwei Zhang, Lin Zhu, Manxi Lin, Lei You
Main category: cs.LG
TL;DR: Proposes a theory of on-manifold Aumann-Shapley attributions using optimal generative flows to address off-manifold artifacts in XAI, with Wasserstein-2 geodesics for canonical attributions.
Details
Motivation: Shapley-based attribution methods suffer from off-manifold artifacts due to heuristic baseline selection, and existing generative approaches have geometric inefficiency and discretization drift problems.Method: Develops formal theory of on-manifold Aumann-Shapley attributions using optimal generative flows, proves representation theorem establishing gradient line integral as unique solution, selects kinetic-energy-minimizing Wasserstein-2 geodesic to resolve path ambiguity.
Result: Method outperforms baselines with strict manifold adherence (vanishing Flow Consistency Error) and superior semantic alignment (Structure-Aware Total Variation), recovers classical Shapley for additive models, provides stability bounds against flow approximation errors.
Conclusion: Reframing baseline selection as variational problem yields canonical attribution family with theoretical guarantees and practical improvements for post-hoc XAI.
Abstract: Shapley-based attribution is critical for post-hoc XAI but suffers from off-manifold artifacts due to heuristic baselines. While generative methods attempt to address this, they often introduce geometric inefficiency and discretization drift. We propose a formal theory of on-manifold Aumann-Shapley attributions driven by optimal generative flows. We prove a representation theorem establishing the gradient line integral as the unique functional satisfying efficiency and geometric axioms, notably reparameterization invariance. To resolve path ambiguity, we select the kinetic-energy-minimizing Wasserstein-2 geodesic transporting a prior to the data distribution. This yields a canonical attribution family that recovers classical Shapley for additive models and admits provable stability bounds against flow approximation errors. By reframing baseline selection as a variational problem, our method experimentally outperforms baselines, achieving strict manifold adherence via vanishing Flow Consistency Error and superior semantic alignment characterized by Structure-Aware Total Variation. Our code is on https://github.com/cenweizhang/OTFlowSHAP.
[554] Decoupling Task and Behavior: A Two-Stage Reward Curriculum in Reinforcement Learning for Robotics
Kilian Freitag, Knut Åkesson, Morteza Haghir Chehreghani
Main category: cs.LG
TL;DR: Two-stage reward curriculum decouples task objectives from behavioral terms, first training on simplified task-only reward before introducing full reward with auxiliary behavioral objectives like energy efficiency.
Details
Motivation: Deep Reinforcement Learning for robotic control faces challenges in designing effective reward functions, especially for multi-objective tasks requiring precise weight tuning between task-specific objectives and behavioral terms like energy efficiency.Method: Proposes a two-stage reward curriculum: 1) Train agent on simplified task-only reward function for effective exploration, 2) Introduce full reward including auxiliary behavior-related terms. Analyzes transition strategies and emphasizes reusing samples between phases for training stability.
Result: Validated on DeepMind Control Suite, ManiSkill3, and mobile robot environments with auxiliary behavioral objectives. Method substantially outperforms baselines trained directly on full reward and exhibits higher robustness to specific reward weightings.
Conclusion: The proposed two-stage reward curriculum is simple yet effective for robotic control tasks with multiple objectives, addressing reward design challenges and improving training stability and performance.
Abstract: Deep Reinforcement Learning is a promising tool for robotic control, yet practical application is often hindered by the difficulty of designing effective reward functions. Real-world tasks typically require optimizing multiple objectives simultaneously, necessitating precise tuning of their weights to learn a policy with the desired characteristics. To address this, we propose a two-stage reward curriculum where we decouple task-specific objectives from behavioral terms. In our method, we first train the agent on a simplified task-only reward function to ensure effective exploration before introducing the full reward that includes auxiliary behavior-related terms such as energy efficiency. Further, we analyze various transition strategies and demonstrate that reusing samples between phases is critical for training stability. We validate our approach on the DeepMind Control Suite, ManiSkill3, and a mobile robot environment, modified to include auxiliary behavioral objectives. Our method proves to be simple yet effective, substantially outperforming baselines trained directly on the full reward while exhibiting higher robustness to specific reward weightings.
[555] FedBCD:Communication-Efficient Accelerated Block Coordinate Gradient Descent for Federated Learning
Junkang Liu, Fanhua Shang, Yuanyuan Liu, Hongying Liu, Yuangang Li, YunXiang Gong
Main category: cs.LG
TL;DR: FedBCGD reduces communication overhead in federated learning for large models like Vision Transformers by splitting parameters into blocks and having clients upload only specific blocks instead of full models.
Details
Motivation: Federated learning faces high communication overhead when training large-scale models like Vision Transformers, which is costly for each communication round. There's a need to reduce communication complexity while maintaining model performance.Method: Proposes Federated Block Coordinate Gradient Descent (FedBCGD) that splits model parameters into blocks (including a shared block) and enables each client to upload only specific parameter blocks. Also develops accelerated version FedBCGD+ with client drift control and stochastic variance reduction.
Result: Theoretical analysis shows communication complexities are reduced by factor 1/N (where N is number of blocks) compared to existing methods, with faster convergence. Empirical results demonstrate superiority over state-of-the-art federated learning algorithms.
Conclusion: FedBCGD and FedBCGD+ provide efficient communication solutions for federated learning of large models, significantly reducing overhead while maintaining convergence properties, making them practical for real-world applications.
Abstract: Although Federated Learning has been widely studied in recent years, there are still high overhead expenses in each communication round for large-scale models such as Vision Transformer. To lower the communication complexity, we propose a novel Federated Block Coordinate Gradient Descent (FedBCGD) method for communication efficiency. The proposed method splits model parameters into several blocks, including a shared block and enables uploading a specific parameter block by each client, which can significantly reduce communication overhead. Moreover, we also develop an accelerated FedBCGD algorithm (called FedBCGD+) with client drift control and stochastic variance reduction. To the best of our knowledge, this paper is the first work on parameter block communication for training large-scale deep models. We also provide the convergence analysis for the proposed algorithms. Our theoretical results show that the communication complexities of our algorithms are a factor $1/N$ lower than those of existing methods, where $N$ is the number of parameter blocks, and they enjoy much faster convergence than their counterparts. Empirical results indicate the superiority of the proposed algorithms compared to state-of-the-art algorithms. The code is available at https://github.com/junkangLiu0/FedBCGD.
[556] Federated Causal Discovery Across Heterogeneous Datasets under Latent Confounding
Maximilian Hahn, Alina Zajak, Dominik Heider, Adèle Helena Ribeiro
Main category: cs.LG
TL;DR: Federated conditional independence test (fedCI) for privacy-preserving causal discovery across heterogeneous datasets with different variables, enabling federated causal discovery under latent confounding via fedCI-IOD algorithm.
Details
Motivation: Causal discovery across multiple datasets faces challenges from data privacy regulations and cross-site heterogeneity, preventing use of conventional centralized methods that require pooling all data together.Method: Developed fedCI using federated Iteratively Reweighted Least Squares (IRLS) to estimate parameters of generalized linear models for likelihood-ratio tests of conditional independence. Extended this to fedCI-IOD, a federated version of Integration of Overlapping Datasets algorithm that replaces meta-analysis with federated aggregation.
Result: fedCI-IOD preserves privacy while achieving performance comparable to fully pooled analyses, substantially enhancing statistical power and mitigating artifacts from low local sample sizes. Tools available as Python package, R implementation, and web application.
Conclusion: Provides privacy-preserving solutions for federated conditional independence testing and causal discovery across distributed, heterogeneous datasets with different variable sets and mixed data types.
Abstract: Causal discovery across multiple datasets is often constrained by data privacy regulations and cross-site heterogeneity, limiting the use of conventional methods that require a single, centralized dataset. To address these challenges, we introduce fedCI, a federated conditional independence test that rigorously handles heterogeneous datasets with non-identical sets of variables, site-specific effects, and mixed variable types, including continuous, ordinal, binary, and categorical variables. At its core, fedCI uses a federated Iteratively Reweighted Least Squares (IRLS) procedure to estimate the parameters of generalized linear models underlying likelihood-ratio tests for conditional independence. Building on this, we develop fedCI-IOD, a federated extension of the Integration of Overlapping Datasets (IOD) algorithm, that replaces its meta-analysis strategy and enables, for the fist time, federated causal discovery under latent confounding across distributed and heterogeneous datasets. By aggregating evidence federatively, fedCI-IOD not only preserves privacy but also substantially enhances statistical power, achieving performance comparable to fully pooled analyses and mitigating artifacts from low local sample sizes. Our tools are publicly available as the fedCI Python package, a privacy-preserving R implementation of IOD, and a web application for the fedCI-IOD pipeline, providing versatile, user-friendly solutions for federated conditional independence testing and causal discovery.
[557] Balancing Privacy-Quality-Efficiency in Federated Learning through Round-Based Interleaving of Protection Techniques
Yenan Wang, Carla Fabiana Chiasserini, Elad Michael Schiller
Main category: cs.LG
TL;DR: Alt-FL: A federated learning framework combining differential privacy, homomorphic encryption, and synthetic data via round-based interleaving to balance privacy, quality, and efficiency.
Details
Motivation: Current privacy mechanisms in federated learning either degrade learning quality (DP) or incur substantial system overhead (HE), creating a need for flexible solutions that balance privacy protection, learning quality, and efficiency.Method: Proposes Alt-FL with three interleaving methods: Privacy Interleaving (PI), Synthetic Interleaving with DP (SI/DP), and Synthetic Interleaving with HE (SI/HE). Uses round-based strategy to combine DP, HE, and synthetic data. Evaluated against reconstruction attacks using LeNet-5 on CIFAR-10 and Fashion-MNIST with new attacker-centric framework.
Result: PI achieves most balanced trade-offs at high privacy protection levels, while DP-based methods are preferable at intermediate privacy requirements. The framework enables selection of privacy-preserving FL methods under varying constraints.
Conclusion: Alt-FL provides flexible quality-efficiency trade-offs with privacy protection, offering practical solutions for federated learning scenarios with different privacy and resource constraints.
Abstract: In federated learning (FL), balancing privacy protection, learning quality, and efficiency remains a challenge. Privacy protection mechanisms, such as Differential Privacy (DP), degrade learning quality, or, as in the case of Homomorphic Encryption (HE), incur substantial system overhead. To address this, we propose Alt-FL, a privacy-preserving FL framework that combines DP, HE, and synthetic data via a novel round-based interleaving strategy. Alt-FL introduces three new methods, Privacy Interleaving (PI), Synthetic Interleaving with DP (SI/DP), and Synthetic Interleaving with HE (SI/HE), that enable flexible quality-efficiency trade-offs while providing privacy protection. We systematically evaluate Alt-FL against representative reconstruction attacks, including Deep Leakage from Gradients, Inverting Gradients, When the Curious Abandon Honesty, and Robbing the Fed, using a LeNet-5 model on CIFAR-10 and Fashion-MNIST. To enable fair comparison between DP- and HE-based defenses, we introduce a new attacker-centric framework that compares empirical attack success rates across the three proposed interleaving methods. Our results show that, for the studied attacker model and dataset, PI achieves the most balanced trade-offs at high privacy protection levels, while DP-based methods are preferable at intermediate privacy requirements. We also discuss how such results can be the basis for selecting privacy-preserving FL methods under varying privacy and resource constraints.
[558] Trainable Bitwise Soft Quantization for Input Feature Compression
Karsten Schrödter, Jan Stenkamp, Nina Herrmann, Fabian Gieseke
Main category: cs.LG
TL;DR: Trainable feature quantization layer for neural networks that compresses input features to reduce data transmission from edge devices to remote servers.
Details
Motivation: Address the challenge of limited compute/memory resources in IoT edge devices by reducing data transmission needs while maintaining model performance.Method: Proposes a task-specific trainable quantization layer using sigmoid functions to approximate step functions, enabling trainable quantization thresholds and bitwise soft quantization.
Result: Outperforms standard quantization methods, achieves 5-16× compression compared to 32-bit inputs with minimal accuracy loss across datasets.
Conclusion: The trainable quantization layer effectively reduces data transmission needs for edge devices while maintaining model accuracy, enabling more efficient IoT applications.
Abstract: The growing demand for machine learning applications in the context of the Internet of Things calls for new approaches to optimize the use of limited compute and memory resources. Despite significant progress that has been made w.r.t. reducing model sizes and improving efficiency, many applications still require remote servers to provide the required resources. However, such approaches rely on transmitting data from edge devices to remote servers, which may not always be feasible due to bandwidth, latency, or energy constraints. We propose a task-specific, trainable feature quantization layer that compresses the input features of a neural network. This can significantly reduce the amount of data that needs to be transferred from the device to a remote server. In particular, the layer allows each input feature to be quantized to a user-defined number of bits, enabling a simple on-device compression at the time of data collection. The layer is designed to approximate step functions with sigmoids, enabling trainable quantization thresholds. By concatenating outputs from multiple sigmoids, introduced as bitwise soft quantization, it achieves trainable quantized values when integrated with a neural network. We compare our method to full-precision inference as well as to several quantization baselines. Experiments show that our approach outperforms standard quantization methods, while maintaining accuracy levels close to those of full-precision models. In particular, depending on the dataset, compression factors of $5\times$ to $16\times$ can be achieved compared to $32$-bit input without significant performance loss.
[559] Incentive Aware AI Regulations: A Credal Characterisation
Anurag Singh, Julian Rodemann, Rajeev Verma, Siu Lun Chau, Krikamol Muandet
Main category: cs.LG
TL;DR: AI regulation as mechanism design problem where providers bet on their models’ compliance through license selection, with perfect market outcomes achieved when non-compliant distributions form credal sets.
Details
Motivation: Address the challenge of strategic ML providers evading regulations to lower costs by framing AI regulation as a mechanism design problem under uncertainty, aiming to create enforceable regulations that ensure compliant providers participate while non-compliant ones self-exclude.Method: Introduce regulation mechanisms framework mapping empirical evidence to market share licenses, where providers select licenses betting on their models’ regulatory compliance. Prove perfect market outcomes occur when non-compliant distributions form credal sets (closed, convex probability measure sets), establishing duality between regulation mechanisms and non-compliant distributions.
Result: Theoretical proof that regulation mechanisms achieve perfect market outcomes if and only if non-compliant distributions form credal sets. Experimental demonstration on regulating spurious features for prediction and fairness shows practical applicability.
Conclusion: The framework connects mechanism design and imprecise probability theory, providing foundations for enforceable AI regulations that can drive compliance through market mechanisms rather than direct enforcement.
Abstract: While high-stakes ML applications demand strict regulations, strategic ML providers often evade them to lower development costs. To address this challenge, we cast AI regulation as a mechanism design problem under uncertainty and introduce regulation mechanisms: a framework that maps empirical evidence from models to a license for some market share. The providers can select from a set of licenses, effectively forcing them to bet on their model’s ability to fulfil regulation. We aim at regulation mechanisms that achieve perfect market outcome, i.e. (a) drive non-compliant providers to self-exclude, and (b) ensure participation from compliant providers. We prove that a mechanism has perfect market outcome if and only if the set of non-compliant distributions forms a credal set, i.e., a closed, convex set of probability measures. This result connects mechanism design and imprecise probability by establishing a duality between regulation mechanisms and the set of non-compliant distributions. We also demonstrate these mechanisms in practice via experiments on regulating use of spurious features for prediction and fairness. Our framework provides new insights at the intersection of mechanism design and imprecise probability, offering a foundation for development of enforceable AI regulations.
[560] Towards a data-scale independent regulariser for robust sparse identification of non-linear dynamics
Jay Raut, Daniel N. Wilke, Stephan Schmidt
Main category: cs.LG
TL;DR: STCV is a new sparse regression algorithm for system identification that’s robust to data normalization by using statistical significance instead of magnitude-based thresholding.
Details
Motivation: Data normalization, a common preprocessing step, severely distorts governing equation discovery in sparse regression methods like SINDy by magnifying noise and undermining sparsity assumptions, leading to dense, uninterpretable, and physically incorrect models.Method: Introduces Sequential Thresholding of Coefficient of Variation (STCV) which replaces magnitude-based thresholding with a dimensionless statistical metric called Coefficient Presence (CP) that assesses statistical validity and consistency of candidate terms in the model library.
Result: STCV consistently and significantly outperforms standard STLSQ and Ensemble-SINDy on normalized, noisy datasets across canonical dynamical systems and practical engineering problems, including a physical mass-spring-damper experiment.
Conclusion: STCV makes sparse system identification more reliable and automated for real-world applications by mitigating distorting effects of normalization, enhancing model interpretability and trustworthiness.
Abstract: Data normalisation, a common and often necessary preprocessing step in engineering and scientific applications, can severely distort the discovery of governing equations by magnitudebased sparse regression methods. This issue is particularly acute for the Sparse Identification of Nonlinear Dynamics (SINDy) framework, where the core assumption of sparsity is undermined by the interaction between data scaling and measurement noise. The resulting discovered models can be dense, uninterpretable, and physically incorrect. To address this critical vulnerability, we introduce the Sequential Thresholding of Coefficient of Variation (STCV), a novel, computationally efficient sparse regression algorithm that is inherently robust to data scaling. STCV replaces conventional magnitude-based thresholding with a dimensionless statistical metric, the Coefficient Presence (CP), which assesses the statistical validity and consistency of candidate terms in the model library. This shift from magnitude to statistical significance makes the discovery process invariant to arbitrary data scaling. Through comprehensive benchmarking on canonical dynamical systems and practical engineering problems, including a physical mass-spring-damper experiment, we demonstrate that STCV consistently and significantly outperforms standard Sequential Thresholding Least Squares (STLSQ) and Ensemble-SINDy (E-SINDy) on normalised, noisy datasets. The results show that STCV-based methods can successfully identify the correct, sparse physical laws even when other methods fail. By mitigating the distorting effects of normalisation, STCV makes sparse system identification a more reliable and automated tool for real-world applications, thereby enhancing model interpretability and trustworthiness.
[561] Stable-LoRA: Stabilizing Feature Learning of Low-Rank Adaptation
Yize Wu, Ke Gao, Ling Li, Yanjun Wu
Main category: cs.LG
TL;DR: Stable-LoRA: A weight-shrinkage optimization strategy that dynamically enhances stability of LoRA feature learning by progressively shrinking matrix A during early training steps.
Details
Motivation: While LoRA is empirically effective for fine-tuning LLMs, its theoretical foundations regarding feature learning stability are insufficiently understood. The paper identifies that necessary non-zero initialization of matrix A compromises self-stability, leading to suboptimal performance.Method: Proposes Stable-LoRA, which dynamically shrinks matrix A during the earliest training steps to eliminate instability while preserving the benefits of non-zero initialization. This weight-shrinkage optimization strategy enhances stability of LoRA feature learning.
Result: Stable-LoRA consistently outperforms other baselines across diverse models and tasks, with no additional memory usage and only negligible computation overheads.
Conclusion: The paper provides theoretical foundations for LoRA’s feature learning stability and introduces Stable-LoRA as an effective solution to address instability issues while maintaining parameter efficiency.
Abstract: Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient method for fine-tuning Large Langauge Models. It updates the weight matrix as $W=W_0+sBA$, where $W_0$ is the original frozen weight, $s$ is a scaling factor and $A$,$B$ are trainable low-rank matrices. Despite its robust empirical effectiveness, the theoretical foundations of LoRA remain insufficiently understood, particularly with respect to feature learning stability. In this paper, we first establish that, LoRA can, in principle, naturally achieve and sustain stable feature learning (i.e., be self-stabilized) under appropriate hyper-parameters and initializations of $A$ and $B$. However, we also uncover a fundamental limitation that the necessary non-zero initialization of $A$ compromises self-stability, leading to suboptimal performances. To address this challenge, we propose Stable-LoRA, a weight-shrinkage optimization strategy that dynamically enhances stability of LoRA feature learning. By progressively shrinking $A$ during the earliest training steps, Stable-LoRA is both theoretically and empirically validated to effectively eliminate instability of LoRA feature learning while preserving the benefits of the non-zero start. Experiments show that Stable-LoRA consistently outperforms other baselines across diverse models and tasks, with no additional memory usage and only negligible computation overheads. The code is available at https://github.com/Yize-Wu/Stable-LoRA.
[562] Early Warning of Intraoperative Adverse Events via Transformer-Driven Multi-Label Learning
Xueyao Wang, Xiuding Cai, Honglin Shang, Yaoyao Zhu, Yu Yao
Main category: cs.LG
TL;DR: IAENet: A Transformer-based multi-label learning framework for early warning of multiple intraoperative adverse events, addressing event dependencies, heterogeneous data fusion, and class imbalance through improved TAFiLM modules and Label-Constrained Reweighting Loss.
Details
Motivation: Early warning of intraoperative adverse events is crucial for patient safety, but existing deep learning approaches have limitations: they overlook dependencies between adverse events, underutilize heterogeneous clinical data, and suffer from class imbalance in medical datasets.Method: Constructed the first Multi-label Adverse Events dataset (MuAE) covering six critical events. Proposed IAENet with improved Time-Aware Feature-wise Linear Modulation (TAFiLM) module for fusing static covariates with dynamic variables and modeling temporal dependencies. Introduced Label-Constrained Reweighting Loss (LCRLoss) with co-occurrence regularization to address intra-event imbalance and enforce consistency among frequently co-occurring events.
Result: IAENet consistently outperformed strong baselines on 5, 10, and 15-minute early warning tasks, achieving improvements of +5.05%, +2.82%, and +7.57% on average F1 score respectively.
Conclusion: The proposed IAENet framework demonstrates strong potential for supporting intelligent intraoperative decision-making in clinical practice by effectively predicting multiple adverse events with improved accuracy.
Abstract: Early warning of intraoperative adverse events plays a vital role in reducing surgical risk and improving patient safety. While deep learning has shown promise in predicting the single adverse event, several key challenges remain: overlooking adverse event dependencies, underutilizing heterogeneous clinical data, and suffering from the class imbalance inherent in medical datasets. To address these issues, we construct the first Multi-label Adverse Events dataset (MuAE) for intraoperative adverse events prediction, covering six critical events. Next, we propose a novel Transformerbased multi-label learning framework (IAENet) that combines an improved Time-Aware Feature-wise Linear Modulation (TAFiLM) module for static covariates and dynamic variables robust fusion and complex temporal dependencies modeling. Furthermore, we introduce a Label-Constrained Reweighting Loss (LCRLoss) with co-occurrence regularization to effectively mitigate intra-event imbalance and enforce structured consistency among frequently co-occurring events. Extensive experiments demonstrate that IAENet consistently outperforms strong baselines on 5, 10, and 15-minute early warning tasks, achieving improvements of +5.05%, +2.82%, and +7.57% on average F1 score. These results highlight the potential of IAENet for supporting intelligent intraoperative decision-making in clinical practice.
[563] The Geometric Inductive Bias of Grokking: Bypassing Phase Transitions via Architectural Topology
Alper Yıldırım
Main category: cs.LG
TL;DR: Architectural interventions in Transformers can eliminate grokking delays by removing unbounded magnitude degrees of freedom and data-dependent attention routing, accelerating generalization in modular arithmetic tasks.
Details
Motivation: To understand how architectural topology influences training dynamics like grokking (delayed generalization), moving beyond post-hoc analysis to test hypotheses through direct architectural interventions.Method: Two interventions: 1) Spherical topology with L2 normalization throughout residual stream and fixed-temperature unembedding to remove magnitude-based degrees of freedom; 2) Uniform Attention Ablation replacing data-dependent routing with uniform distribution, reducing attention to CBOW aggregator. Tested on cyclic modular addition (Zp) and non-commutative S5 permutation as negative control.
Result: Spherical topology reduced grokking onset time by over 20x without weight decay. Uniform Attention Ablation achieved 100% generalization across all seeds and eliminated grokking delay entirely. However, spherical constraints on non-commutative S5 permutation did not accelerate generalization.
Conclusion: Architectural degrees of freedom substantially influence grokking, and eliminating memorization phases depends on aligning architectural priors with task symmetries, providing interventional evidence for structural perspective on training dynamics.
Abstract: Mechanistic interpretability typically relies on post-hoc analysis of trained networks. We instead adopt an interventional approach: testing hypotheses a priori by modifying architectural topology to observe training dynamics. We study grokking - delayed generalization in Transformers trained on cyclic modular addition (Zp) - investigating if specific architectural degrees of freedom prolong the memorization phase. We identify two independent structural factors in standard Transformers: unbounded representational magnitude and data-dependent attention routing. First, we introduce a fully bounded spherical topology enforcing L2 normalization throughout the residual stream and an unembedding matrix with a fixed temperature scale. This removes magnitude-based degrees of freedom, reducing grokking onset time by over 20x without weight decay. Second, a Uniform Attention Ablation overrides data-dependent query-key routing with a uniform distribution, reducing the attention layer to a Continuous Bag-of-Words (CBOW) aggregator. Despite removing adaptive routing, these models achieve 100% generalization across all seeds and bypass the grokking delay entirely. To evaluate whether this acceleration is a task-specific geometric alignment rather than a generic optimization stabilizer, we use non-commutative S5 permutation composition as a negative control. Enforcing spherical constraints on S5 does not accelerate generalization. This suggests eliminating the memorization phase depends strongly on aligning architectural priors with the task’s intrinsic symmetries. Together, these findings provide interventional evidence that architectural degrees of freedom substantially influence grokking, suggesting a predictive structural perspective on training dynamics.
[564] SlideSparse: Fast and Flexible (2N-2):2N Structured Sparsity
Hanyong Shao, Yingbo Hao, Ting Song, Yan Xia, Di Zhang, Shaohan Huang, Xun Wu, Songchen Xu, Le Xu, Li Dong, Zewen Chi, Yi Zou, Furu Wei
Main category: cs.LG
TL;DR: SlideSparse enables Sparse Tensor Core acceleration for (2N-2):2N sparsity patterns (e.g., 6:8) on commodity GPUs, achieving near-theoretical speedups while preserving LLM accuracy.
Details
Motivation: Current NVIDIA Sparse Tensor Cores only support 2:4 sparsity (50% pruning), which severely degrades LLM reasoning accuracy. Milder (2N-2):2N patterns (e.g., 6:8, 25% pruning) preserve accuracy but lack hardware support, forcing dense execution without sparsity benefits.Method: SlideSparse uses Sliding Window Decomposition to reconstruct any (2N-2):2N weight block into N-1 overlapping 2:4-compliant windows without accuracy loss, and Activation Lifting fuses activation rearrangement into per-token quantization at minimal cost.
Result: Integrated into vLLM, SlideSparse achieves 1.33x speedup on compute-bound workloads for Qwen2.5-7B at 6:8 sparsity, approaching the theoretical upper bound of N/(N-1)=4/3, while preserving model accuracy across various GPUs, precisions, and model families.
Conclusion: SlideSparse establishes (2N-2):2N sparsity patterns as a practical path to accuracy-preserving LLM acceleration on commodity GPUs, bridging the gap between hardware limitations and model accuracy requirements.
Abstract: NVIDIA’s 2:4 Sparse Tensor Cores deliver 2x throughput but demand strict 50% pruning – a ratio that collapses LLM reasoning accuracy (Qwen3: 54% to 15%). Milder $(2N-2):2N$ patterns (e.g., 6:8, 25% pruning) preserve accuracy yet receive no hardware support, falling back to dense execution without any benefit from sparsity. We present SlideSparse, the first system to unlock Sparse Tensor Core acceleration for the $(2N-2):2N$ model family on commodity GPUs. Our Sliding Window Decomposition reconstructs any $(2N-2):2N$ weight block into $N-1$ overlapping 2:4-compliant windows without any accuracy loss; Activation Lifting fuses the corresponding activation rearrangement into per-token quantization at near-zero cost. Integrated into vLLM, SlideSparse is evaluated across various GPUs (A100, H100, B200, RTX 4090, RTX 5080, DGX-spark), precisions (FP4, INT8, FP8, BF16, FP16), and model families (Llama, Qwen, BitNet). On compute-bound workloads, the measured speedup ratio (1.33x) approaches the theoretical upper-bound $N/(N-1)=4/3$ at 6:8 weight sparsity in Qwen2.5-7B, establishing $(2N-2):2N$ as a practical path to accuracy-preserving LLM acceleration. Code available at https://github.com/bcacdwk/vllmbench.
[565] Recursive Inference Machines for Neural Reasoning
Mieszko Komisarczyk, Saurabh Mathur, Maurice Kraus, Sriraam Natarajan, Kristian Kersting
Main category: cs.LG
TL;DR: RIMs bridge neural backbones with classical inference engines, improving reasoning performance on benchmarks like ARC-AGI and Sudoku, and outperforming TabPFNs on tabular data.
Details
Motivation: To bridge neural reasoners (like TRMs) with classical stochastic reasoning systems by incorporating explicit recursive inference mechanisms inspired by traditional inference engines.Method: Introduces Recursive Inference Machines (RIMs) - a neural reasoning framework that combines neural backbones with recursive inference mechanisms. Shows TRMs can be expressed as RIM instances and extends them with a reweighting component.
Result: Achieves better performance on challenging reasoning benchmarks (ARC-AGI-1, ARC-AGI-2, Sudoku Extreme) and improves reasoning on tabular data classification, outperforming TabPFNs.
Conclusion: RIMs successfully bridge neural and classical reasoning paradigms, demonstrating improved performance across multiple reasoning tasks through explicit incorporation of recursive inference mechanisms.
Abstract: Neural reasoners such as Tiny Recursive Models (TRMs) solve complex problems by combining neural backbones with specialized inference schemes. Such inference schemes have been a central component of stochastic reasoning systems, where inference rules are applied to a stochastic model to derive answers to complex queries. In this work, we bridge these two paradigms by introducing Recursive Inference Machines (RIMs), a neural reasoning framework that explicitly incorporates recursive inference mechanisms inspired by classical inference engines. We show that TRMs can be expressed as an instance of RIMs, allowing us to extend them through a reweighting component, yielding better performance on challenging reasoning benchmarks, including ARC-AGI-1, ARC-AGI-2, and Sudoku Extreme. Furthermore, we show that RIMs can be used to improve reasoning on other tasks, such as the classification of tabular data, outperforming TabPFNs.
[566] A Behaviour-Aware Federated Forecasting Framework for Distributed Stand-Alone Wind Turbines
Bowen Li, Xiufeng Liu, Maria Sinziiana Astefanoaei
Main category: cs.LG
TL;DR: Federated learning framework for wind power forecasting using behavioral clustering and LSTM models to address privacy and heterogeneity concerns in distributed turbine data.
Details
Motivation: Centralizing turbine data for wind power forecasting raises privacy, cost, and heterogeneity concerns. There's a need for privacy-preserving solutions that can handle distributed, heterogeneous turbine fleets while maintaining forecasting accuracy.Method: Two-stage federated learning framework: 1) Clusters turbines by long-term behavioral statistics using Double Roulette Selection (DRS) initialization with recursive Auto-split refinement, 2) Trains cluster-specific LSTM models via Federated Averaging (FedAvg).
Result: Experiments on 400 stand-alone turbines in Denmark show DRS-auto discovers behaviorally coherent groups and achieves competitive forecasting accuracy while preserving data locality. Behavior-aware grouping outperforms geographic partitioning and matches strong k-means++ baselines.
Conclusion: The proposed framework provides a practical privacy-friendly solution for heterogeneous distributed turbine fleets, demonstrating that behavioral clustering combined with federated learning can effectively address privacy and heterogeneity challenges in wind power forecasting.
Abstract: Accurate short-term wind power forecasting is essential for grid dispatch and market operations, yet centralising turbine data raises privacy, cost, and heterogeneity concerns. We propose a two-stage federated learning framework that first clusters turbines by long-term behavioural statistics using Double Roulette Selection (DRS) initialisation with recursive Auto-split refinement, and then trains cluster-specific LSTM models via FedAvg. Experiments on 400 stand-alone turbines in Denmark show that DRS-auto discovers behaviourally coherent groups and achieves competitive forecasting accuracy while preserving data locality. Behaviour-aware grouping consistently outperforms geographic partitioning and matches strong k-means++ baselines, suggesting a practical privacy-friendly solution for heterogeneous distributed turbine fleets.
[567] Beyond Word Error Rate: Auditing the Diversity Tax in Speech Recognition through Dataset Cartography
Ting-Hui Cheng, Line H. Clemmensen, Sneha Das
Main category: cs.LG
TL;DR: The paper critiques WER as insufficient for ASR evaluation, introduces SDI to quantify demographic/acoustic factors in model failures, and proposes semantic metrics (EmbER, SemDist) to expose hidden biases and enable pre-deployment auditing.
Details
Motivation: Current ASR evaluation relies heavily on Word Error Rate (WER), which fails to capture semantic fidelity and obscures the "diversity tax" - disproportionate burden on marginalized/atypical speakers due to systematic recognition failures. There's a need for better metrics that reveal hidden biases.Method: 1) Systematically evaluate broader class of non-linear and semantic metrics beyond lexical counts. 2) Introduce Sample Difficulty Index (SDI) to quantify how intrinsic demographic and acoustic factors drive model failure. 3) Map SDI on data cartography to visualize biases. 4) Compare metrics like EmbER and SemDist against traditional WER.
Result: EmbER and SemDist expose hidden systemic biases and inter-model disagreements that WER ignores. SDI successfully quantifies demographic/acoustic factors contributing to model failures. The approach enables identification of disproportionate burdens on marginalized speakers.
Conclusion: The paper establishes initial steps toward a robust audit framework for prospective safety analysis in ASR systems, empowering developers to audit and mitigate disparities before deployment through better evaluation metrics.
Abstract: Automatic speech recognition (ASR) systems are predominantly evaluated using the Word Error Rate (WER). However, raw token-level metrics fail to capture semantic fidelity and routinely obscures the `diversity tax’, the disproportionate burden on marginalized and atypical speaker due to systematic recognition failures. In this paper, we explore the limitations of relying solely on lexical counts by systematically evaluating a broader class of non-linear and semantic metrics. To enable rigorous model auditing, we introduce the sample difficulty index (SDI), a novel metric that quantifies how intrinsic demographic and acoustic factors drive model failure. By mapping SDI on data cartography, we demonstrate that metrics EmbER and SemDist expose hidden systemic biases and inter-model disagreements that WER ignores. Finally, our findings are the first steps towards a robust audit framework for prospective safety analysis, empowering developers to audit and mitigate ASR disparities prior to deployment.
[568] Whispering to a Blackbox: Bootstrapping Frozen OCR with Visual Prompts
Samandar Samandarov, Nazirjon Ismoiljonov, Abdullah Sattorov, Temirlan Sabyrbayev
Main category: cs.LG
TL;DR: Whisperer is a visual prompting framework that learns diffusion-based preprocessors to adapt inputs for frozen downstream models, improving OCR performance without modifying model weights.
Details
Motivation: Frozen pre-trained models are stable and efficient but often underperform on specific tasks due to data distribution mismatches. There's a need to adapt these models to specific domains without retraining or fine-tuning their weights.Method: Uses a diffusion-based visual prompting framework that learns to preprocess inputs in pixel space. Employs a four-stage training curriculum with behavioral cloning of stochastically discovered improvement policies, where intermediate diffusion outputs are sampled, those that improve OCR performance are selected, and the model is trained to reproduce them.
Result: Achieves 8% absolute (10.6% relative) reduction in Character Error Rate on 300k degraded synthetic text images, surpassing hand-engineered baselines like CLAHE. Training requires only 60 GPU-hours across 4 stages.
Conclusion: The approach enables improvement of frozen downstream models by adapting inputs rather than modifying weights, providing a sample-efficient alternative to reinforcement learning through behavioral cloning of exploration policies.
Abstract: In the landscape of modern machine learning, frozen pre-trained models provide stability and efficiency but often underperform on specific tasks due to mismatched data distributions. This paper introduces the Whisperer, a novel visual prompting framework that learns diffusion-based preprocessors to adapt inputs in pixel space, effectively “whispering” enhancements to frozen downstream models like EasyOCR. By framing the process as behavioral cloning of stochastically discovered improvement policies, our method achieves an 8% absolute (10.6% relative) reduction in Character Error Rate (CER) on a challenging dataset of 300k degraded synthetic text images, surpassing hand-engineered baselines such as CLAHE. The key innovation is a four-stage training curriculum that uses behavioral cloning to amplify “lucky” improvements discovered through the stochastic exploration of a partially trained diffusion model. This approach is highly sample-efficient and avoids the pitfalls of traditional reinforcement learning. Crucially, we frame this not as naive reinforcement learning, but as behavioral cloning of an exploration policy: we stochastically sample intermediate diffusion outputs, select those that improve CER by chance, and then train the model to reproduce them. This bootstrapping curriculum (4 stages over 60 GPU-hours) amplifies random successes into a systematic strategy. In summary, by whispering to the frozen OCR through its inputs, we improve an imperfect classifier without touching its weights.
[569] Knowledge Divergence and the Value of Debate for Scalable Oversight
Robin Young
Main category: cs.LG
TL;DR: Theoretical analysis connecting AI safety debate to RLAIF through geometric framework of knowledge divergence between models, showing debate advantage depends on representation subspace angles and has phase transitions between regimes.
Details
Motivation: To establish formal relationship between debate and reinforcement learning from AI feedback (RLAIF) for scalable oversight, and characterize when debate offers advantages over single-agent methods.Method: Parameterizes debate’s value through geometry of knowledge divergence using principal angles between models’ representation subspaces. Provides mathematical proofs for debate advantage closed form and analyzes three regimes: shared, one-sided, and compositional knowledge divergence.
Result: When models share identical training data, debate reduces to RLAIF-like with no advantage. Debate advantage scales with phase transition from quadratic (negligible benefit) to linear (essential) regimes. Shows debate can achieve outcomes inaccessible to either model alone, but adversarial incentives cause coordination failure in compositional regime with sharp threshold.
Conclusion: Provides first formal connection between debate and RLAIF, geometric foundation for understanding when adversarial oversight protocols are justified, and insights into eliciting latent knowledge across complementary models.
Abstract: AI safety via debate and reinforcement learning from AI feedback (RLAIF) are both proposed methods for scalable oversight of advanced AI systems, yet no formal framework relates them or characterizes when debate offers an advantage. We analyze this by parameterizing debate’s value through the geometry of knowledge divergence between debating models. Using principal angles between models’ representation subspaces, we prove that the debate advantage admits an exact closed form. When models share identical training corpora, debate reduces to RLAIF-like where a single-agent method recovers the same optimum. When models possess divergent knowledge, debate advantage scales with a phase transition from quadratic regime (debate offers negligible benefit) to linear regime (debate is essential). We classify three regimes of knowledge divergence (shared, one-sided, and compositional) and provide existence results showing that debate can achieve outcomes inaccessible to either model alone, alongside a negative result showing that sufficiently strong adversarial incentives cause coordination failure in the compositional regime, with a sharp threshold separating effective from ineffective debate. We offer the first formal connection between debate and RLAIF, a geometric foundation for understanding when adversarial oversight protocols are justified, and connection to the problem of eliciting latent knowledge across models with complementary information.
[570] GALACTIC: Global and Local Agnostic Counterfactuals for Time-series Clustering
Christos Fragkathoulas, Eleni Psaroudaki, Themis Palpanas, Evaggelia Pitoura
Main category: cs.LG
TL;DR: GALACTIC is a unified framework for counterfactual explanations in unsupervised time-series clustering, generating both local instance-level perturbations and global cluster-level summaries with theoretical guarantees.
Details
Motivation: Existing explainability methods for time-series clustering fail to identify transitions across cluster boundaries, and counterfactual explanations have been mostly confined to supervised settings, creating a gap for unsupervised clustering interpretability.Method: GALACTIC uses cluster-aware optimization for local counterfactual generation and Minimum Description Length (MDL) objective for global representative selection, proving supermodularity to enable efficient greedy algorithm with approximation guarantees.
Result: Extensive experiments on UCR Archive show GALACTIC produces significantly sparser local counterfactuals and more concise global summaries than adapted baselines.
Conclusion: GALACTIC offers the first unified approach for interpreting clustered time-series through counterfactuals, bridging local and global explainability with theoretical guarantees.
Abstract: Time-series clustering is a fundamental tool for pattern discovery, yet existing explainability methods, primarily based on feature attribution or metadata, fail to identify the transitions that move an instance across cluster boundaries. While Counterfactual Explanations (CEs) identify the minimal temporal perturbations required to alter the prediction of a model, they have been mostly confined to supervised settings. This paper introduces GALACTIC, the first unified framework to bridge local and global counterfactual explainability for unsupervised time-series clustering. At instance level (local), GALACTIC generates perturbations via a cluster-aware optimization objective that respects the target and underlying cluster assignments. At cluster level (global), to mitigate cognitive load and enhance interpretability, we formulate a representative CE selection problem. We propose a Minimum Description Length (MDL) objective to extract a non-redundant summary of global explanations that characterize the transitions between clusters. We prove that our MDL objective is supermodular, which allows the corresponding MDL reduction to be framed as a monotone submodular set function. This enables an efficient greedy selection algorithm with provable $(1-1/e)$ approximation guarantees. Extensive experimental evaluation on the UCR Archive demonstrates that GALACTIC produces significantly sparser local CEs and more concise global summaries than state-of-the-art baselines adapted for our problem, offering the first unified approach for interpreting clustered time-series through counterfactuals.
[571] Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
Helena Casademunt, Bartosz Cywiński, Khoi Tran, Arya Jakkli, Samuel Marks, Neel Nanda
Main category: cs.LG
TL;DR: The paper studies honesty elicitation and lie detection techniques on Chinese LLMs that censor politically sensitive topics, finding that certain prompting and fine-tuning methods increase truthful responses but don’t fully eliminate falsehoods.
Details
Motivation: Previous work evaluates honesty techniques on artificially trained lying models, but this may not resemble natural dishonesty. The authors instead study Chinese LLMs that naturally censor politically sensitive topics, providing a more realistic testbed for honesty elicitation and lie detection methods.Method: The authors use open-weights LLMs from Chinese developers (Qwen3 models) that censor topics like Falun Gong and Tiananmen protests. They evaluate honesty elicitation techniques (sampling without chat template, few-shot prompting, fine-tuning on generic honesty data) and lie detection methods (prompting censored model to classify its own responses, linear probes on unrelated data). They also test transfer to frontier models like DeepSeek R1.
Result: For honesty elicitation, sampling without chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes offer a cheaper alternative. The techniques transfer to other models like DeepSeek R1, but no method fully eliminates false responses.
Conclusion: The paper provides a realistic testbed for evaluating honesty techniques using naturally censoring Chinese LLMs. While certain methods improve truthfulness, complete elimination of false responses remains challenging. The findings suggest practical approaches for honesty elicitation and lie detection that work across different model architectures.
Abstract: Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation – modifying prompts or weights so that the model answers truthfully – and lie detection – classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. The strongest honesty elicitation techniques also transfer to frontier open-weights models including DeepSeek R1. Notably, no technique fully eliminates false responses. We release all prompts, code, and transcripts.
[572] FairFinGAN: Fairness-aware Synthetic Financial Data Generation
Tai Le Quy, Dung Nguyen Tuan, Trung Nguyen Thanh, Duy Tran Cong, Huyen Giang Thi Thu, Frank Hopfgartner
Main category: cs.LG
TL;DR: FairFinGAN: A WGAN-based framework for generating synthetic financial data with fairness constraints to mitigate bias while preserving data utility.
Details
Motivation: Financial datasets often contain biases that can lead to unfair decision-making in automated systems, creating a need for methods that can generate synthetic data while addressing fairness concerns.Method: Proposes FairFinGAN, a Wasserstein GAN-based framework that incorporates fairness constraints directly into training through a classifier, ensuring synthetic data is both fair and maintains utility for downstream predictive tasks.
Result: Evaluated on five real-world financial datasets, FairFinGAN achieves superior fairness metrics compared to existing GAN-based methods without significant loss in data utility.
Conclusion: FairFinGAN demonstrates potential as an effective tool for bias-aware data generation in financial applications, balancing fairness and utility in synthetic data creation.
Abstract: Financial datasets often suffer from bias that can lead to unfair decision-making in automated systems. In this work, we propose FairFinGAN, a WGAN-based framework designed to generate synthetic financial data while mitigating bias with respect to the protected attribute. Our approach incorporates fairness constraints directly into the training process through a classifier, ensuring that the synthetic data is both fair and preserves utility for downstream predictive tasks. We evaluate our proposed model on five real-world financial datasets and compare it with existing GAN-based data generation methods. Experimental results show that our approach achieves superior fairness metrics without significant loss in data utility, demonstrating its potential as a tool for bias-aware data generation in financial applications.
[573] POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation
Zeju Qiu, Lixin Liu, Adrian Weller, Han Shi, Weiyang Liu
Main category: cs.LG
TL;DR: POET-X is a scalable, memory-efficient variant of POET that enables billion-parameter LLM pretraining on a single GPU by reducing computational overhead while maintaining training stability.
Details
Motivation: The original POET framework provides strong training stability for LLMs but suffers from high memory consumption and computational overhead due to intensive matrix multiplications, limiting its practical scalability.Method: POET-X performs orthogonal equivalence transformations with significantly reduced computational cost through optimization techniques that maintain the spectrum-preserving properties of POET while improving efficiency.
Result: POET-X enables pretraining of billion-parameter LLMs on a single Nvidia H100 GPU, whereas standard optimizers like AdamW run out of memory under the same settings, while maintaining generalization and stability benefits.
Conclusion: POET-X provides a practical solution for efficient and stable training of large language models by addressing the memory and computational limitations of the original POET framework.
Abstract: Efficient and stable training of large language models (LLMs) remains a core challenge in modern machine learning systems. To address this challenge, Reparameterized Orthogonal Equivalence Training (POET), a spectrum-preserving framework that optimizes each weight matrix through orthogonal equivalence transformation, has been proposed. Although POET provides strong training stability, its original implementation incurs high memory consumption and computational overhead due to intensive matrix multiplications. To overcome these limitations, we introduce POET-X, a scalable and memory-efficient variant that performs orthogonal equivalence transformations with significantly reduced computational cost. POET-X maintains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency. In our experiments, POET-X enables the pretraining of billion-parameter LLMs on a single Nvidia H100 GPU, and in contrast, standard optimizers such as AdamW run out of memory under the same settings.
[574] Preserving Continuous Symmetry in Discrete Spaces: Geometric-Aware Quantization for SO(3)-Equivariant GNNs
Haoyu Zhou, Ping Xue, Hao Zhang, Tianfan Fu
Main category: cs.LG
TL;DR: GAQ framework enables efficient quantization of equivariant GNNs for molecular simulations while preserving SO(3) symmetry through magnitude-direction decoupling and symmetry-aware training.
Details
Motivation: Equivariant GNNs are crucial for physically consistent molecular simulations but suffer from high computational costs and memory bottlenecks. Naive quantization destroys SO(3)-equivariant structure, leading to errors and conservation law violations.Method: Proposes Geometric-Aware Quantization (GAQ) with three key components: 1) Magnitude-Direction Decoupled Quantization (MDDQ) separating invariant lengths from equivariant orientations, 2) symmetry-aware training with distinct quantization schedules for scalar and vector features, and 3) robust attention normalization for gradient stability in low-bit regimes.
Result: W4A8 models match FP32 baseline accuracy (9.31 meV vs. 23.20 meV) on rMD17 benchmark while reducing Local Equivariance Error by over 30x compared to naive quantization. Achieves 2.39x inference speedup and 4x memory reduction on consumer hardware.
Conclusion: GAQ enables efficient compression and acceleration of equivariant models while rigorously preserving continuous symmetry in discrete spaces, facilitating stable, energy-conserving molecular dynamics simulations for extended timescales.
Abstract: Equivariant Graph Neural Networks (GNNs) are essential for physically consistent molecular simulations but suffer from high computational costs and memory bottlenecks, especially with high-order representations. While low-bit quantization offers a solution, applying it naively to rotation-sensitive features destroys the SO(3)-equivariant structure, leading to significant errors and violations of conservation laws. To address this issue, in this work, we propose a Geometric-Aware Quantization (GAQ) framework that compresses and accelerates equivariant models while rigorously preserving continuous symmetry in discrete spaces. Our approach introduces three key contributions: (1) a Magnitude-Direction Decoupled Quantization (MDDQ) scheme that separates invariant lengths from equivariant orientations to maintain geometric fidelity; (2) a symmetry-aware training strategy that treats scalar and vector features with distinct quantization schedules; and (3) a robust attention normalization mechanism to stabilize gradients in low-bit regimes. Experiments on the rMD17 benchmark demonstrate that our W4A8 models match the accuracy of FP32 baselines (9.31 meV vs. 23.20 meV) while reducing Local Equivariance Error (LEE) by over 30x compared to naive quantization. On consumer hardware, GAQ achieves 2.39x inference speedup and 4x memory reduction, enabling stable, energy-conserving molecular dynamics simulations for nanosecond timescales.
[575] InfoFlow KV: Information-Flow-Aware KV Recomputation for Long Context
Xin Teng, Canyu Zhang, Shaoyi Zheng, Danyang Zhuo, Tianyi Zhou, Shengjie Wang
Main category: cs.LG
TL;DR: Selective KV recomputation for RAG using attention-norm signal to identify influential tokens, with information-flow-guided chunk reordering for long-context QA.
Details
Motivation: Retrieval-augmented generation for long-context QA is bottlenecked by inference-time prefilling over large retrieved contexts. Existing methods for selective KV recomputation rely on heuristics or representation discrepancies without modeling whether selected tokens can effectively influence generation.Method: Cast selective KV recomputation as an information flow problem, using attention-norm signal from the query to identify tokens that are both semantically relevant and structurally positioned to propagate information. Reconstruct global positional assignments for retrieved chunks and introduce information-flow-guided chunk reordering strategy.
Result: Experiments on LLM and VLM benchmarks demonstrate consistent gains over prior methods under comparable efficiency budgets.
Conclusion: Attention-norm signal reliably identifies influential tokens for selective KV recomputation in RAG systems, enabling more efficient long-context question answering.
Abstract: Retrieval-augmented generation (RAG) for long-context question answering is bottlenecked by inference-time prefilling over large retrieved contexts. A common strategy is to precompute key-value (KV) caches for individual documents and selectively recompute a small subset of tokens to restore global causal dependencies, but existing methods rely on heuristics or representation discrepancies without modeling whether selected tokens can effectively influence generation. We cast selective KV recomputation as an information flow problem and show that a simple attention-norm signal from the query reliably identifies tokens that are both semantically relevant and structurally positioned to propagate information, when computed under an inference-consistent RoPE geometry. We therefore reconstruct global positional assignments for retrieved chunks and introduce an information-flow-guided chunk reordering strategy. Experiments on LLM and VLM benchmarks demonstrate consistent gains over prior methods under comparable efficiency budgets.
[576] Learning Causal Structure of Time Series using Best Order Score Search
Irene Gema Castillo Mansilla, Urmi Ninad
Main category: cs.LG
TL;DR: TS-BOSS extends BOSS algorithm to time series causal discovery using dynamic Bayesian networks with permutation-based search and grow-shrink trees for scalability.
Details
Motivation: Causal structure learning from observational time series data is challenging due to temporal dependence, requiring specialized methods that can handle dynamic settings while maintaining scalability and performance.Method: TS-BOSS extends BOSS algorithm to time series by performing permutation-based search over dynamic Bayesian network structures, using grow-shrink trees to cache intermediate score computations for scalability.
Result: TS-BOSS shows strong performance in high auto-correlation regimes, achieving higher adjacency recall at comparable precision than standard constraint-based methods on synthetic data.
Conclusion: TS-BOSS provides a scalable, high-performing approach for time series causal discovery and bridges permutation-based causal learning theory to dynamic settings.
Abstract: Causal structure learning from observational data is central to many scientific and policy domains, but the time series setting common to many disciplines poses several challenges due to temporal dependence. In this paper we focus on score-based causal discovery for multivariate time series and introduce TS-BOSS, a time series extension of the recently proposed Best Order Score Search (BOSS) (Andrews et al. 2023). TS-BOSS performs a permutation-based search over dynamic Bayesian network structures while leveraging grow-shrink trees to cache intermediate score computations, preserving the scalability and strong empirical performance of BOSS in the static setting. We provide theoretical guarantees establishing the soundness of TS-BOSS under suitable assumptions, and we present an intermediate result that extends classical subgraph minimality results for permutation-based methods to the dynamic (time series) setting. Our experiments on synthetic data show that TS-BOSS is especially effective in high auto-correlation regimes, where it consistently achieves higher adjacency recall at comparable precision than standard constraint-based methods. Overall, TS-BOSS offers a high-performing, scalable approach for time series causal discovery and our results provide a principled bridge for extending sparsity-based, permutation-driven causal learning theory to dynamic settings.
[577] Embedded Inter-Subject Variability in Adversarial Learning for Inertial Sensor-Based Human Activity Recognition
Francisco M. Calatrava-Nicolás, Shoko Miyauchi, Vitor Fortes Rey, Paul Lukowicz, Todor Stoyanov, Oscar Martinez Mozos
Main category: cs.LG
TL;DR: A deep adversarial framework for Human Activity Recognition that reduces inter-subject variability by encouraging subject-invariant feature representations, improving generalization to unseen individuals.
Details
Motivation: Human Activity Recognition models struggle with generalization to new individuals due to inter-subject variability - the same activity is performed differently by different people. This limits practical deployment of HAR systems.Method: Proposes a novel deep adversarial framework that explicitly integrates inter-subject variability into the adversarial task. The approach encourages subject-invariant feature representations while maintaining activity classification performance.
Result: Outperforms previous methods on three established HAR datasets using leave-one-subject-out cross-validation. The adversarial task effectively reduces inter-subject variability in feature space and outperforms previous adversarial approaches.
Conclusion: The proposed adversarial framework successfully addresses inter-subject variability in HAR, improving generalization to unseen individuals through subject-invariant feature learning.
Abstract: This paper addresses the problem of Human Activity Recognition (HAR) using data from wearable inertial sensors. An important challenge in HAR is the model’s generalization capabilities to new unseen individuals due to inter-subject variability, i.e., the same activity is performed differently by different individuals. To address this problem, we propose a novel deep adversarial framework that integrates the concept of inter-subject variability in the adversarial task, thereby encouraging subject-invariant feature representations and enhancing the classification performance in the HAR problem. Our approach outperforms previous methods in three well-established HAR datasets using a leave-one-subject-out (LOSO) cross-validation. Further results indicate that our proposed adversarial task effectively reduces inter-subject variability among different users in the feature space, and it outperforms adversarial tasks from previous works when integrated into our framework. Code: https://github.com/FranciscoCalatrava/EmbeddedSubjectVariability.git
[578] Robust Node Affinities via Jaccard-Biased Random Walks and Rank Aggregation
Bastian Pfeifer, Michael G. Schimek
Main category: cs.LG
TL;DR: TopKGraphs: A random walk-based method for estimating node similarity in graphs using Jaccard similarity to bias transitions toward structurally similar neighborhoods, producing interpretable affinity matrices via robust rank aggregation.
Details
Motivation: Node similarity estimation is fundamental for network analysis and graph-based ML tasks like clustering, community detection, classification, and recommendation. Existing methods range from simple local measures to complex embedding approaches, but there's a need for interpretable, general-purpose methods that bridge these extremes.Method: Uses start-node-anchored random walks biased toward nodes with structurally similar neighborhoods (measured via Jaccard similarity). Instead of computing stationary distributions, treats walks as stochastic neighborhood samplers to produce partial node rankings, which are aggregated using robust rank aggregation to construct interpretable node-to-node affinity matrices.
Result: Achieves competitive or superior performance compared to Jaccard, Dice, personalized PageRank, and Node2Vec across synthetic graphs (stochastic block models, LFR benchmarks), k-NN graphs from tabular data, and protein-protein interaction networks. Demonstrates robustness in sparse, noisy, or heterogeneous networks.
Conclusion: TopKGraphs provides a versatile, interpretable, non-parametric tool that bridges simple local similarity measures with complex embedding-based approaches, facilitating both data mining and network analysis applications.
Abstract: Estimating node similarity is a fundamental task in network analysis and graph-based machine learning, with applications in clustering, community detection, classification, and recommendation. We propose TopKGraphs, a method based on start-node-anchored random walks that bias transitions toward nodes with structurally similar neighborhoods, measured via Jaccard similarity. Rather than computing stationary distributions, walks are treated as stochastic neighborhood samplers, producing partial node rankings that are aggregated using robust rank aggregation to construct interpretable node-to-node affinity matrices. TopKGraphs provides a non-parametric, interpretable, and general-purpose representation of node similarity that can be applied in both network analysis and machine learning workflows. We evaluate the method on synthetic graphs (stochastic block models, Lancichinetti-Fortunato-Radicchi benchmark graphs), k-nearest-neighbor graphs from tabular datasets, and a curated high-confidence protein-protein interaction network. Across all scenarios, TopKGraphs achieves competitive or superior performance compared to standard similarity measures (Jaccard, Dice), a diffusion-based method (personalized PageRank), and an embedding-based approach (Node2Vec), demonstrating robustness in sparse, noisy, or heterogeneous networks. These results suggest that TopKGraphs is a versatile and interpretable tool for bridging simple local similarity measures with more complex embedding-based approaches, facilitating both data mining and network analysis applications.
[579] On the Necessity of Learnable Sheaf Laplacians
Ferran Hernandez Caralt, Mar Gonzàlez i Català, Adrián Bazaga, Pietro Liò
Main category: cs.LG
TL;DR: Identity Sheaf Networks with fixed identity restriction maps perform comparably to complex sheaf-learning architectures on heterophilic graphs, questioning the necessity of learning restriction maps for mitigating oversmoothing.
Details
Motivation: To investigate whether the additional complexity of learning restriction maps in Sheaf Neural Networks (SNNs) is necessary for addressing oversmoothing on heterophilous graphs, or if simpler approaches like residual connections and normalization suffice.Method: Introduces an Identity Sheaf Network baseline where all restriction maps are fixed to identity, and uses it to ablate empirical improvements of sheaf-learning architectures. Also introduces Rayleigh quotient as a normalized measure for comparing oversmoothing across models.
Result: Across five popular heterophilic benchmarks, the identity baseline achieves comparable performance to a range of SNN variants. Identity Sheaf Networks do not appear to suffer more significant oversmoothing than their SNN counterparts.
Conclusion: The theoretical benefits of learning restriction maps in SNNs for mitigating oversmoothing are not empirically supported; simpler identity sheaf constructions perform equally well, questioning the necessity of complex sheaf-learning architectures.
Abstract: Sheaf Neural Networks (SNNs) were introduced as an extension of Graph Convolutional Networks to address oversmoothing on heterophilous graphs by attaching a sheaf to the input graph and replacing the adjacency-based operator with a sheaf Laplacian defined by (learnable) restriction maps. Prior work motivates this design through theoretical properties of sheaf diffusion and the kernel of the sheaf Laplacian, suggesting that suitable non-identity restriction maps can avoid representations converging to constants across connected components. Since oversmoothing can also be mitigated through residual connections and normalization, we revisit a trivial sheaf construction to ask whether the additional complexity of learning restriction maps is necessary. We introduce an Identity Sheaf Network baseline, where all restriction maps are fixed to the identity, and use it to ablate the empirical improvements reported by sheaf-learning architectures. Across five popular heterophilic benchmarks, the identity baseline achieves comparable performance to a range of SNN variants. Finally, we introduce the Rayleigh quotient as a normalized measure for comparing oversmoothing across models and show that, in trained networks, the behavior predicted by the diffusion-based analysis of SNNs is not reflected empirically. In particular, Identity Sheaf Networks do not appear to suffer more significant oversmoothing than their SNN counterparts.
[580] An interpretable prototype parts-based neural network for medical tabular data
Jacek Karolczak, Jerzy Stefanowski
Main category: cs.LG
TL;DR: A prototype-based interpretable neural network for medical tabular data that learns human-readable feature subsets as prototypes, enabling transparent clinical decision support.
Details
Motivation: Need for interpretable machine learning in healthcare where trust in model predictions is as critical as accuracy, inspired by prototype-based methods from computer vision but adapted for structured medical data.Method: Proposes a neural network for tabular medical data using trainable patching over patient features to learn prototypical parts as binary/discretized feature subsets, enabling concept-based predictions through latent space prototype comparison.
Result: The model achieves classification performance competitive with baseline models on medical benchmark datasets while providing transparency and interpretability.
Conclusion: The approach bridges the gap between predictive performance and interpretability in clinical decision support by offering human-readable, prototype-based explanations.
Abstract: The ability to interpret machine learning model decisions is critical in such domains as healthcare, where trust in model predictions is as important as their accuracy. Inspired by the development of prototype parts-based deep neural networks in computer vision, we propose a new model for tabular data, specifically tailored to medical records, that requires discretization of diagnostic result norms. Unlike the original vision models that rely on the spatial structure, our method employs trainable patching over features describing a patient, to learn meaningful prototypical parts from structured data. These parts are represented as binary or discretized feature subsets. This allows the model to express prototypes in human-readable terms, enabling alignment with clinical language and case-based reasoning. Our proposed neural network is inherently interpretable and offers interpretable concept-based predictions by comparing the patient’s description to learned prototypes in the latent space of the network. In experiments, we demonstrate that the model achieves classification performance competitive to widely used baseline models on medical benchmark datasets, while also offering transparency, bridging the gap between predictive performance and interpretability in clinical decision support.
[581] On-Policy Self-Distillation for Reasoning Compression
Hejian Sang, Yuanda Xu, Zhengze Zhou, Ran He, Zhipeng Wang, Jiachen Sun
Main category: cs.LG
TL;DR: OPSDC is a self-distillation method that teaches reasoning models to be more concise by distilling their own concise behavior back into themselves, achieving significant token reduction while improving accuracy on math reasoning tasks.
Details
Motivation: Reasoning models produce verbose outputs with much noise and redundancy, which can actually be harmful by compounding errors with unnecessary tokens. Current methods for compression often require ground-truth answers, token budgets, or difficulty estimators.Method: OPSDC uses on-policy self-distillation: condition the same model on a “be concise” instruction to obtain teacher logits, then minimize per-token reverse KL divergence on the student’s own rollouts. No external resources needed - just self-distillation.
Result: Achieved 57-59% token reduction on MATH-500 while improving accuracy by 9-16 absolute points on Qwen3-8B and Qwen3-14B. On AIME 2024, the 14B model gained 10 points with 41% compression. The method automatically compresses easy problems aggressively while preserving deliberation for hard ones.
Conclusion: Self-distillation alone can effectively compress reasoning models, revealing that much of what reasoning models produce is not just redundant but actively harmful. The simplicity of OPSDC belies its sophisticated ability to adapt compression based on problem difficulty.
Abstract: Reasoning models think out loud, but much of what they say is noise. We introduce OPSDC (On-Policy Self-Distillation for Reasoning Compression), a method that teaches models to reason more concisely by distilling their own concise behavior back into themselves. The entire approach reduces to one idea: condition the same model on a “be concise” instruction to obtain teacher logits, and minimize per-token reverse KL on the student’s own rollouts. No ground-truth answers, no token budgets, no difficulty estimators. Just self-distillation. Yet this simplicity belies surprising sophistication: OPSDC automatically compresses easy problems aggressively while preserving the deliberation needed for hard ones. On Qwen3-8B and Qwen3-14B, we achieve 57-59% token reduction on MATH-500 while improving accuracy by 9-16 points absolute. On AIME 2024, the 14B model gains 10 points with 41% compression. The secret? Much of what reasoning models produce is not just redundant-it is actively harmful, compounding errors with every unnecessary token.
[582] Latent Wasserstein Adversarial Imitation Learning
Siqi Yang, Kai Yan, Alexander G. Schwing, Yu-Xiong Wang
Main category: cs.LG
TL;DR: LWAIL is a novel adversarial imitation learning framework that uses Wasserstein distance in a dynamics-aware latent space to match state-only distributions, requiring only 1-2 expert episodes without actions.
Details
Motivation: Traditional imitation learning requires large amounts of medium-to-high-quality demonstrations with expert actions, which are often unavailable. The authors aim to reduce this dependency by developing a method that works with only state-only demonstrations and minimal expert data.Method: Proposes Latent Wasserstein Adversarial Imitation Learning (LWAIL) with two stages: 1) Pre-train an Intention Conditioned Value Function (ICVF) using randomly generated state-only data to create a dynamics-aware latent space, 2) Use Wasserstein distance in this latent space for state-only distribution matching between agent and expert trajectories.
Result: LWAIL outperforms prior Wasserstein-based and adversarial IL methods on multiple MuJoCo environments, achieving expert-level performance with only one or a few state-only expert episodes.
Conclusion: The dynamics-aware latent space enables effective imitation learning with minimal state-only expert data, reducing the dependency on large, high-quality demonstrations with actions.
Abstract: Imitation Learning (IL) enables agents to mimic expert behavior by learning from demonstrations. However, traditional IL methods require large amounts of medium-to-high-quality demonstrations as well as actions of expert demonstrations, both of which are often unavailable. To reduce this need, we propose Latent Wasserstein Adversarial Imitation Learning (LWAIL), a novel adversarial imitation learning framework that focuses on state-only distribution matching. It benefits from the Wasserstein distance computed in a dynamics-aware latent space. This dynamics-aware latent space differs from prior work and is obtained via a pre-training stage, where we train the Intention Conditioned Value Function (ICVF) to capture a dynamics-aware structure of the state space using a small set of randomly generated state-only data. We show that this enhances the policy’s understanding of state transitions, enabling the learning process to use only one or a few state-only expert episodes to achieve expert-level performance. Through experiments on multiple MuJoCo environments, we demonstrate that our method outperforms prior Wasserstein-based IL methods and prior adversarial IL methods, achieving better results across various tasks.
[583] Kraus Constrained Sequence Learning For Quantum Trajectories from Continuous Measurement
Priyanshi Singh, Krishna Bhatia
Main category: cs.LG
TL;DR: A neural network approach with Kraus-structured output layer for physically valid quantum state reconstruction from continuous measurement, ensuring CPTP constraints across various sequence models.
Details
Motivation: Standard quantum state reconstruction methods require exact models and are sensitive to parameter mismatch, while existing neural approaches can violate physical constraints like positivity and trace preservation.Method: Proposes a Kraus-structured output layer that converts hidden representations from sequence models (RNN, GRU, LSTM, TCN, ESN, Mamba, Neural ODE) into completely positive trace preserving quantum operations, ensuring physical validity by construction.
Result: Kraus-LSTM achieves strongest results, improving state estimation quality by 7% over unconstrained counterparts while guaranteeing physically valid predictions in non-stationary regimes with parameter drift.
Conclusion: The Kraus-structured output layer enables physically valid quantum state reconstruction across diverse sequence models, with LSTM-based architecture showing best performance for handling stochastic dynamics with parameter drift.
Abstract: Real-time reconstruction of conditional quantum states from continuous measurement records is a fundamental requirement for quantum feedback control, yet standard stochastic master equation (SME) solvers require exact model specification, known system parameters, and are sensitive to parameter mismatch. While neural sequence models can fit these stochastic dynamics, the unconstrained predictors can violate physicality such as positivity or trace constraints, leading to unstable rollouts and unphysical estimates. We propose a Kraus-structured output layer that converts the hidden representation of a generic sequence backbone into a completely positive trace preserving (CPTP) quantum operation, yielding physically valid state updates by construction. We instantiate this layer across diverse backbones, RNN, GRU, LSTM, TCN, ESN and Mamba; including Neural ODE as a comparative baseline, on stochastic trajectories characterized by parameter drift. Our evaluation reveals distinct trade-offs between gating mechanisms, linear recurrence, and global attention. Across all models, Kraus-LSTM achieves the strongest results, improving state estimation quality by 7% over its unconstrained counterpart while guaranteeing physically valid predictions in non-stationary regimes.
[584] SurvHTE-Bench: A Benchmark for Heterogeneous Treatment Effect Estimation in Survival Analysis
Shahriar Noroozizadeh, Xiaobin Shen, Jeremy C. Weiss, George H. Chen
Main category: cs.LG
TL;DR: SurvHTE-Bench is a comprehensive benchmark for evaluating heterogeneous treatment effect estimation methods in survival analysis with censored outcomes, covering synthetic, semi-synthetic, and real-world datasets.
Details
Motivation: Current evaluation practices for heterogeneous treatment effect (HTE) estimation in survival analysis are fragmented and inconsistent, lacking standardized benchmarks to fairly compare methods under diverse conditions and realistic assumption violations.Method: The authors create SurvHTE-Bench with three components: (1) modular synthetic datasets with known ground truth varying causal assumptions and survival dynamics, (2) semi-synthetic datasets combining real-world covariates with simulated treatments/outcomes, and (3) real-world datasets from a twin study and HIV clinical trial.
Result: The benchmark provides the first rigorous comparison of survival HTE methods across diverse settings, establishing a foundation for fair, reproducible, and extensible evaluation of causal survival methods.
Conclusion: SurvHTE-Bench addresses critical gaps in HTE estimation evaluation for survival analysis and enables systematic assessment of methods under various causal assumptions and realistic conditions.
Abstract: Estimating heterogeneous treatment effects (HTEs) from right-censored survival data is critical in high-stakes applications such as precision medicine and individualized policy-making. Yet, the survival analysis setting poses unique challenges for HTE estimation due to censoring, unobserved counterfactuals, and complex identification assumptions. Despite recent advances, from Causal Survival Forests to survival meta-learners and outcome imputation approaches, evaluation practices remain fragmented and inconsistent. We introduce SurvHTE-Bench, the first comprehensive benchmark for HTE estimation with censored outcomes. The benchmark spans (i) a modular suite of synthetic datasets with known ground truth, systematically varying causal assumptions and survival dynamics, (ii) semi-synthetic datasets that pair real-world covariates with simulated treatments and outcomes, and (iii) real-world datasets from a twin study (with known ground truth) and from an HIV clinical trial. Across synthetic, semi-synthetic, and real-world settings, we provide the first rigorous comparison of survival HTE methods under diverse conditions and realistic assumption violations. SurvHTE-Bench establishes a foundation for fair, reproducible, and extensible evaluation of causal survival methods. The data and code of our benchmark are available at: https://github.com/Shahriarnz14/SurvHTE-Bench .
[585] Cheap Thrills: Effective Amortized Optimization Using Inexpensive Labels
Khai Nguyen, Petros Ellinas, Anvita Bhagavathula, Priya Donti
Main category: cs.LG
TL;DR: A three-stage framework combining supervised pretraining with cheap imperfect labels followed by self-supervised refinement for optimization/simulation surrogates, showing improved performance with reduced offline costs.
Details
Motivation: Existing ML surrogate approaches for optimization/simulation problems face trade-offs: supervised learning needs expensive high-quality labels, while self-supervised learning struggles with difficult optimization landscapes. There's a need for a method that balances these approaches.Method: Three-stage framework: 1) Collect cheap imperfect labels, 2) Supervised pretraining using these labels, 3) Self-supervised refinement to improve performance. Theoretical analysis shows labels only need to place model within basin of attraction.
Result: Empirical validation across nonconvex constrained optimization, power-grid operation, and stiff dynamical systems shows faster convergence, improved accuracy/feasibility/optimality, and up to 59x reductions in total offline cost.
Conclusion: The proposed three-stage strategy effectively balances supervised and self-supervised learning for optimization/simulation surrogates, requiring only modest numbers of inexact labels and training epochs while significantly reducing computational costs.
Abstract: To scale the solution of optimization and simulation problems, prior work has explored machine-learning surrogates that inexpensively map problem parameters to corresponding solutions. Commonly used approaches, including supervised and self-supervised learning with either soft or hard feasibility enforcement, face inherent challenges such as reliance on expensive, high-quality labels or difficult optimization landscapes. To address their trade-offs, we propose a novel framework that first collects “cheap” imperfect labels, then performs supervised pretraining, and finally refines the model through self-supervised learning to improve overall performance. Our theoretical analysis and merit-based criterion show that labeled data need only place the model within a basin of attraction, confirming that only modest numbers of inexact labels and training epochs are required. We empirically validate our simple three-stage strategy across challenging domains, including nonconvex constrained optimization, power-grid operation, and stiff dynamical systems, and show that it yields faster convergence; improved accuracy, feasibility, and optimality; and up to 59x reductions in total offline cost.
[586] Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models
Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, Aditya Grover
Main category: cs.LG
TL;DR: On-Policy Self-Distillation (OPSD) enables a single LLM to act as both teacher and student by conditioning on different contexts, achieving efficient reasoning improvement without separate teacher models.
Details
Motivation: Existing on-policy distillation methods require separate, often larger teacher LLMs and don't leverage available ground-truth solutions in reasoning datasets. The authors propose that a capable LLM can rationalize external reasoning traces and teach its weaker self.Method: OPSD uses a single model as both teacher and student by conditioning on different contexts. The teacher policy conditions on privileged information (verified reasoning traces) while the student policy sees only the question. Training minimizes per-token divergence between these distributions over the student’s own rollouts.
Result: Achieves 8-12x token efficiency compared to reinforcement learning methods like GRPO and superior performance over off-policy distillation methods on multiple mathematical reasoning benchmarks.
Conclusion: OPSD provides an efficient framework for improving LLM reasoning through self-distillation, eliminating the need for separate teacher models while leveraging available privileged information in reasoning datasets.
Abstract: Knowledge distillation improves large language model (LLM) reasoning by compressing the knowledge of a teacher LLM to train smaller LLMs. On-policy distillation advances this approach by having the student sample its own trajectories while a teacher LLM provides dense token-level supervision, addressing the distribution mismatch between training and inference in off-policy distillation methods. However, on-policy distillation typically requires a separate, often larger, teacher LLM and does not explicitly leverage ground-truth solutions available in reasoning datasets. Inspired by the intuition that a sufficiently capable LLM can rationalize external privileged reasoning traces and teach its weaker self (i.e., the version without access to privileged information), we introduce On-Policy Self-Distillation (OPSD), a framework where a single model acts as both teacher and student by conditioning on different contexts. The teacher policy conditions on privileged information (e.g., verified reasoning traces) while the student policy sees only the question; training minimizes the per-token divergence between these distributions over the student’s own rollouts. We demonstrate the efficacy of our method on multiple mathematical reasoning benchmarks, achieving 8-12x token efficiency compared to reinforcement learning methods such as GRPO and superior performance over off-policy distillation methods.
[587] Localized Distributional Robustness in Submodular Multi-Task Subset Selection
Ege C. Kaya, Abolfazl Hashemi
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2404.03759: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2404.03759&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[588] Learning to Cover: Online Learning and Optimization with Irreversible Decisions
Alexandre Jacquillat, Michael Lingzhi Li
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2406.14777: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2406.14777&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[589] Towards a Fairer Non-negative Matrix Factorization
Lara Kassab, Erin George, Deanna Needell, Haowen Geng, Nika Jafar Nia, Aoxi Li
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2411.09847: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.09847&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[590] An Experimental Study on Fairness-aware Machine Learning for Credit Scoring Problems
Huyen Giang Thi Thu, Thang Viet Doan, Ha-Bang Ban, Tai Le Quy
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions due to lack of paper content
Abstract: Failed to fetch summary for 2412.20298: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.20298&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[591] Curse of Dimensionality in Neural Network Optimization
Sanghoon Na, Haizhao Yang
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting). No abstract content available for analysis.
Details
Motivation: Unable to determine motivation as paper content could not be retrieved due to server rate limiting.Method: No method information available due to failed API request.
Result: No results available as the paper summary could not be fetched.
Conclusion: Cannot provide analysis due to technical limitations in accessing the paper content.
Abstract: Failed to fetch summary for 2502.05360: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.05360&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[592] Double Momentum and Error Feedback for Clipping with Fast Rates and Differential Privacy
Rustem Islamov, Samuel Horvath, Aurelien Lucchi, Peter Richtarik, Eduard Gorbunov
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2502.11682: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.11682&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[593] TianQuan-S2S: A Subseasonal-to-Seasonal Global Weather Model via Incorporate Climatology State
Guowen Li, Xintong Liu, Yang Liu, Mengxuan Chen, Shilei Cao, Xuehe Wang, Juepeng Zheng, Jinxiao Zhang, Haoyuan Liang, Lixian Zhang, Jiuke Wang, Meng Jin, Hong Cheng, Haohuan Fu
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to analyze paper due to technical fetching error
Abstract: Failed to fetch summary for 2504.09940: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.09940&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[594] Attribute-Efficient PAC Learning of Sparse Halfspaces with Constant Malicious Noise Rate
Shiwei Zeng, Jie Shen
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2505.21430: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.21430&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[595] Continuous Chain of Thought Enables Parallel Exploration and Reasoning
Halil Alperen Gozeten, M. Emrullah Ildiz, Xuechen Zhang, Hrayr Harutyunyan, Ankit Singh Rawat, Samet Oymak
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to analyze paper due to technical error fetching content
Abstract: Failed to fetch summary for 2505.23648: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.23648&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[596] FPGA-Enabled Machine Learning Applications in Earth Observation: A Systematic Review
Cédric Léonard, Dirk Stober, Martin Schulz
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2506.03938: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.03938&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[597] SPEED-RL: Faster Training of Reasoning Models via Online Curriculum Learning
Ruiqi Zhang, Daman Arora, Song Mei, Andrea Zanette
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2506.09016: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.09016&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[598] From Bandit Regret to FDR Control: Online Selective Generation with Adversarial Feedback Unlocking
Minjae Lee, Yoonjae Jung, Sangdon Park
Main category: cs.LG
TL;DR: Unable to analyze paper 2506.14067 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation due to inability to access paper abstractMethod: Cannot determine method due to inability to access paper abstract
Result: Cannot determine results due to inability to access paper abstract
Conclusion: Cannot draw conclusions due to inability to access paper abstract
Abstract: Failed to fetch summary for 2506.14067: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.14067&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[599] Parameter Stress Analysis in Reinforcement Learning: Applying Synaptic Filtering to Policy Networks
Zain ul Abdeen, Ming Jin
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2506.23036: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.23036&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[600] Some Super-approximation Rates of ReLU Neural Networks for Korobov Functions
Yuwen Li, Guozhi Zhang
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2507.10345: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.10345&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[601] Kernel Based Maximum Entropy Inverse Reinforcement Learning for Mean-Field Games
Berkay Anahtarci, Can Deha Kariksiz, Naci Saldi
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2507.14529: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.14529&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[602] TIC-GRPO: Provable and Efficient Optimization for Reinforcement Learning from Human Feedback
Lei Pang, Jun Luo, Ruinan Jin
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API requestMethod: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to draw conclusions due to failed API request
Abstract: Failed to fetch summary for 2508.02833: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.02833&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[603] Honest and Reliable Evaluation and Expert Equivalence Testing of Automated Neonatal Seizure Detection
Jovana Kljajic, John M. O’Toole, Robert Hogan, Tamara Skoric
Main category: cs.LG
TL;DR: Unable to analyze paper 2508.04899 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract is unavailableMethod: Cannot determine method as abstract is unavailable
Result: Cannot determine results as abstract is unavailable
Conclusion: Cannot draw conclusions as abstract is unavailable
Abstract: Failed to fetch summary for 2508.04899: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.04899&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[604] Multi-Agent Reinforcement Learning in Intelligent Transportation Systems: A Comprehensive Survey
Rexcharles Donatus, Kumater Ter, Daniel Udekwe
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2508.20315: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.20315&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[605] AttnBoost: Retail Supply Chain Sales Insights via Gradient Boosting Perspective
Yadi Liu, Xiaoli Ma, Muxin Ge, Zeyu Han, Jingxi Qiu, Ye Aung Moe, Yilan Shen, Wenbin Wei, Cheng Huang
Main category: cs.LG
TL;DR: Unable to analyze paper 2509.10506 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract is unavailable due to rate limiting errorMethod: Cannot determine method as abstract is unavailable due to rate limiting error
Result: Cannot determine results as abstract is unavailable due to rate limiting error
Conclusion: Cannot draw conclusions as paper content is inaccessible due to HTTP 429 error from arXiv API
Abstract: Failed to fetch summary for 2509.10506: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.10506&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[606] Topology Structure Optimization of Reservoirs Using GLMY Homology
Yu Chen, Shengwei Wang, Hongwei Lin
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2509.11612: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.11612&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[607] TabStruct: Measuring Structural Fidelity of Tabular Data
Xiangjian Jiang, Nikola Simidjievski, Mateja Jamnik
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error in fetching paper contentMethod: Unable to determine method due to technical error in fetching paper content
Result: Unable to determine results due to technical error in fetching paper content
Conclusion: Unable to draw conclusions due to technical error in fetching paper content
Abstract: Failed to fetch summary for 2509.11950: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.11950&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[608] OPPO: Accelerating PPO-based RLHF via Pipeline Overlap
Kaizhuo Yan, Yingjie Yu, Yifan Yu, Haizhong Zheng, Fan Lai
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2509.25762: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.25762&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[609] Non-Asymptotic Analysis of Efficiency in Conformalized Regression
Yunzhen Yao, Lie He, Michael Gastpar
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2510.07093: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.07093&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[610] Do We Really Need Permutations? Impact of Model Width on Linear Mode Connectivity
Akira Ito, Masanori Yamada, Daiki Chijiwa, Atsutoshi Kumagai
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) when querying arXiv API for paper ID 2510.08023
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot determine conclusion as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2510.08023: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.08023&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[611] Breaking and Fixing Defenses Against Control-Flow Hijacking in Multi-Agent Systems
Rishi Jha, Harold Triedman, Justin Wagle, Vitaly Shmatikov
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) - cannot analyze content
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to draw conclusions due to fetch failure
Abstract: Failed to fetch summary for 2510.17276: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.17276&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[612] SPOT: Single-Shot Positioning via Trainable Near-Field Rainbow Beamforming
Yeyue Cai, Jianhua Mo, Meixia Tao
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2511.11391: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.11391&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[613] ReCast: Reliability-aware Codebook Assisted Lightweight Time Series Forecasting
Xiang Ma, Taihua Chen, Pengcheng Wang, Xuemei Li, Caiming Zhang
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2511.11991: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.11991&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[614] A physics-informed U-Net-LSTM network for nonlinear structural response under seismic excitation
Sutirtha Biswas, Kshitij Kumar Yadav
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2511.21276: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.21276&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[615] Measuring Uncertainty Calibration
Kamil Ciosek, Nicolò Felicioni, Sina Ghiassian, Juan Elenter Litwin, Francesco Tonolini, David Gustafsson, Eva Garcia-Martin, Carmen Barcena Gonzalez, Raphaëlle Bertrand-Lalo
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2512.13872: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.13872&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[616] BPE: Behavioral Profiling Ensemble
Yanxin Liu, Yunqi Zhang
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2601.10024: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.10024&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[617] Position: Beyond Model-Centric Prediction – Agentic Time Series Forecasting
Mingyue Cheng, Xiaoyu Tao, Qi Liu, Ze Guo, Enhong Chen
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2602.01776: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.01776&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[618] Beyond the Unit Hypersphere: Embedding Magnitude in Contrastive Learning
Xincan Feng, Taro Watanabe
Main category: cs.LG
TL;DR: Paper 2602.09229: Unable to fetch abstract due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2602.09229: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.09229&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[619] Learn from Your Mistakes: Self-Correcting Masked Diffusion Models
Yair Schiff, Omer Belhasin, Roy Uziel, Guanghan Wang, Marianne Arriola, Gilad Turok, Michael Elad, Volodymyr Kuleshov
Main category: cs.LG
TL;DR: Failed to fetch summary for arXiv ID 2602.11590 due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access restrictions preventing paper retrievalMethod: No method information available due to failed API request
Result: No results available due to access limitations
Conclusion: Cannot provide analysis due to technical limitations in accessing the paper content
Abstract: Failed to fetch summary for 2602.11590: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.11590&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[620] QTabGAN: A Hybrid Quantum-Classical GAN for Tabular Data Synthesis
Subhangi Kumari, Rakesh Achutha, Vignesh Sivaraman
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2602.12704: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.12704&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[621] Out-of-Support Generalisation via Weight-Space Sequence Modelling
Roussel Desmond Nzoyem
Main category: cs.LG
TL;DR: Paper 2602.13550: Could not fetch summary due to HTTP 429 error (rate limiting). No content available for analysis.
Details
Motivation: Unable to determine motivation as paper content is not accessible due to API rate limiting.Method: Unable to determine method as paper content is not accessible due to API rate limiting.
Result: Unable to determine results as paper content is not accessible due to API rate limiting.
Conclusion: Unable to draw conclusions as paper content is not accessible due to API rate limiting.
Abstract: Failed to fetch summary for 2602.13550: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.13550&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[622] cc-Shapley: Measuring Multivariate Feature Importance Needs Causal Context
Jörg Martin, Stefan Haufe
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2602.20396: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20396&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[623] Inverse Reconstruction of Shock Time Series from Shock Response Spectrum Curves using Machine Learning
Adam Watts, Andrew Jeon, Destry Newton, Ryan Bowering
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2603.03229: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.03229&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[624] Zeroth-Order primal-dual Alternating Projection Gradient Algorithms for Nonconvex Minimax Problems with Coupled linear Constraints
Huiling Zhang, Zi Xu, Yuhong Dai
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to draw conclusions due to access error
Abstract: Failed to fetch summary for 2402.03352: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2402.03352&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[625] Generalization Bounds for Markov Algorithms through Entropy Flow Computations
Benjamin Dupuis, Maxime Haddouche, George Deligiannidis, Umut Simsekli
Main category: cs.LG
TL;DR: Unable to analyze paper 2502.07584 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation without access to paper abstractMethod: Cannot determine method without access to paper abstract
Result: Cannot determine results without access to paper abstract
Conclusion: Cannot draw conclusions without access to paper abstract
Abstract: Failed to fetch summary for 2502.07584: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.07584&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[626] Sink equilibria and the attractors of learning in games
Oliver Biggar, Christos Papadimitriou
Main category: cs.LG
TL;DR: Unable to analyze paper 2502.07975 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract retrieval failedMethod: Cannot determine method as abstract retrieval failed
Result: Cannot determine results as abstract retrieval failed
Conclusion: Cannot draw conclusion as abstract retrieval failed
Abstract: Failed to fetch summary for 2502.07975: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.07975&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[627] Differentially Private and Scalable Estimation of the Network Principal Component
Alireza Khayatian, Anil Vullikanti, Aritra Konar
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2505.03858: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.03858&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[628] Variational Formulation of Particle Flow
Yinzhuang Yi, Jorge Cortés, Nikolay Atanasov
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2505.04007: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.04007&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[629] Highly Efficient and Effective LLMs with Multi-Boolean Architectures
Ba-Hien Tran, Van Minh Nguyen
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) when querying arXiv API
Details
Motivation: Unable to determine motivation as paper content could not be retrievedMethod: No method information available due to API rate limiting error
Result: No results available - paper summary fetch failed
Conclusion: Cannot analyze paper due to technical limitations in accessing content
Abstract: Failed to fetch summary for 2505.22811: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.22811&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[630] Learning Physical Systems: Symplectification via Gauge Fixing in Dirac Structures
Aristotelis Papatheodorou, Pranav Vaidhyanathan, Natalia Ares, Ioannis Havoutis
Main category: cs.LG
TL;DR: Unable to analyze paper 2506.18812 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation without access to the paper abstractMethod: Cannot determine method without access to the paper abstract
Result: Cannot determine results without access to the paper abstract
Conclusion: Cannot determine conclusion without access to the paper abstract
Abstract: Failed to fetch summary for 2506.18812: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.18812&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[631] Structured quantum learning via em algorithm for Boltzmann machines
Takeshi Kimura, Kohtaro Kato, Masahito Hayashi
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2507.21569: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.21569&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[632] Dropping Just a Handful of Preferences Can Change Top Large Language Model Rankings
Jenny Y. Huang, Yunyi Shen, Dennis Wei, Tamara Broderick
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: Unable to determine method due to API rate limiting preventing access to paper content
Result: Unable to determine results due to API rate limiting preventing access to paper content
Conclusion: Unable to determine conclusion due to API rate limiting preventing access to paper content
Abstract: Failed to fetch summary for 2508.11847: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.11847&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[633] Quantitative convergence of trained single layer neural networks to Gaussian processes
Eloy Mosig, Andrea Agazzi, Dario Trevisan
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2509.24544: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.24544&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[634] Bayesian Inference for PDE-based Inverse Problems using the Optimization of a Discrete Loss
Lucas Amoudruz, Sergey Litvinov, Costas Papadimitriou, Petros Koumoutsakos
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed summary fetchMethod: Unable to determine method due to failed summary fetch
Result: Unable to determine results due to failed summary fetch
Conclusion: Unable to determine conclusion due to failed summary fetch
Abstract: Failed to fetch summary for 2510.15664: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.15664&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[635] Generalization Below the Edge of Stability: The Role of Data Geometry
Tongtong Liang, Alexander Cloninger, Rahul Parhi, Yu-Xiang Wang
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2510.18120: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.18120&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[636] Testing Most Influential Sets
Lucas Darius Konrad, Nikolas Kuschnig
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: The paper ID 2510.20372 could not be retrieved due to API rate limiting, so the motivation is unknownMethod: Method cannot be determined due to failed data retrieval
Result: No results available due to technical limitations in accessing the paper
Conclusion: Cannot draw conclusions about an inaccessible paper
Abstract: Failed to fetch summary for 2510.20372: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.20372&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[637] Auto-Adaptive PINNs with Applications to Phase Transitions
Kevin Buck, Woojeong Kim
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2510.23999: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.23999&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[638] Symmetric Aggregation of Conformity Scores for Efficient Uncertainty Sets
Nabil Alami, Jad Zakharia, Souhaib Ben Taieb
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2512.06945: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.06945&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[639] Agentic Multi-Persona Framework for Evidence-Aware Fake News Detection
Roopa Bukke, Soumya Pandey, Suraj Kumar, Soumi Chattopadhyay, Chandranath Adak
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2512.21039: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.21039&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[640] Prediction of Cellular Malignancy Using Electrical Impedance Signatures and Supervised Machine Learning
Shadeeb Hossain
Main category: cs.LG
TL;DR: Unable to analyze paper 2601.04478 due to HTTP 429 error when fetching summary from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusions as paper content could not be retrieved
Abstract: Failed to fetch summary for 2601.04478: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.04478&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[641] Improved Convergence Rates of Muon Optimizer for Nonconvex Optimization
Shuntaro Nagashima, Hideaki Iiduka
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2601.19400: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.19400&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[642] Latent-IMH: Efficient Bayesian Inference for Inverse Problems with Approximate Operators
Youguang Chen, George Biros
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2601.20888: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.20888&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[643] Optimal training-conditional regret for online conformal prediction
Jiadong Liang, Zhimei Ren, Yuxin Chen
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.16537: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.16537&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[644] Lap2: Revisiting Laplace DP-SGD for High Dimensions via Majorization Theory
Meisam Mohammady, Qin Yang, Nicholas Stout, Ayesha Samreen, Han Wang, Christopher J Quinn, Yuan Hong
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.23516: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23516&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[645] Inference-time optimization for experiment-grounded protein ensemble generation
Advaith Maddipatla, Anar Rzayev, Marco Pegoraro, Martin Pacesa, Paul Schanda, Ailie Marx, Sanketh Vedula, Alex M. Bronstein
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to data fetch failureMethod: Unable to determine method due to data fetch failure
Result: Unable to determine results due to data fetch failure
Conclusion: Unable to determine conclusion due to data fetch failure
Abstract: Failed to fetch summary for 2602.24007: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.24007&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[646] Conformal Graph Prediction with Z-Gromov Wasserstein Distances
Gabriel Melo, Thibaut de Saivre, Anna Calissano, Florence d’Alché-Buc
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2603.02460 could not be retrieved from arXiv API.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot draw conclusions without access to the paper content.
Abstract: Failed to fetch summary for 2603.02460: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.02460&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[647] stratum: A System Infrastructure for Massive Agent-Centric ML Workloads
Arnab Phani, Elias Strauss, Sebastian Schelter
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to failed paper fetchMethod: Cannot determine method due to failed paper fetch
Result: Cannot determine results due to failed paper fetch
Conclusion: Cannot determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2603.03589: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.03589&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[648] LoRA-MME: Multi-Model Ensemble of LoRA-Tuned Encoders for Code Comment Classification
Md Akib Haider, Ahsan Bulbul, Nafis Fuad Shahid, Aimaan Ahmed, Mohammad Ishrak Abedin
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusion without access to paper content
Abstract: Failed to fetch summary for 2603.03959: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.03959&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.MA
[649] From Spark to Fire: Modeling and Mitigating Error Cascades in LLM-Based Multi-Agent Collaboration
Yizhe Xie, Congcong Zhu, Xinyue Zhang, Tianqing Zhu, Dayong Ye, Minfeng Qi, Huajie Chen, Wanlei Zhou
Main category: cs.MA
TL;DR: Proposes a propagation dynamics model for LLM-based multi-agent systems to detect and mitigate error amplification through collaboration, with a genealogy-graph governance layer that prevents minor errors from solidifying into system-wide false consensus.
Details
Motivation: LLM-based multi-agent systems face risks where minor inaccuracies can amplify into system-level false consensus through iterative collaboration, but existing protections either rely on single-agent validation or require architecture modifications that disrupt natural collaboration flows.Method: Develops a propagation dynamics model that abstracts LLM-MAS collaboration as a directed dependency graph with early-stage risk criteria. Identifies three vulnerability classes through experiments on six frameworks. Proposes a genealogy-graph-based governance layer implemented as a message-layer plugin that suppresses error amplification without altering collaboration architecture.
Result: Experiments show the approach raises defense success rate from baseline 0.32 to over 0.89 and significantly mitigates cascading spread of minor errors. Identifies three vulnerability classes: cascade amplification, topological sensitivity, and consensus inertia. Demonstrates attacks where single atomic error seeds lead to widespread failure.
Conclusion: The proposed genealogy-graph governance layer effectively addresses error propagation in LLM-MAS without disrupting collaboration architecture, providing a practical solution to prevent minor errors from amplifying into system-wide false consensus.
Abstract: Large Language Model-based Multi-Agent Systems (LLM-MAS) are increasingly applied to complex collaborative scenarios. However, their collaborative mechanisms may cause minor inaccuracies to gradually solidify into system-level false consensus through iteration. Such risks are difficult to trace since errors can propagate and amplify through message dependencies. Existing protections often rely on single-agent validation or require modifications to the collaboration architecture, which can weaken effective information flow and may not align with natural collaboration processes in real tasks. To address this, we propose a propagation dynamics model tailored for LLM-MAS that abstracts collaboration as a directed dependency graph and provides an early-stage risk criterion to characterize amplification risk. Through experiments on six mainstream frameworks, we identify three vulnerability classes: cascade amplification, topological sensitivity, and consensus inertia. We further instantiate an attack where injecting just a single atomic error seed leads to widespread failure. In response, we introduce a genealogy-graph-based governance layer, implemented as a message-layer plugin, that suppresses both endogenous and exogenous error amplification without altering the collaboration architecture. Experiments show that this approach raises the defense success rate from a baseline of 0.32 to over 0.89 and significantly mitigates the cascading spread of minor errors.
[650] Strategic Interactions in Multi-Level Stackelberg Games with Non-Follower Agents and Heterogeneous Leaders
Niloofar Aminikalibar, Farzaneh Farhadi, Maria Chli
Main category: cs.MA
TL;DR: A three-level Stackelberg game framework that incorporates non-follower agents in congestion-coupled systems, applied to EV charging infrastructure to show how accounting for non-followers alters strategic incentives and equilibrium outcomes.
Details
Motivation: Existing Stackelberg game models for congested systems ignore non-follower agents who don't directly participate in market competition but still contribute to and adapt to congestion, leading to systematically distorted equilibrium predictions.Method: Introduces a three-level Stackelberg framework with heterogeneous leaders (differing in decision horizons and feasible actions), strategic followers, and non-follower agents that captures bidirectional coupling between infrastructure decisions, competition, and equilibrium congestion.
Result: The model, instantiated in EV charging infrastructure context, shows how explicitly accounting for non-followers and heterogeneous competitors qualitatively alters strategic incentives and equilibrium outcomes beyond what traditional models predict.
Conclusion: The framework addresses a key limitation in congestion-coupled market modeling and applies broadly to multi-agent systems in mobility, energy, and computing markets where non-participant agents affect congestion patterns.
Abstract: Strategic interaction in congested systems is commonly modelled using Stackelberg games, where competing leaders anticipate the behaviour of self-interested followers. A key limitation of existing models is that they typically ignore agents who do not directly participate in market competition, yet both contribute to and adapt to congestion. Although such non-follower agents do not generate revenue or respond to market incentives, their behaviour reshapes congestion patterns, which in turn affects the decisions of leaders and followers through shared resources. We argue that overlooking non-followers leads to systematically distorted equilibrium predictions in congestion-coupled markets. To address this, we introduce a three-level Stackelberg framework with heterogeneous leaders differing in decision horizons and feasible actions, strategic followers, and non-follower agents that captures bidirectional coupling between infrastructure decisions, competition, and equilibrium congestion. We instantiate the framework in the context of electric vehicle (EV) charging infrastructure, where charging providers compete with rivals, while EV and non-EV traffic jointly shape congestion. The model illustrates how explicitly accounting for non-followers and heterogeneous competitors qualitatively alters strategic incentives and equilibrium outcomes. Beyond EV charging, the framework applies to a broad class of congestion-coupled multi-agent systems in mobility, energy, and computing markets.
[651] SCoUT: Scalable Communication via Utility-Guided Temporal Grouping in Multi-Agent Reinforcement Learning
Manav Vora, Gokul Puthumanaillam, Hiroyasu Tsukamoto, Melkior Ornik
Main category: cs.MA
TL;DR: SCoUT is a scalable communication method for multi-agent RL that uses temporal grouping and utility-guided recipient selection to enable efficient communication learning with precise credit assignment.
Details
Motivation: Communication can improve coordination in partially observed multi-agent RL, but learning when and who to communicate with is challenging due to the large number of possible sender-recipient pairs and difficulty isolating the effect of individual messages on future rewards.Method: SCoUT uses temporal abstraction with macro-steps, resampling soft agent groups via Gumbel-Softmax to create latent clusters that serve as differentiable priors over recipients. It employs a group-aware critic that predicts values for each agent group and maps them to per-agent baselines. Each agent has a three-headed policy for environment actions, send decisions, and recipient selection. Counterfactual communication advantages are derived by analytically removing each sender’s contribution from recipient messages for precise credit assignment.
Result: The method enables decentralized execution while maintaining centralized training components, providing scalable communication learning with improved coordination in multi-agent settings.
Conclusion: SCoUT addresses scalability challenges in multi-agent communication learning through temporal grouping and utility-guided recipient selection, enabling efficient communication with precise credit assignment while preserving decentralized execution.
Abstract: Communication can improve coordination in partially observed multi-agent reinforcement learning (MARL), but learning \emph{when} and \emph{who} to communicate with requires choosing among many possible sender-recipient pairs, and the effect of any single message on future reward is hard to isolate. We introduce \textbf{SCoUT} (\textbf{S}calable \textbf{Co}mmunication via \textbf{U}tility-guided \textbf{T}emporal grouping), which addresses both these challenges via temporal and agent abstraction within traditional MARL. During training, SCoUT resamples \textit{soft} agent groups every (K) environment steps (macro-steps) via Gumbel-Softmax; these groups are latent clusters that induce an affinity used as a differentiable prior over recipients. Using the same assignments, a group-aware critic predicts values for each agent group and maps them to per-agent baselines through the same soft assignments, reducing critic complexity and variance. Each agent is trained with a three-headed policy: environment action, send decision, and recipient selection. To obtain precise communication learning signals, we derive counterfactual communication advantages by analytically removing each sender’s contribution from the recipient’s aggregated messages. This counterfactual computation enables precise credit assignment for both send and recipient-selection decisions. At execution time, all centralized training components are discarded and only the per-agent policy is run, preserving decentralized execution. Project website, videos and code: \hyperlink{https://scout-comm.github.io/}{https://scout-comm.github.io/}
[652] Real-Time BDI Agents: a model and its implementation
Andrea Traldi, Francesco Bruschetti, Marco Robol, Davide Calvaresi, Marco Roveri, Paolo Giorgini
Main category: cs.MA
TL;DR: Real-time BDI agent model with temporal constraints for responsive autonomous systems
Details
Motivation: BDI models are effective for autonomous applications but lack explicit time representation, causing delays and unresponsiveness in real-time scenarios when systems get overloaded.Method: Redefine BDI agent control loop using established real-time systems algorithms, propose real-time management of goals, plans, and actions with respect to time constraints and resource availability.
Result: Implemented the model for a resource-collection video-game and validated the approach against significant scenarios.
Conclusion: The proposed real-time BDI model ensures proper agent reaction and effective application in real-time domains by addressing temporal constraints.
Abstract: The BDI model proved to be effective for developing applications requiring high-levels of autonomy and to deal with the complexity and unpredictability of real-world scenarios. The model, however, has significant limitations in reacting and handling contingencies within the given real-time constraints. Without an explicit representation of time, existing real-time BDI implementations overlook the temporal implications during the agent’s decision process that may result in delays or unresponsiveness of the system when it gets overloaded. In this paper, we redefine the BDI agent control loop inspired by well established algorithms for real-time systems to ensure a proper reaction of agents and their effective application in typical real-time domains. Our model proposes an effective real-time management of goals, plans, and actions with respect to time constraints and resources availability. We propose an implementation of the model for a resource-collection video-game and we validate the approach against a set of significant scenarios.
[653] Conflict-Based Search as a Protocol: A Multi-Agent Motion Planning Protocol for Heterogeneous Agents, Solvers, and Independent Tasks
Rishi Veerapaneni, Alvin Tang, Haodong He, Sophia Zhao, Viraj Shah, Yidai Cen, Ziteng Ji, Gabriel Olin, Jon Arrizabalaga, Yorai Shaoul, Jiaoyang Li, Maxim Likhachev
Main category: cs.MA
TL;DR: CBS Protocol enables multi-agent motion planning for heterogeneous robots with different planning algorithms by using Conflict-Based Search as a coordination framework.
Details
Motivation: Enable diverse robots from different manufacturers with independent motion planning systems to effectively move in shared environments without requiring standardization of their internal algorithms.Method: Uses Conflict-Based Search (CBS) as a protocol that requires only a specific single-agent motion planning API (finding collision-free paths with space-time constraints). A central planner coordinates heterogeneous agents regardless of their internal planning implementations.
Result: Demonstrated multi-agent motion planning for heterogeneous teams using various single-agent planners including Heuristic Search (A*), Sampling Based Search (RRT), Optimization (Direct Collocation), Diffusion, and Reinforcement Learning.
Conclusion: CBS Protocol provides a practical solution for coordinating algorithmically heterogeneous robots in shared environments by abstracting away implementation details through a standardized API.
Abstract: Imagine the future construction site, hospital, or office with dozens of robots bought from different manufacturers. How can we enable these different robots to effectively move in a shared environment, given that each robot may have its own independent motion planning system? This work shows how we can get efficient collision-free movements between algorithmically heterogeneous agents by using Conflict-Based Search (Sharon et al. 2015) as a protocol. At its core, the CBS Protocol requires one specific single-agent motion planning API; finding a collision-free path that satisfies certain space-time constraints. Given such an API, CBS uses a central planner to find collision-free paths - independent of how the API is implemented. We demonstrate how this protocol enables multi-agent motion planning for a heterogeneous team of agents completing independent tasks with a variety of single-agent planners including: Heuristic Search (e.g., A*), Sampling Based Search (e.g., RRT), Optimization (e.g., Direct Collocation), Diffusion, and Reinforcement Learning.
cs.MM
[654] SarcasmMiner: A Dual-Track Post-Training Framework for Robust Audio-Visual Sarcasm Reasoning
Zhu Li, Yongjian Chen, Huiyuan Lai, Xiyuan Gao, Shekhar Nayak, Matt Coler
Main category: cs.MM
TL;DR: SarcasmMiner: Reinforcement learning framework for multimodal sarcasm detection that uses structured reasoning and dual-track distillation to resist hallucination in foundation models.
Details
Motivation: Multimodal sarcasm detection requires resolving pragmatic incongruity across textual, acoustic, and visual cues through cross-modal reasoning. Existing foundation models often suffer from hallucination in multimodal reasoning, necessitating a framework that can resist such issues while enabling robust sarcasm detection.Method: Proposes SarcasmMiner, a reinforcement learning based post-training framework that reformulates sarcasm detection as structured reasoning. Uses dual-track distillation: high-quality teacher trajectories initialize the student model, while full trajectories train a generative reward model (GenRM) to evaluate reasoning quality. Optimizes student with group relative policy optimization (GRPO) using decoupled rewards for accuracy and reasoning quality.
Result: On MUStARD++ dataset, SarcasmMiner increases F1 from 59.83% (zero-shot), 68.23% (supervised finetuning) to 70.22%. Demonstrates that reasoning-aware reward modeling enhances both performance and multimodal grounding.
Conclusion: The proposed reinforcement learning framework effectively addresses hallucination in multimodal reasoning for sarcasm detection, showing that structured reasoning with reasoning-aware reward modeling improves both accuracy and multimodal grounding in foundation models.
Abstract: Multimodal sarcasm detection requires resolving pragmatic incongruity across textual, acoustic, and visual cues through cross-modal reasoning. To enable robust sarcasm reasoning with foundation models, we propose SarcasmMiner, a reinforcement learning based post-training framework that resists hallucination in multimodal reasoning. We reformulate sarcasm detection as structured reasoning and adopt a dual-track distillation strategy: high-quality teacher trajectories initialize the student model, while the full set of trajectories trains a generative reward model (GenRM) to evaluate reasoning quality. The student is optimized with group relative policy optimization (GRPO) using decoupled rewards for accuracy and reasoning quality. On MUStARD++, SarcasmMiner increases F1 from 59.83% (zero-shot), 68.23% (supervised finetuning) to 70.22%. These findings suggest that reasoning-aware reward modeling enhances both performance and multimodal grounding.
eess.AS
[655] Temporal Pooling Strategies for Training-Free Anomalous Sound Detection with Self-Supervised Audio Embeddings
Kevin Wilkinghoff, Sarthak Yadav, Zheng-Hua Tan
Main category: eess.AS
TL;DR: Systematic evaluation of temporal pooling strategies for training-free anomalous sound detection using pre-trained audio embeddings, proposing relative deviation pooling and hybrid methods that outperform mean pooling and achieve state-of-the-art results.
Details
Motivation: Existing training-free anomalous sound detection methods rely almost exclusively on temporal mean pooling for pre-trained audio embeddings, while alternative pooling strategies have only been explored for spectrogram-based representations. The role of temporal pooling in training-free ASD with pre-trained embeddings remains insufficiently understood.Method: Proposes relative deviation pooling (RDP), an adaptive pooling method that emphasizes informative temporal deviations, and introduces a hybrid pooling strategy that combines RDP with generalized mean pooling. Conducts systematic evaluation across multiple state-of-the-art audio embedding models.
Result: Experiments on five benchmark datasets show the proposed methods consistently outperform mean pooling and achieve state-of-the-art performance for training-free ASD. Results surpass all previously reported trained systems and ensembles on the DCASE2025 ASD dataset.
Conclusion: Temporal pooling strategies significantly impact training-free anomalous sound detection performance with pre-trained audio embeddings. The proposed relative deviation pooling and hybrid methods offer superior alternatives to conventional mean pooling.
Abstract: Training-free anomalous sound detection (ASD) based on pre-trained audio embedding models has recently garnered significant attention, as it enables the detection of anomalous sounds using only normal reference data while offering improved robustness under domain shifts. However, existing embedding-based approaches almost exclusively rely on temporal mean pooling, while alternative pooling strategies have so far only been explored for spectrogram-based representations. Consequently, the role of temporal pooling in training-free ASD with pre-trained embeddings remains insufficiently understood. In this paper, we present a systematic evaluation of temporal pooling strategies across multiple state-of-the-art audio embedding models. We propose relative deviation pooling (RDP), an adaptive pooling method that emphasizes informative temporal deviations, and introduce a hybrid pooling strategy that combines RDP with generalized mean pooling. Experiments on five benchmark datasets demonstrate that the proposed methods consistently outperform mean pooling and achieve state-of-the-art performance for training-free ASD, including results that surpass all previously reported trained systems and ensembles on the DCASE2025 ASD dataset.
[656] An Approach to Simultaneous Acquisition of Real-Time MRI Video, EEG, and Surface EMG for Articulatory, Brain, and Muscle Activity During Speech Production
Jihwan Lee, Parsa Razmara, Kevin Huang, Sean Foley, Aditya Kommineni, Haley Hsu, Woojae Jeong, Prakash Kumar, Xuan Shi, Yoonjeong Lee, Tiantian Feng, Takfarinas Medani, Ye Tian, Sudarsana Reddy Kadiri, Krishna S. Nayak, Dani Byrd, Louis Goldstein, Richard M. Leahy, Shrikanth Narayanan
Main category: eess.AS
TL;DR: First simultaneous acquisition of real-time MRI, EEG, and surface EMG for speech production research, with novel artifact suppression pipeline for this tri-modal setting.
Details
Motivation: Speech production involves complex neural planning, motor control, and articulatory processes, but acoustic signals alone don't reveal the underlying neurophysiological substrates. There's a need to capture multiple aspects of the speech production chain simultaneously to understand causal relationships.Method: Developed first simultaneous acquisition of real-time (dynamic) MRI, EEG, and surface EMG during speech production. Created an artifact suppression pipeline specifically tailored to handle MRI-induced electromagnetic interference and myogenic artifacts in this tri-modal setting.
Result: Successfully captured brain signals (EEG), muscle activations (EMG), and articulatory movements (MRI) simultaneously during speech production. The artifact suppression pipeline effectively mitigates technical challenges of multimodal acquisition.
Conclusion: This framework provides unprecedented multimodal data for speech neuroscience research and has potential applications in brain-computer interface development by offering comprehensive insights into the speech production chain.
Abstract: Speech production is a complex process spanning neural planning, motor control, muscle activation, and articulatory kinematics. While the acoustic speech signal is the most accessible product of the speech production act, it does not directly reveal its causal neurophysiological substrates. We present the first simultaneous acquisition of real-time (dynamic) MRI, EEG, and surface EMG, capturing several key aspects of the speech production chain: brain signals, muscle activations, and articulatory movements. This multimodal acquisition paradigm presents substantial technical challenges, including MRI-induced electromagnetic interference and myogenic artifacts. To mitigate these, we introduce an artifact suppression pipeline tailored to this tri-modal setting. Once fully developed, this framework is poised to offer an unprecedented window into speech neuroscience and insights leading to brain-computer interface advances.
[657] Voice Timbre Attribute Detection with Compact and Interpretable Training-Free Acoustic Parameters
Aemon Yat Fei Chiu, Yujia Xiao, Qiuqiang Kong, Tan Lee
Main category: eess.AS
TL;DR: A compact acoustic parameter set for voice timbre attribute detection that outperforms conventional features and supervised DNN embeddings while requiring no trainable parameters and offering explicit interpretability.
Details
Motivation: Voice timbre is crucial but complex in speech perception. Current DNN embeddings work well for speaker modeling but are black-box representations with limited interpretability and high computational cost.Method: Investigates a compact acoustic parameter set that captures important acoustic measures and their temporal dynamics for voice timbre attribute detection (vTAD). The set requires no trainable parameters.
Result: The acoustic parameter set is competitive, outperforming conventional cepstral features and supervised DNN embeddings, and approaching state-of-the-art self-supervised models.
Conclusion: The proposed compact acoustic parameter set offers an effective, interpretable, and computationally efficient alternative to DNN-based approaches for voice timbre attribute detection.
Abstract: Voice timbre attribute detection (vTAD) is the task of determining the relative intensity of timbre attributes between speech utterances. Voice timbre is a crucial yet inherently complex component of speech perception. While deep neural network (DNN) embeddings perform well in speaker modelling, they often act as black-box representations with limited physical interpretability and high computational cost. In this work, a compact acoustic parameter set is investigated for vTAD. The set captures important acoustic measures and their temporal dynamics which are found to be crucial in the task. Despite its simplicity, the acoustic parameter set is competitive, outperforming conventional cepstral features and supervised DNN embeddings, and approaching state-of-the-art self-supervised models. Importantly, the studied set require no trainable parameters, incur negligible computation, and offer explicit interpretability for analysing physical traits behind human timbre perception.
[658] PolyBench: A Benchmark for Compositional Reasoning in Polyphonic Audio
Yuanjian Chen, Yang Xiao, Han Yin, Xubo Liu, Jinjie Huang, Ting Dang
Main category: eess.AS
TL;DR: PolyBench is a new benchmark for evaluating compositional reasoning in polyphonic audio, testing LALMs on multiple concurrent sound events and their relations.
Details
Motivation: Current LALMs show limited capability in reasoning over polyphonic audio where multiple sound events co-occur and create compositional structure, and existing benchmarks don't adequately cover this aspect.Method: Introduces PolyBench with five evaluation subsets: counting, classification, detection, concurrency, and duration estimation, requiring reasoning over multiple concurrent events and their relations.
Result: Evaluation of state-of-the-art LALMs reveals consistent performance degradation in polyphonic audio, indicating a fundamental bottleneck in current models.
Conclusion: PolyBench identifies a critical weakness in current LALMs for compositional reasoning in polyphonic audio, providing a benchmark to drive future improvements in audio understanding.
Abstract: Large Audio Language Models (LALMs) are increasingly capable of reasoning over audio. However, existing benchmarks provide limited coverage of reasoning in polyphonic audio, where multiple sound events co-occur and induce compositional structure. In this work, we introduce PolyBench, a benchmark designed to evaluate compositional reasoning in polyphonic audio. PolyBench comprises five evaluation subsets covering counting, classification, detection, concurrency, and duration estimation, requiring reasoning over multiple concurrent events and their relations. Evaluation of state-of-the-art LALMs reveals consistent performance degradation in polyphonic audio, indicating a fundamental bottleneck in current LALMs.
[659] BabAR: from phoneme recognition to developmental measures of young children’s speech production
Marvin Lavechin, Elika Bergelson, Roger Levy
Main category: eess.AS
TL;DR: BabAR is a cross-linguistic phoneme recognition system for child speech trained on TinyVox corpus of 500K+ transcribed child vocalizations across 5 languages, using multilingual pretraining and audio context to improve performance.
Details
Motivation: Automatic phoneme recognition for young children's speech remains largely unsolved, hindering large-scale study of early speech development. Existing tools are inadequate for child speech analysis across multiple languages.Method: Curated TinyVox corpus with 500K+ phonetically transcribed child vocalizations in English, French, Portuguese, German, and Spanish. Trained BabAR system using multilingual pretraining on child-centered daylong recordings and fine-tuning with 20 seconds of surrounding audio context.
Result: Multilingual pretraining substantially outperforms alternatives, and audio context during fine-tuning further improves performance. Error analysis shows substitutions within same phonetic categories, suitable for coarse-grained developmental analysis. Automatic measures align with developmental estimates from literature.
Conclusion: BabAR provides an effective automatic phoneme recognition system for child speech across multiple languages, enabling large-scale study of early speech development with validated developmental alignment.
Abstract: Studying early speech development at scale requires automatic tools, yet automatic phoneme recognition, especially for young children, remains largely unsolved. Building on decades of data collection, we curate TinyVox, a corpus of more than half a million phonetically transcribed child vocalizations in English, French, Portuguese, German, and Spanish. We use TinyVox to train BabAR, a cross-linguistic phoneme recognition system for child speech. We find that pretraining the system on multilingual child-centered daylong recordings substantially outperforms alternatives, and that providing 20 seconds of surrounding audio context during fine-tuning further improves performance. Error analyses show that substitutions predominantly fall within the same broad phonetic categories, suggesting suitability for coarse-grained developmental analyses. We validate BabAR by showing that its automatic measures of speech maturity align with developmental estimates from the literature.
[660] Visual-Informed Speech Enhancement Using Attention-Based Beamforming
Chihyun Liu, Jiaxuan Fan, Mingtung Sun, Michael Anthony, Mingsian R. Bai, Yu Tsao
Main category: eess.AS
TL;DR: VI-NBFNet integrates visual lip movement features from a pretrained visual speech recognition model with microphone array processing for improved speech enhancement in challenging scenarios with moving speakers, overlapping speech, and low SNR conditions.
Details
Motivation: Single-channel speech enhancement methods perform poorly in challenging conditions like low SNR, high reverberation, dynamic speakers, overlapping speech, and non-stationary noise. Existing methods using auxiliary information (speaker voiceprint or visual cues) need improvement for complex real-world scenarios.Method: Proposes Visual-Informed Neural Beamforming Network (VI-NBFNet) that combines microphone array signal processing with deep neural networks using multimodal features. Uses pretrained visual speech recognition model to extract lip movements for voice activity detection and target speaker identification. Features an end-to-end beamforming framework with attention mechanism to handle both static and moving speakers.
Result: The audiovisual system achieved better speech enhancement performance and robustness for both stationary and dynamic speaker scenarios compared to several baseline methods.
Conclusion: Integrating visual information (lip movements) with microphone array processing through neural beamforming significantly improves speech enhancement in challenging real-world conditions with moving speakers and complex acoustic environments.
Abstract: Recent studies have demonstrated that incorporating auxiliary information, such as speaker voiceprint or visual cues, can substantially improve Speech Enhancement (SE) performance. However, single-channel methods often yield suboptimal results in low signal-to-noise ratio (SNR) conditions, when there is high reverberation, or in complex scenarios involving dynamic speakers, overlapping speech, or non-stationary noise. To address these issues, we propose a novel Visual-Informed Neural Beamforming Network (VI-NBFNet), which integrates microphone array signal processing and deep neural networks (DNNs) using multimodal input features. The proposed network leverages a pretrained visual speech recognition model to extract lip movements as input features, which serve for voice activity detection (VAD) and target speaker identification. The system is intended to handle both static and moving speakers by introducing a supervised end-to-end beamforming framework equipped with an attention mechanism. The experimental results demonstrated that the proposed audiovisual system has achieved better SE performance and robustness for both stationary and dynamic speaker scenarios, compared to several baseline methods.
[661] A Large-Scale Probing Analysis of Speaker-Specific Attributes in Self-Supervised Speech Representations
Aemon Yat Fei Chiu, Kei Ching Fung, Roger Tsz Yeung Li, Jingyu Li, Tan Lee
Main category: eess.AS
TL;DR: Large-scale probing analysis of 11 speech SSL models reveals how they encode speaker identity across layers, challenging conventional understanding of layer specialization and showing larger models recover speaker identity in deep layers.
Details
Motivation: To enhance explainability in speech self-supervised learning (SSL) for developing more reliable SSL-based speech processing systems by understanding how these models encode speaker-specific information.Method: Conducted large-scale probing analysis of 11 speech SSL models, decomposing speaker identity into acoustic, prosodic, and paralinguistic attributes across different model layers.
Result: Found a general hierarchy: initial layers encode fundamental acoustics, middle layers synthesize abstract traits. Challenged consensus that final layers purely abstract linguistic content - discovered larger models unexpectedly recover speaker identity in deep layers. Intermediate representations capture dynamic prosody better than specialized speaker embeddings.
Conclusion: The study decodes complex internal mechanics of SSL models, providing guidelines for selecting interpretable and task-optimal representations, with implications for speech processing system design.
Abstract: Enhancing explainability in speech self-supervised learning (SSL) is important for developing reliable SSL-based speech processing systems. This study probes how speech SSL models encode speaker-specific information via a large-scale probing analysis of 11 models, decomposing identity into acoustic, prosodic, and paralinguistic attributes. The results confirm a general hierarchy wherein initial layers encode fundamental acoustics and middle layers synthesise abstract traits. Crucially, the consensus that final layers purely abstract linguistic content is challenged. It is discovered that larger models unexpectedly recover speaker identity in their deep layers. Furthermore, the intermediate representations of speech SSL models are found to capture dynamic prosody better than specialised speaker embeddings. These insights decode the complex internal mechanics of SSL models, providing guidelines for selecting interpretable and task-optimal representations.
[662] BabyHuBERT: Multilingual Self-Supervised Learning for Segmenting Speakers in Child-Centered Long-Form Recordings
Théo Charlot, Tarek Kunze, Maxime Poli, Alejandrina Cristia, Emmanuel Dupoux, Marvin Lavechin
Main category: eess.AS
TL;DR: BabyHuBERT is a self-supervised speech model trained on 13,000 hours of multilingual child-centered recordings that outperforms existing models on voice type classification tasks for analyzing child language development.
Details
Motivation: Existing speech models trained on clean adult data perform poorly on child-centered recordings due to acoustic and linguistic differences, creating a need for specialized models to study early language development.Method: Trained BabyHuBERT on 13,000 hours of multilingual child-centered recordings spanning 40+ languages using self-supervised learning, then evaluated on voice type classification tasks to distinguish target children from various adult and child voices.
Result: BabyHuBERT-VTC achieves F1-scores from 52.1% to 74.4% across six corpora, consistently outperforming W2V2-LL4300 and HuBERT, with notable gains of 13.2 and 15.9 absolute F1 points over HuBERT on Vanuatu and Solomon Islands datasets.
Conclusion: BabyHuBERT effectively addresses the limitations of adult-trained models for child-centered recordings and demonstrates strong performance across diverse linguistic contexts, supporting research in early language development.
Abstract: Child-centered daylong recordings are essential for studying early language development, but existing speech models trained on clean adult data perform poorly due to acoustic and linguistic differences. We introduce BabyHuBERT, a self-supervised speech model trained on 13,000 hours of multilingual child-centered recordings spanning 40+ languages. Evaluated on voice type classification – distinguishing target children from female adults, male adults, and other children, a key preprocessing step for analyzing naturalistic language experiences – BabyHuBERT-VTC achieves F1-scores from 52.1% to 74.4% across six corpora, consistently outperforming W2V2-LL4300 (English daylongs) and HuBERT (clean adult speech). Notable gains include 13.2 and 15.9 absolute F1 points over HuBERT on Vanuatu and Solomon Islands, demonstrating effectiveness on underrepresented languages. We share code and model to support researchers working with child-centered recordings across diverse linguistic contexts.
[663] VoxKnesset: A Large-Scale Longitudinal Hebrew Speech Dataset for Aging Speaker Modeling
Yanir Marmor, Arad Zulti, David Krongauz, Adam Gabet, Yoad Snapir, Yair Lifshitz, Eran Segal
Main category: eess.AS
TL;DR: VoxKnesset: A longitudinal Hebrew parliamentary speech dataset spanning 15 years for studying voice aging effects on speech processing systems.
Details
Motivation: Speech systems struggle with voice changes over time due to aging, but existing datasets lack longitudinal coverage needed to study these effects systematically.Method: Created VoxKnesset dataset with ~2,300 hours of Hebrew parliamentary speech (2009-2025) from 393 speakers, with aligned transcripts and demographic metadata. Benchmarked modern speech embeddings (WavLM-Large, ECAPA-TDNN, Wav2Vec2-XLSR-1B) on age prediction and speaker verification tasks under longitudinal conditions.
Result: Speaker verification EER increased from 2.15% to 4.58% over 15 years for best model. Cross-sectionally trained age regressors failed to capture within-speaker aging, while longitudinally trained models recovered meaningful temporal aging signals.
Conclusion: Longitudinal datasets are crucial for developing aging-robust speech systems. The publicly released VoxKnesset dataset supports research on voice aging and Hebrew speech processing.
Abstract: Speech processing systems face a fundamental challenge: the human voice changes with age, yet few datasets support rigorous longitudinal evaluation. We introduce VoxKnesset, an open-access dataset of ~2,300 hours of Hebrew parliamentary speech spanning 2009-2025, comprising 393 speakers with recording spans of up to 15 years. Each segment includes aligned transcripts and verified demographic metadata from official parliamentary records. We benchmark modern speech embeddings (WavLM-Large, ECAPA-TDNN, Wav2Vec2-XLSR-1B) on age prediction and speaker verification under longitudinal conditions. Speaker verification EER rises from 2.15% to 4.58% over 15 years for the strongest model, and cross-sectionally trained age regressors fail to capture within-speaker aging, while longitudinally trained models recover a meaningful temporal signal. We publicly release the dataset and pipeline to support aging-robust speech systems and Hebrew speech processing.
[664] Benchmarking Speech Systems for Frontline Health Conversations: The DISPLACE-M Challenge
Dhanya E, Ankita Meena, Manas Nanivadekar, Noumida A, Victor Azad, Ashwini Nagaraj Shenoy, Pratik Roy Chowdhuri, Shobhit Banga, Vanshika Chhabra, Chitralekha Bhat, Shareef babu Kalluri, Srikanth Raj Chetupalli, Deepu Vijayasenan, Sriram Ganapathy
Main category: eess.AS
TL;DR: DISPLACE-M challenge introduces a conversational AI benchmark for medical dialogues with multi-speaker interactions, featuring 55 hours of audio data and baseline systems for 4 tasks: speaker diarization, ASR, topic identification, and dialogue summarization.
Details
Motivation: To create a benchmark for understanding goal-oriented, real-world medical dialogues, addressing challenges of multi-speaker interactions between health workers and care seekers with spontaneous, noisy, and overlapping speech in conversational AI.Method: Released a medical conversational dataset (40h development + 15h blind evaluation), provided baseline systems across 4 tasks, and evaluated using DER, tcpWER, and ROUGE-L metrics for Phase-I evaluation.
Result: Established a benchmark with baseline performance metrics across diarization, speech recognition, topic identification, and summarization tasks for medical conversational AI systems.
Conclusion: DISPLACE-M provides a comprehensive benchmark for conversational AI in medical settings, addressing real-world challenges of noisy, overlapping speech in multi-speaker healthcare dialogues.
Abstract: The DIarization and Speech Processing for LAnguage understanding in Conversational Environments - Medical (DISPLACE-M) challenge introduces a conversational AI benchmark for understanding goal-oriented, real-world medical dialogues. The challenge addresses multi-speaker interactions between frontline health workers and care seekers, characterized by spontaneous, noisy and overlapping speech. As part of the challenge, medical conversational dataset comprising 40 hours of development and 15 hours of blind evaluation recordings was released. We provided baseline systems across 4 tasks - speaker diarization, automatic speech recognition, topic identification and dialogue summarization - to enable consistent benchmarking. System performance is evaluated using diarization error rate (DER), time-constrained minimum-permutation word error rate (tcpWER) and ROUGE-L. This paper describes the Phase-I evaluation - data, tasks and baseline systems - along with the summary of the evaluation results.
[665] The PARLO Dementia Corpus: A German Multi-Center Resource for Alzheimer’s Disease
Franziska Braun, Christopher Witzl, Florian Hönig, Elmar Nöth, Tobias Bocklet, Korbinian Riedhammer
Main category: eess.AS
TL;DR: Introduces PARLO Dementia Corpus (PDC), a German speech dataset for Alzheimer’s detection with audio recordings, transcriptions, and clinical metadata from standardized neuropsychological tasks.
Details
Motivation: Need for accessible, non-invasive Alzheimer's detection methods, especially for non-English languages, as current diagnostics rely on costly/invasive biomarkers and lack public datasets.Method: Collected multi-center German dataset from 9 memory clinics with AD patients and controls, using 8 standardized neuropsychological tasks with audio recordings, manual transcriptions, and clinical metadata.
Result: Created first publicly available German benchmark for neurodegenerative disease research, with baseline experiments showing feasibility of automatic speech-based cognitive assessment and diagnostic value of recall-driven speech.
Conclusion: PDC enables multimodal and cross-lingual research on Alzheimer’s detection through speech analysis, addressing the lack of non-English resources in this domain.
Abstract: Early and accessible detection of Alzheimer’s disease (AD) remains a major challenge, as current diagnostic methods often rely on costly and invasive biomarkers. Speech and language analysis has emerged as a promising non-invasive and scalable approach to detecting cognitive impairment, but research in this area is hindered by the lack of publicly available datasets, especially for languages other than English. This paper introduces the PARLO Dementia Corpus (PDC), a new multi-center, clinically validated German resource for AD collected across nine academic memory clinics in Germany. The dataset comprises speech recordings from individuals with AD-related mild cognitive impairment and mild to moderate dementia, as well as cognitively healthy controls. Speech was elicited using a standardized test battery of eight neuropsychological tasks, including confrontation naming, verbal fluency, word repetition, picture description, story reading, and recall tasks. In addition to audio recordings, the dataset includes manually verified transcriptions and detailed demographic, clinical, and biomarker metadata. Baseline experiments on ASR benchmarking, automated test evaluation, and LLM-based classification illustrate the feasibility of automatic, speech-based cognitive assessment and highlight the diagnostic value of recall-driven speech production. The PDC thus establishes the first publicly available German benchmark for multi-modal and cross-lingual research on neurodegenerative diseases.
eess.IV
[666] CogGen: Cognitive-Load-Informed Fully Unsupervised Deep Generative Modeling for Compressively Sampled MRI Reconstruction
Qingyong Zhu, Yumin Tan, Xiang Gu, Dong Liang
Main category: eess.IV
TL;DR: CogGen is a cognitive-load-informed fully unsupervised deep generative model for compressive sensing MRI that uses staged inversion with progressive scheduling of task difficulty to improve reconstruction quality and convergence.
Details
Motivation: Classical fully unsupervised deep generative models (FU-DGMs) like DIP and INR rely on architectural priors but struggle with ill-conditioned inverse problems in compressive sensing MRI, requiring many iterations and being prone to overfitting measurement noise.Method: CogGen casts CS-MRI as staged inversion and regulates “cognitive load” by progressively scheduling intrinsic difficulty and extraneous interference. It replaces uniform data fitting with an easy-to-hard k-space weighting/selection strategy: early iterations emphasize low-frequency, high-SNR, structure-dominant samples, while higher-frequency or noise-dominated measurements are introduced later via self-paced curriculum learning with student-mode and teacher-mode criteria.
Result: Experiments show that CogGen-DIP and CogGen-INR improve fidelity and convergence over strong unsupervised baselines and competitive supervised pipelines.
Conclusion: Cognitive-load-informed staged inversion with progressive difficulty scheduling is an effective approach for fully unsupervised deep generative modeling in compressive sensing MRI reconstruction.
Abstract: Fully unsupervised deep generative modeling (FU-DGM) is promising for compressively sampled MRI (CS-MRI) when training data or compute are limited. Classical FU-DGMs such as DIP and INR rely on architectural priors, but the ill-conditioned inverse problem often demands many iterations and easily overfits measurement noise. We propose CogGen, a cognitive-load-informed FU-DGM that casts CS-MRI as staged inversion and regulates task-side “cognitive load” by progressively scheduling intrinsic difficulty and extraneous interference. CogGen replaces uniform data fitting with an easy-to-hard k-space weighting/selection strategy: early iterations emphasize low-frequency, high-SNR, structure-dominant samples, while higher-frequency or noise-dominated measurements are introduced later. We realize this schedule via self-paced curriculum learning with complementary student-mode (what the model can currently learn) and teacher-mode (what it should follow) criteria, supporting both soft weighting and hard selection. Experiments and analysis show that CogGen-DIP and CogGen-INR improve fidelity and convergence over strong unsupervised baselines and competitive supervised pipelines.
[667] HoloPASWIN: Robust Inline Holographic Reconstruction via Physics-Aware Swin Transformers
Gökhan Koçmarlı, G. Bora Esmer
Main category: eess.IV
TL;DR: HoloPASWIN: A physics-aware Swin Transformer framework for twin-image suppression in digital in-line holography using hierarchical attention and differentiable angular spectrum propagation.
Details
Motivation: Digital in-line holography suffers from twin-image artifacts that degrade reconstruction quality. Traditional CNNs have limited receptive fields for capturing global diffraction patterns, necessitating a more effective deep learning approach for holographic reconstruction.Method: Proposes HoloPASWIN based on Swin Transformer architecture with hierarchical shifted-window attention to capture both local details and long-range dependencies. Uses comprehensive loss function integrating frequency-domain constraints with physical consistency via differentiable angular spectrum propagator.
Result: Validated on large-scale synthetic dataset of 25,000 samples with diverse noise configurations. Demonstrates effective twin-image suppression and robust reconstruction quality.
Conclusion: HoloPASWIN provides an effective physics-aware deep learning solution for holographic reconstruction, addressing twin-image artifacts through transformer architecture and physical constraints.
Abstract: In-line digital holography (DIH) is a widely used lensless imaging technique, valued for its simplicity and capability to image samples at high throughput. However, capturing only intensity of the interference pattern during the recording process gives rise to some unwanted terms such as cross-term and twin-image. The cross-term can be suppressed by adjusting the intensity of reference wave, but the twin-image problem remains. The twin-image is a spectral artifact that superimposes a defocused conjugate wave onto the reconstructed object, severely degrading image quality. While deep learning has recently emerged as a powerful tool for phase retrieval, traditional Convolutional Neural Networks (CNNs) are limited by their local receptive fields, making them less effective at capturing the global diffraction patterns inherent in holography. In this study, we introduce HoloPASWIN, a physics-aware deep learning framework based on the Swin Transformer architecture. By leveraging hierarchical shifted-window attention, our model efficiently captures both local details and long-range dependencies essential for accurate holographic reconstruction. We propose a comprehensive loss function that integrates frequency-domain constraints with physical consistency via a differentiable angular spectrum propagator, ensuring high spectral fidelity. Validated on a large-scale synthetic dataset of 25,000 samples with diverse noise configurations (speckle, shot, read, and dark noise), HoloPASWIN demonstrates effective twin-image suppression and robust reconstruction quality.
[668] Anti-Aliasing Snapshot HDR Imaging Using Non-Regular Sensing
Teresa Stürzenhofäcker, Moritz Klimm, Jürgen Seiler, André Kaup
Main category: eess.IV
TL;DR: A snapshot HDR imaging sensor using spatially varying apertures with two differently sized prototype pixels arranged non-regularly to extend dynamic range while avoiding aliasing artifacts.
Details
Motivation: Snapshot HDR imaging is needed for capturing full dynamic range in single exposures, especially for video and dynamic environments where multi-exposure techniques or complex hardware setups are impractical due to motion.Method: Uses a sensor with spatially varying apertures combining two differently sized prototype pixels. A non-regular pixel arrangement mitigates aliasing and overcomes resolution loss from larger pixels. Reconstruction in Fourier domain leverages sparse representation of natural images to recover high-detail images.
Result: Simulation and analysis show the proposed non-regular HDR sensor layout effectively acquires images with high dynamic range while being free from aliasing artifacts.
Conclusion: The snapshot HDR sensor with non-regular pixel arrangement and Fourier domain reconstruction is an effective approach for high dynamic range imaging in single exposures without aliasing issues.
Abstract: Snapshot HDR imaging is essential to capture the full dynamic range of a scene in a single exposure, making it essential for video and dynamic environments where motion prevents the use of multi-exposure techniques or complex hardware set-ups. This work presents a snapshot HDR imaging sensor that is based on spatially varying apertures, implemented by combining two differently sized prototype pixels. The different light integration areas physically extend the dynamic range towards the lower end, compared to a standard high resolution sensor. A non-regular pixel arrangement is suggested, to mitigate aliasing and overcome a loss in spatial resolution that is associated with increased light integration area of the larger prototype pixel. Subsequent reconstruction in the Fourier domain, where natural images can be sparsely represented allows to recover the image with high detail. The image acquisition approach with the proposed non-regular HDR sensor is simulated and analysed with special emphasis on the spatial resolution. The results suggest the snapshot HDR sensor layout to be an effective way to acquire images with high dynamic range and free from aliasing artefacts.
[669] Limited-Angle CT Reconstruction Using Multi-Volume Latent Consistency Model
Hinako Isogai, Naruki Murahashi, Mitsuhiro Nakamura, Megumi Nakao
Main category: eess.IV
TL;DR: A multi-volume latent diffusion model for limited-angle CT reconstruction that uses 3D latent representations from multiple FOVs as guidance, achieving high-precision organ structure preservation under diverse clinical imaging conditions.
Details
Motivation: Limited-angle CT reconstruction is severely ill-posed due to missing projection angles, requiring prior knowledge for high-precision restoration. Existing diffusion models struggle with accurate 3D organ/vessel structure restoration and contrast preservation, and haven't sufficiently addressed diverse clinical imaging conditions like FOV and projection angle variations.Method: Proposes a multi-volume latent diffusion model using 3D latent representations from multiple effective fields of view as guidance. Introduces consistency models into latent space for fast, stable inference. Uses a Multi-volume encoder to acquire latent variables from different scales (global region and central region) to preserve organ boundaries and internal structures under different FOV conditions.
Result: Achieved high-precision synthetic CT generation: under 60° limited-angle condition, MAE of 10.12 HU and SSIM of 0.9677; under extreme 30° condition, MAE of 16.69 HU and SSIM of 0.9393. Demonstrated stable reconstruction even for unknown projection angle conditions not seen during training.
Conclusion: The proposed method effectively addresses diverse clinical imaging conditions and achieves high-precision CT reconstruction under limited-angle scenarios, confirming applicability to practical clinical settings with varying FOV and projection angle conditions.
Abstract: Limited-angle computed tomography (LACT) reconstruction is an inverse problem with severe ill-posedness arising from missing projection angles, and it is difficult to restore high-precision images without sufficient prior knowledge. In recent years, machine learning methods represented by diffusion models have demonstrated high image generation capabilities. However, accurate restoration of three-dimensional structures of organs and vessels and preservation of contrast remain challenges, and the impact of differences in diverse clinical imaging conditions such as field of view (FOV) and projection angle range on reconstruction accuracy has not been sufficiently investigated. In this study, we propose a multi-volume latent diffusion model that uses three-dimensional latent representations obtained from multiple effective fields of view as guidance for LACT reconstruction in clinical practical problems. The proposed method achieves fast and stable inference by introducing consistency models into latent space, and enables high-precision preservation of organ boundary information and internal structures under different FOV conditions through a Multi-volume encoder that acquires latent variables from different scales of the global region and central region. The evaluation experiments demonstrated that the proposed method achieved high-precision synthetic CT image generation compared to existing methods. Under the limited-angle condition of 60 degrees, MAE of 10.12 HU and SSIM of 0.9677 were achieved, and under the extreme limited-angle condition of 30 degrees, MAE of 16.69 HU and SSIM of 0.9393 were achieved. Furthermore, stable reconstruction performance was demonstrated even for unknown projection angle conditions not included during training, confirming the applicability to diverse imaging conditions in clinical practice.
[670] Adaptive Sampling for Storage of Progressive Images on DNA
Xavier Pic, Nimesh Pinnamaneni, Raja Appuswamy
Main category: eess.IV
TL;DR: DNA-based image storage system using JPEG2000 progressive decoding with adaptive nanopore sequencing for resolution-based random access
Details
Motivation: Address limitations of DNA data storage including high read costs and lack of efficient random access to specific files in mixed oligo poolsMethod: Encode images into DNA using JPEG DNA VM codec with progressive resolution layers, then use nanopore adaptive sampling to selectively sequence only oligos needed for desired resolution
Result: Reduces read costs by enabling retrieval of resolution-reduced image versions without sequencing entire oligo pool, providing PCR-free random access solution
Conclusion: Progressive encoding combined with nanopore adaptive sampling enables efficient DNA-based image storage with reduced read costs and practical random access
Abstract: The short lifespan of traditional data storage media, coupled with an exponential increase in storage demand, has made long-term archival a fundamental problem in the data storage industry and beyond. Consequently, researchers are looking for innovative media solutions that can store data over long time periods at a very low cost. DNA molecules, with their high density, long lifespan, and low energy needs, have emerged as a viable alternative to digital data archival. However, current DNA data storage technologies are facing challenges with respect to cost and reliability. Thus, coding rate and error robustness are critical to scale DNA storage and make it technologically and economically achievable. Moreover, the molecules of DNA that encode different files are often located in the same oligo pool. Without random access solutions at the oligo level, it is very impractical to decode a specific file from these mixed pools, as all oligos need to first be sequenced and decoded before a target file can be retrieved, which greatly deteriorates the read cost. This paper introduces a solution to efficiently encode and store images into DNA molecules, that aims at reducing the read cost necessary to retrieve a resolution-reduced version of an image. This image storage system is based on the Progressive Decoding Functionality of the JPEG2000 codec but can be adapted to any conventional progressive codec. Each resolution layer is encoded into a set of oligos using the JPEG DNA VM codec, a DNA-based coder that aims at retrieving a file with a high reliability. Depending on the desired resolution to be read, the set of oligos as well as the portion of the oligos to be sequenced and decoded are adjusted accordingly. These oligos will be selected at sequencing time, with the help of the adaptive sampling method provided by the Nanopore sequencers, making it a PCR-free random access solution.
[671] ICHOR: A Robust Representation Learning Approach for ASL CBF Maps with Self-Supervised Masked Autoencoders
Xavier Beltran-Urbano, Yiran Li, Xinglin Zeng, Katie R. Jobson, Manuel Taso, Christopher A. Brown, David A. Wolk, Corey T. McMillan, Ilya M. Nashrallah, Paul A. Yushkevich, Ze Wang, John A. Detre, Sudipto Dolui
Main category: eess.IV
TL;DR: ICHOR is a self-supervised pre-training approach using 3D masked autoencoders with Vision Transformers for arterial spin labeling (ASL) perfusion MRI, trained on 11,405 ASL CBF scans to learn transferable representations for downstream diagnostic and quality prediction tasks.
Details
Motivation: ASL perfusion MRI enables noninvasive cerebral blood flow quantification but faces challenges including variable image quality, inter-site/vendor/protocol differences, and limited labeled datasets for training generalizable deep learning models.Method: Developed ICHOR, a self-supervised pre-training approach using 3D masked autoencoders with Vision Transformer backbone. Pre-trained on 11,405 ASL CBF scans from 14 studies across multiple sites and protocols using masked image modeling.
Result: ICHOR outperformed existing neuroimaging self-supervised pre-training methods adapted to ASL across three diagnostic classification tasks and one ASL CBF map quality prediction regression task.
Conclusion: ICHOR provides an effective self-supervised pre-training approach for ASL CBF maps that learns transferable representations and outperforms existing methods, with pre-trained weights and code to be made publicly available.
Abstract: Arterial spin labeling (ASL) perfusion MRI allows direct quantification of regional cerebral blood flow (CBF) without exogenous contrast, enabling noninvasive measurements that can be repeated without constraints imposed by contrast injection. ASL is increasingly acquired in research studies and clinical MRI protocols. Building on successes in structural imaging, recent efforts have implemented deep learning based methods to improve image quality, enable automated quality control, and derive robust quantitative and predictive biomarkers with ASL derived CBF. However, progress has been limited by variable image quality, substantial inter-site, vendor and protocol differences, and limited availability of labeled datasets needed to train models that generalize across cohorts. To address these challenges, we introduce ICHOR, a self supervised pre-training approach for ASL CBF maps that learns transferable representations using 3D masked autoencoders. ICHOR is pretrained via masked image modeling using a Vision Transformer backbone and can be used as a general-purpose encoder for downstream ASL tasks. For pre-training, we curated one of the largest ASL datasets to date, comprising 11,405 ASL CBF scans from 14 studies spanning multiple sites and acquisition protocols. We evaluated the pre-trained ICHOR encoder on three downstream diagnostic classification tasks and one ASL CBF map quality prediction regression task. Across all evaluations, ICHOR outperformed existing neuroimaging self-supervised pre-training methods adapted to ASL. Pre-trained weights and code will be made publicly available.
[672] MedFuncta: A Unified Framework for Learning Efficient Medical Neural Fields
Paul Friedrich, Florentin Bieder, Julian McGinnis, Julia Wolleb, Daniel Rueckert, Philippe C. Cattin
Main category: eess.IV
TL;DR: MedFuncta: A unified framework for large-scale neural field training on diverse medical datasets using meta-learned continuous function representations with improved SIREN activations and scalable training.
Details
Motivation: Current medical imaging research uses discrete data representations that scale poorly with resolution and fail to capture continuous signals. While single-instance neural fields work in medical contexts, scaling them to large datasets remains challenging.Method: Introduces MedFuncta framework that encodes medical data into 1D latent vectors modulating a shared meta-learned neural field. Improves SIREN activations with non-constant frequency parameter ω, connects ω-schedule to layer-wise learning rates, and uses scalable meta-learning with sparse supervision to reduce memory/computation.
Result: Evaluated across diverse medical datasets, shows how to solve downstream tasks on neural data representation. Releases code, model weights, and MedNF dataset containing >500k latent vectors for multi-instance medical neural fields.
Conclusion: MedFuncta provides a unified framework for large-scale neural field training on medical data, addressing scalability challenges while enabling continuous signal representation and downstream task applications.
Abstract: Research in medical imaging primarily focuses on discrete data representations that poorly scale with grid resolution and fail to capture the often continuous nature of the underlying signal. Neural Fields (NFs) offer a powerful alternative by modeling data as continuous functions. While single-instance NFs have successfully been applied in medical contexts, extending them to large-scale medical datasets remains an open challenge. We therefore introduce MedFuncta, a unified framework for large-scale NF training on diverse medical signals. Building on Functa, our approach encodes data into a unified representation, namely a 1D latent vector, that modulates a shared, meta-learned NF, enabling generalization across a dataset. We revisit common design choices, introducing a non-constant frequency parameter $ω$ in widely used SIREN activations, and establish a connection between this $ω$-schedule and layer-wise learning rates, relating our findings to recent work in theoretical learning dynamics. We additionally introduce a scalable meta-learning strategy for shared network learning that employs sparse supervision during training, thereby reducing memory consumption and computational overhead while maintaining competitive performance. Finally, we evaluate MedFuncta across a diverse range of medical datasets and show how to solve relevant downstream tasks on our neural data representation. To promote further research in this direction, we release our code, model weights and the first large-scale dataset - MedNF - containing > 500 k latent vectors for multi-instance medical NFs.
[673] Adiabatic Capacitive Neuron: An Energy-Efficient Functional Unit for Artificial Neural Networks
Sachin Maheshwari, Mike Smart, Himadri Singh Raghav, Themis Prodromakis, Alexander Serb
Main category: eess.IV
TL;DR: A hardware implementation of an Adiabatic Capacitive Neuron (ACN) with improved energy efficiency, accuracy, and robustness over previous designs, featuring 12-bit precision and novel threshold logic for activation functions.
Details
Motivation: To develop a more energy-efficient and robust hardware implementation of artificial neurons for neural network applications, addressing limitations in previous capacitive neuron designs regarding energy consumption, accuracy, and scalability.Method: Implemented a 12-bit single neuron with positive/negative weight support in 0.18μm CMOS technology, featuring a new Threshold Logic design for binary activation functions with low symmetrical offset across process corners and temperature variations.
Result: Achieved >90% energy savings (over 12x improvement) compared to non-adiabatic CMOS Capacitive Neuron benchmark, with maximum offset voltage of 9mV vs 27mV/5mV in conventional designs, and consistent energy savings across supply voltage scaling.
Conclusion: The proposed ACN demonstrates significant improvements in energy efficiency, accuracy, and robustness for hardware neural implementations, making it suitable for energy-constrained neural network applications.
Abstract: This paper introduces a new, highly energy-efficient, Adiabatic Capacitive Neuron (ACN) hardware implementation of an Artificial Neuron (AN) with improved functionality, accuracy, robustness and scalability over previous work. The paper describes the implementation of a \mbox{12-bit} single neuron, with positive and negative weight support, in an $\mathbf{0.18μm}$ CMOS technology. The paper also presents a new Threshold Logic (TL) design for a binary AN activation function that generates a low symmetrical offset across three process corners and five temperatures between $-55^o$C and $125^o$C. Post-layout simulations demonstrate a maximum rising and falling offset voltage of 9$mV$ compared to conventional TL, which has rising and falling offset voltages of 27$mV$ and 5$mV$ respectively, across temperature and process. Moreover, the proposed TL design shows a decrease in average energy of 1.5$%$ at the SS corner and 2.3$%$ at FF corner compared to the conventional TL design. The total synapse energy saving for the proposed ACN was above 90$%$ (over 12x improvement) when compared to a non-adiabatic CMOS Capacitive Neuron (CCN) benchmark for a frequency ranging from 500$kHz$ to 100$MHz$. A 1000-sample Monte Carlo simulation including process variation and mismatch confirms the worst-case energy savings of $>$90$%$ compared to CCN in the synapse energy profile. Finally, the impact of supply voltage scaling shows consistent energy savings of above 90$%$ (except all zero inputs) without loss of functionality.
[674] Graph-Based Multi-Modal Light-weight Network for Adaptive Brain Tumor Segmentation
Guohao Huo, Ruiting Dai, Zitong Wang, Junxin Kong, Hao Tang
Main category: eess.IV
TL;DR: GMLN-BTS: A lightweight graph-based network for brain tumor segmentation that achieves high precision with only 4.58M parameters through modality-aware encoding, graph-based cross-modal interaction, and voxel refinement.
Details
Motivation: Multi-modal brain tumor segmentation models are computationally expensive for practical deployment, creating a need for lightweight yet accurate solutions that can handle multi-modal medical imaging data efficiently.Method: Three key components: 1) Modality-Aware Adaptive Encoder (M2AE) for efficient multi-scale semantic extraction, 2) Graph-based Multi-Modal Collaborative Interaction Module (G2MCIM) using graph structures to model complementary cross-modal relationships, 3) Voxel Refinement UpSampling Module (VRUM) combining linear interpolation with multi-scale transposed convolutions to suppress artifacts and preserve boundaries.
Result: Achieves state-of-the-art performance on BraTS 2017, 2019, and 2021 benchmarks among lightweight models. With only 4.58M parameters, reduces parameter count by 98% compared to mainstream 3D Transformers while outperforming existing compact approaches.
Conclusion: GMLN-BTS provides an effective lightweight solution for multi-modal brain tumor segmentation, balancing computational efficiency with high precision through innovative graph-based cross-modal interaction and efficient architectural design.
Abstract: Multi-modal brain tumor segmentation remains challenging for practical deployment due to the high computational costs of mainstream models. In this work, we propose GMLN-BTS, a Graph-based Multi-modal interaction Lightweight Network for brain tumor segmentation. Our architecture achieves high-precision, resource-efficient segmentation through three key components. First, a Modality-Aware Adaptive Encoder (M2AE) facilitates efficient multi-scale semantic extraction. Second, a Graph-based Multi-Modal Collaborative Interaction Module (G2MCIM) leverages graph structures to model complementary cross-modal relationships. Finally, a Voxel Refinement UpSampling Module (VRUM) integrates linear interpolation with multi-scale transposed convolutions to suppress artifacts and preserve boundary details. Experimental results on BraTS 2017, 2019, and 2021 benchmarks demonstrate that GMLN-BTS achieves state-of-the-art performance among lightweight models. With only 4.58M parameters, our method reduces parameter count by 98% compared to mainstream 3D Transformers while significantly outperforming existing compact approaches.
[675] Learning to Select Like Humans: Explainable Active Learning for Medical Imaging
Ifrat Ikhtear Uddin, Longwei Wang, Xiao Qin, Yang Zhou, KC Santosh
Main category: eess.IV
TL;DR: Explainability-guided active learning framework for medical imaging that combines classification uncertainty with attention misalignment to select samples that improve both performance and clinical interpretability.
Details
Motivation: Medical image analysis requires expensive expert annotation. Traditional active learning methods focus only on predictive uncertainty, ignoring whether models learn clinically meaningful features needed for clinical deployment.Method: Proposes a dual-criterion selection strategy: (1) classification uncertainty to identify informative examples, and (2) attention misalignment between Grad-CAM attention maps and radiologist-defined ROIs using Dice similarity. The framework integrates spatial attention alignment into sample acquisition.
Result: Evaluated on three medical imaging datasets (BraTS, VinDr-CXR, SIIM-COVID-19). Using only 570 strategically selected samples, outperformed random sampling across all datasets: 77.22% accuracy on BraTS, 52.37% on VinDr-CXR, and 52.66% on SIIM-COVID. Grad-CAM visualizations confirmed models focus on diagnostically relevant regions.
Conclusion: Incorporating explanation guidance into active learning sample acquisition yields superior data efficiency while maintaining clinical interpretability, addressing the critical requirement for models to learn clinically meaningful features.
Abstract: Medical image analysis requires substantial labeled data for model training, yet expert annotation is expensive and time-consuming. Active learning (AL) addresses this challenge by strategically selecting the most informative samples for the annotation purpose, but traditional methods solely rely on predictive uncertainty while ignoring whether models learn from clinically meaningful features a critical requirement for clinical deployment. We propose an explainability-guided active learning framework that integrates spatial attention alignment into a sample acquisition process. Our approach advocates for a dual-criterion selection strategy combining: (i) classification uncertainty to identify informative examples, and (ii) attention misalignment with radiologist-defined regions-of-interest (ROIs) to target samples where the model focuses on incorrect features. By measuring misalignment between Grad-CAM attention maps and expert annotations using Dice similarity, our acquisition function judiciously identifies samples that enhance both predictive performance and spatial interpretability. We evaluate the framework using three expert-annotated medical imaging datasets, namely, BraTS (MRI brain tumors), VinDr-CXR (chest X-rays), and SIIM-COVID-19 (chest X-rays). Using only 570 strategically selected samples, our explainability-guided approach consistently outperforms random sampling across all the datasets, achieving 77.22% accuracy on BraTS, 52.37% on VinDr-CXR, and 52.66% on SIIM-COVID. Grad-CAM visualizations confirm that the models trained by our dual-criterion selection focus on diagnostically relevant regions, demonstrating that incorporating explanation guidance into sample acquisition yields superior data efficiency while maintaining clinical interpretability.