Editor’s Picks
Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.
[1] Woosh: A Sound Effects Foundation Model
Gaëtan Hadjeres, Marc Ferras, Khaled Koutini, Benno Weck, Alexandre Bittar, Thomas Hummel, Zineb Lahrici, Hakim Missoum, Joan Serrà, Yuki Mitsufuji
Main category: cs.SD
TL;DR: Woosh is Sony AI’s open-source sound effect foundation model with audio encoder/decoder, text-audio alignment, text-to-audio, and video-to-audio generation capabilities, offering competitive performance against existing open models.
Details
Motivation: The audio research community needs open generative models as foundational tools for building novel approaches and establishing baselines, particularly for sound effects.Method: Developed a comprehensive sound effect foundation model with four key components: (1) high-quality audio encoder/decoder model, (2) text-audio alignment model for conditioning, (3) text-to-audio generative model, and (4) video-to-audio generative model, plus distilled versions for low-resource operation.
Result: Evaluation on both public and private data shows competitive or better performance compared to existing open alternatives like StableAudio-Open and TangoFlux across all modules.
Conclusion: Woosh provides a comprehensive open-source foundation model for sound effects that advances the field by offering multiple generative capabilities with competitive performance and practical deployment options.
Abstract: The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI’s publicly released sound effect foundation model, detailing its architecture, training process, and an evaluation against other popular open models. Being optimized for sound effects, we provide (1) a high-quality audio encoder/decoder model and (2) a text-audio alignment model for conditioning, together with (3) text-to-audio and (4) video-to-audio generative models. Distilled text-to-audio and video-to-audio models are also included in the release, allowing for low-resource operation and fast inference. Our evaluation on both public and private data shows competitive or better performance for each module when compared to existing open alternatives like StableAudio-Open and TangoFlux. Inference code and model weights are available at https://github.com/SonyResearch/Woosh. Demo samples can be found at https://sonyresearch.github.io/Woosh/.
Relevance: 9/10
[2] HippoMM: Hippocampal-inspired Multimodal Memory for Long Audiovisual Event Understanding
Yueqian Lin, Jingyang Zhang, Qinsi Wang, Hancheng Ye, Yuzhe Fu, Yudong Liu, Hai “Helen” Li, Yiran Chen
Main category: cs.MM
TL;DR: HippoMM is a computational cognitive architecture inspired by hippocampal mechanisms for multimodal episodic memory, achieving state-of-the-art performance on audiovisual understanding tasks while being more efficient than retrieval-augmented baselines.
Details
Motivation: Current computational systems struggle with comprehending extended audiovisual experiences, particularly in temporal integration and cross-modal associations that are fundamental to human episodic memory. The authors aim to address these challenges by drawing inspiration from hippocampal mechanisms rather than relying on scaling or architectural sophistication.Method: HippoMM implements three integrated components: 1) Episodic Segmentation that detects audiovisual input changes to split videos into discrete episodes (mirroring dentate gyrus pattern separation), 2) Memory Consolidation that compresses episodes into summaries with key features preserved (analogous to hippocampal memory formation), and 3) Hierarchical Memory Retrieval that first searches semantic summaries, then escalates via temporal window expansion around seed segments for cross-modal queries (mimicking CA3 pattern completion).
Result: On the HippoVlog benchmark testing associative memory, HippoMM achieves state-of-the-art 78.2% accuracy while operating 5x faster than retrieval-augmented baselines. The system demonstrates that cognitive architectures provide blueprints for next-generation multimodal understanding.
Conclusion: The paper shows that biologically-inspired cognitive architectures can effectively address challenges in multimodal episodic memory and audiovisual understanding, offering an alternative approach to scaling-based methods. The integrated hippocampal-inspired components work synergistically to create a system that exceeds the sum of its parts.
Abstract: Comprehending extended audiovisual experiences remains challenging for computational systems, particularly temporal integration and cross-modal associations fundamental to human episodic memory. We introduce HippoMM, a computational cognitive architecture that maps hippocampal mechanisms to solve these challenges. Rather than relying on scaling or architectural sophistication, HippoMM implements three integrated components: (i) Episodic Segmentation detects audiovisual input changes to split videos into discrete episodes, mirroring dentate gyrus pattern separation; (ii) Memory Consolidation compresses episodes into summaries with key features preserved, analogous to hippocampal memory formation; and (iii) Hierarchical Memory Retrieval first searches semantic summaries, then escalates via temporal window expansion around seed segments for cross-modal queries, mimicking CA3 pattern completion. These components jointly create an integrated system exceeding the sum of its parts. On our HippoVlog benchmark testing associative memory, HippoMM achieves state-of-the-art 78.2% accuracy while operating 5x faster than retrieval-augmented baselines. Our results demonstrate that cognitive architectures provide blueprints for next-generation multimodal understanding. The code and benchmark dataset are publicly available at https://github.com/linyueqian/HippoMM.
Relevance: 9/10
[3] Reinforcing Consistency in Video MLLMs with Structured Rewards
Yihao Quan, Zeru Shi, Jinman Zhao, Ruixiang Tang
Main category: cs.CV
TL;DR: The paper proposes a compositional consistency audit and structured reward system to improve visual and temporal grounding in multimodal LLMs for video understanding, addressing hallucination issues through factual and temporal unit-based rewards.
Details
Motivation: Current MLLMs for video understanding often produce plausible outputs with poor visual and temporal grounding, fabricating objects, assigning incorrect attributes, or collapsing repeated events. Standard sentence-level supervision is a weak proxy for faithful video understanding, and RL with sentence-level rewards is too coarse to localize specific grounding failures.Method: 1) Compositional consistency audit that decomposes captions into factual and temporal claims to assess lower-level evidence; 2) Structured reward system with three components: instance-aware scene-graph reward for objects/attributes/relations, temporal reward for event ordering/repetition, and video-grounded VQA reward for hierarchical self-verification.
Result: The structured reward approach yields consistent gains across temporal, general video understanding, and hallucination-oriented benchmarks on open-source backbones, demonstrating improved grounding and reduced hallucinations.
Conclusion: Structured reward shaping is a practical route to more faithful video understanding in MLLMs, addressing fundamental grounding issues that standard sentence-level supervision fails to capture.
Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in video understanding. However, seemingly plausible outputs often suffer from poor visual and temporal grounding: a model may fabricate object existence, assign incorrect attributes, or collapse repeated events while still producing a globally reasonable caption or answer. We study this failure mode through a compositional consistency audit that decomposes a caption into supporting factual and temporal claims, investigating whether a correct high-level prediction is actually backed by valid lower-level evidence. Our top-down audit reveals that even correct root relational claims often lack reliable attribute and existence support. This indicates that standard sentence-level supervision is a weak proxy for faithful video understanding. Furthermore, when turning to reinforcement learning (RL) for better alignment, standard sentence-level rewards often prove too coarse to accurately localize specific grounding failures. To address this, we replace generic sentence-level rewards with a structured reward built from factual and temporal units. Our training objective integrates three complementary components: (1) an instance-aware scene-graph reward for factual objects, attributes, and relations; (2) a temporal reward for event ordering and repetition; and (3) a video-grounded VQA reward for hierarchical self-verification. Across temporal, general video understanding, and hallucination-oriented benchmarks, this objective yields consistent gains on open-source backbones. These results suggest that structured reward shaping is a practical route to more faithful video understanding.
Relevance: 9/10
Today’s Research Highlights
AI-enhanced summaries of the latest research papers from arXiv.
Table of Contents
- cs.CL [Total: 100]
- cs.CV [Total: 212]
- cs.AI [Total: 159]
- cs.SD [Total: 6]
- cs.LG [Total: 151]
- cs.MA [Total: 5]
- cs.MM [Total: 4]
- eess.AS [Total: 8]
- eess.IV [Total: 6]
cs.CL
[1] The Overlooked Repetitive Lengthening Form in Sentiment Analysis
Lei Wang, Eduard Dragut
Main category: cs.CL
TL;DR: Paper introduces Lengthening dataset for Repetitive Lengthening Form (RLF) in informal online communication and proposes ExpInstruct framework to improve LLM understanding of RLF for sentiment analysis.
Details
Motivation: While LMs with informal communications have been studied, the unique Repetitive Lengthening Form (RLF) style (e.g., "soooo gooood") has been overlooked for years. The paper aims to determine if RLF is important for sentiment analysis and whether LMs can understand it.Method: 1) Curated Lengthening dataset with 850k samples focused on RLF for sentiment analysis; 2) Proposed ExpInstruct, a two-stage instruction tuning framework to improve LLM performance and explainability for RLF; 3) Developed novel unified approach to quantify LMs’ understanding of informal expressions.
Result: RLF sentences are expressive expressions and can serve as signatures of document-level sentiment. Fine-tuned PLMs surpass zero-shot GPT-4 in performance but not in explanation for RLF. ExpInstruct improves open-sourced LLMs to match zero-shot GPT-4 in both performance and explainability for RLF with limited samples.
Conclusion: RLF has potential value for online content analysis and sentiment analysis. The proposed ExpInstruct framework effectively enhances LLM understanding of this overlooked informal communication style, bridging the gap between open-source models and GPT-4.
Abstract: Individuals engaging in online communication frequently express personal opinions with informal styles (e.g., memes and emojis). While Language Models (LMs) with informal communications have been widely discussed, a unique and emphatic style, the Repetitive Lengthening Form (RLF), has been overlooked for years. In this paper, we explore answers to two research questions: 1) Is RLF important for sentiment analysis (SA)? 2) Can LMs understand RLF? Inspired by previous linguistic research, we curate \textbf{Lengthening}, the first multi-domain dataset with 850k samples focused on RLF for SA. Moreover, we introduce \textbf{Exp}lainable \textbf{Instruct}ion Tuning (\textbf{ExpInstruct}), a two-stage instruction tuning framework aimed to improve both performance and explainability of LLMs for RLF. We further propose a novel unified approach to quantify LMs’ understanding of informal expressions. We show that RLF sentences are expressive expressions and can serve as signatures of document-level sentiment. Additionally, RLF has potential value for online content analysis. Our results show that fine-tuned Pre-trained Language Models (PLMs) can surpass zero-shot GPT-4 in performance but not in explanation for RLF. Finally, we show ExpInstruct can improve the open-sourced LLMs to match zero-shot GPT-4 in performance and explainability for RLF with limited samples. Code and sample data are available at https://github.com/Tom-Owl/OverlookedRLF
[2] Scaling Reasoning Tokens via RL and Parallel Thinking: Evidence From Competitive Programming
Qianfan Zhang, Tianyu Guo, Xuandi Ren, Jiale Chen, Ming Ding, Ran Xin, Xia Xiao
Main category: cs.CL
TL;DR: Training RL and test-time parallel thinking to scale reasoning tokens for competitive programming, achieving high accuracy with massive token budgets through multi-round parallel verification.
Details
Motivation: Address the challenge of scaling reasoning token budgets for complex competitive programming problems, where traditional single-generation approaches become computationally expensive and limited in reasoning depth.Method: Two complementary approaches: 1) Training-time RL with verification warmup and randomized clipping to improve token efficiency, 2) Test-time multi-round parallel thinking pipeline distributing token budget across threads and rounds of generation, verification, and refinement.
Result: System achieves oracle pass@16 performance at pass@1 using 7.6M tokens per problem, surpassing GPT-5-high on 456 hard competitive programming problems from AetherCode benchmark.
Conclusion: Combining RL training optimizations with structured parallel thinking pipelines enables effective scaling of reasoning tokens for complex problem-solving, demonstrating significant performance gains on competitive programming tasks.
Abstract: We study how to scale reasoning token budgets for competitive programming through two complementary approaches: training-time reinforcement learning (RL) and test-time parallel thinking. During RL training, we observe an approximately log-linear relationship between validation accuracy and the average number of generated reasoning tokens over successive checkpoints, and show two ways to shift this training trajectory: verification RL warmup raises the starting point, while randomized clipping produces a steeper trend in the observed regime. As scaling single-generation reasoning during RL quickly becomes expensive under full attention, we introduce a multi-round parallel thinking pipeline that distributes the token budget across threads and rounds of generation, verification, and refinement. We train the model end-to-end on this pipeline to match the training objective to the test-time structure. Starting from Seed-OSS-36B, the full system with 16 threads and 16 rounds per thread matches the underlying RL model’s oracle pass@16 at pass@1 using 7.6 million tokens per problem on average, and surpasses GPT-5-high on 456 hard competitive programming problems from AetherCode.
[3] M2-Verify: A Large-Scale Multidomain Benchmark for Checking Multimodal Claim Consistency
Abolfazl Ansari, Delvin Ce Zhang, Zhuoyang Zou, Wenpeng Yin, Dongwon Lee
Main category: cs.CL
TL;DR: M2-Verify: A large-scale multimodal dataset for evaluating scientific claim consistency across 16 domains with 469K instances, revealing model struggles with complex visual reasoning and hallucinations.
Details
Motivation: Existing benchmarks lack the scale, domain diversity, and visual complexity needed to realistically evaluate strict consistency between scientific claims and multimodal evidence, creating a gap in assessing multimodal reasoning capabilities.Method: Created M2-Verify dataset sourced from PubMed and arXiv with over 469K instances across 16 domains, rigorously validated through expert audits. Conducted extensive baseline experiments with state-of-the-art models to evaluate claim-evidence consistency.
Result: State-of-the-art models struggle with robust consistency: top models achieve 85.8% Micro-F1 on low-complexity medical perturbations but drop to 61.6% on high-complexity challenges like anatomical shifts. Expert evaluations reveal hallucinations when models generate scientific explanations.
Conclusion: M2-Verify addresses the need for realistic multimodal consistency evaluation, demonstrates current model limitations in complex visual reasoning, and provides a valuable resource with comprehensive usage guidelines for advancing multimodal scientific reasoning.
Abstract: Evaluating scientific arguments requires assessing the strict consistency between a claim and its underlying multimodal evidence. However, existing benchmarks lack the scale, domain diversity, and visual complexity needed to evaluate this alignment realistically. To address this gap, we introduce M2-Verify, a large-scale multimodal dataset for checking scientific claim consistency. Sourced from PubMed and arXiv, M2-Verify provides over 469K instances across 16 domains, rigorously validated through expert audits. Extensive baseline experiments show that state-of-the-art models struggle to maintain robust consistency. While top models achieve up to 85.8% Micro-F1 on low-complexity medical perturbations, performance drops to 61.6% on high-complexity challenges like anatomical shifts. Furthermore, expert evaluations expose hallucinations when models generate scientific explanations for their alignment decisions. Finally, we demonstrate our dataset’s utility and provide comprehensive usage guidelines.
[4] Tracking the emergence of linguistic structure in self-supervised models learning from speech
Marianne de Heer Kloots, Martijn Bentum, Hosein Mohebbi, Charlotte Pouw, Gaofei Shen, Willem Zuidema
Main category: cs.CL
TL;DR: Analysis of when linguistic structure emerges during training of self-supervised speech models (Wav2Vec2 and HuBERT) on Dutch speech, examining layerwise patterns and learning trajectories across different linguistic levels.
Details
Motivation: To understand when and how different levels of linguistic structure emerge during the training of self-supervised speech models, and how this is affected by model architecture and pre-training objectives.Method: Studied six Wav2Vec2 and HuBERT models trained on spoken Dutch, analyzing encoding of various linguistic structures across layers and intermediate checkpoints. Examined layerwise patterns and learning trajectories, relating them to abstraction from acoustic signal and input integration timescales.
Result: Different linguistic structures show distinct layerwise patterns and learning trajectories, partially explained by their abstraction from acoustic signal and integration timescales. Pre-training objectives strongly affect organization and learning, with higher-order prediction tasks (iteratively refined pseudo-labels) inducing greater parallelism.
Conclusion: Linguistic structure emergence in speech models follows systematic patterns related to linguistic abstraction and model architecture, with pre-training objectives playing a crucial role in how structure is organized and learned.
Abstract: Self-supervised speech models learn effective representations of spoken language, which have been shown to reflect various aspects of linguistic structure. But when does such structure emerge in model training? We study the encoding of a wide range of linguistic structures, across layers and intermediate checkpoints of six Wav2Vec2 and HuBERT models trained on spoken Dutch. We find that different levels of linguistic structure show notably distinct layerwise patterns as well as learning trajectories, which can partially be explained by differences in their degree of abstraction from the acoustic signal and the timescale at which information from the input is integrated. Moreover, we find that the level at which pre-training objectives are defined strongly affects both the layerwise organization and the learning trajectories of linguistic structures, with greater parallelism induced by higher-order prediction tasks (i.e. iteratively refined pseudo-labels).
[5] Preference learning in shades of gray: Interpretable and bias-aware reward modeling for human preferences
Simona-Vasilica Oprea, Adela Bâra
Main category: cs.CL
TL;DR: A feature-augmented framework improves human preference learning in LLMs by enriching textual representations with interpretable signals like response length, refusal indicators, toxicity scores, and semantic similarity, achieving up to 0.84 ROC AUC and providing interpretability through SHAP/LIME analysis.
Details
Motivation: Learning human preferences in language models is fundamentally challenging because reward modeling relies on subtle, subjective comparisons rather than clear-cut labels. Current approaches have limitations in capturing the multidimensional nature of human judgment, with baseline performance below 0.74 ROC AUC on the Anthropic HHRLHF dataset.Method: Proposes a feature-augmented framework that enriches textual representations with interpretable signals: response length, refusal indicators, toxicity scores, and prompt-response semantic similarity. Evaluates ten diverse LLMs under standard pairwise preference settings and uses SHAP and LIME for fine-grained interpretability analysis.
Result: The hybrid approach yields consistent improvements across all models, achieving up to 0.84 ROC AUC and significantly higher pairwise accuracy, with DeBERTav3Large demonstrating the best performance. Analysis reveals model decisions depend on contextualized safety and supportive framing rather than isolated keywords, and shows how feature interactions influence preference learning despite weak marginal effects of individual features.
Conclusion: Feature augmentation significantly improves human preference learning in LLMs by capturing key aspects of helpfulness, safety, and relevance. The interpretability analysis provides insights into how models make preference decisions, revealing the importance of contextualized safety and framing over isolated keywords.
Abstract: Learning human preferences in language models remains fundamentally challenging, as reward modeling relies on subtle, subjective comparisons or shades of gray rather than clear-cut labels. This study investigates the limits of current approaches and proposes a feature-augmented framework to better capture the multidimensional nature of human judgment. Using the Anthropic HHRLHF dataset, we evaluate ten diverse large language models LLMs under a standard pairwise preference setting, where baseline performance remains below 0.74 ROC AUC, highlighting the difficulty of the task. To address this, we enrich textual representations with interpretable signals: response length, refusal indicators, toxicity scores and prompt response semantic similarity, enabling models to explicitly capture key aspects of helpfulness, safety and relevance. The proposed hybrid approach yields consistent improvements across all models, achieving up to 0.84 ROC AUC and significantly higher pairwise accuracy, with DeBERTav3Large demonstrating the best performance. Beyond accuracy, we integrate SHAP and LIME to provide fine-grained interpretability, revealing that model decisions depend on contextualized safety and supportive framing rather than isolated keywords. We further analyze bias amplification, showing that while individual features have weak marginal effects, their interactions influence preference learning.
[6] Prosodic ABX: A Language-Agnostic Method for Measuring Prosodic Contrast in Speech Representations
Haitong Sun, Stephen McIntosh, Kwanghee Choi, Eunjung Yeo, Daisuke Saito, Nobuaki Minematsu
Main category: cs.CL
TL;DR: Extends ABX discrimination task to evaluate prosodic contrast in self-supervised speech models using minimal pairs across English, Japanese, and Mandarin
Details
Motivation: While self-supervised speech models are known to capture phonemic contrasts well, their sensitivity to prosodic contrasts (stress, pitch accent, tone) has not been systematically measured. The paper aims to develop a framework for evaluating prosodic contrast sensitivity without requiring large datasets or explicit labels.Method: Introduces prosodic ABX, an extension of the ABX discrimination task framework to evaluate prosodic contrast. Builds and releases datasets of English and Japanese minimal pairs, and uses existing Mandarin dataset. Evaluates contrast in English stress, Japanese pitch accent, and Mandarin tone across different model layers.
Result: Shows that model and layer rankings for prosodic contrast sensitivity are often preserved across different experimental conditions, making the approach practical for low-resource settings. Provides systematic evaluation of how well different self-supervised speech models capture prosodic features.
Conclusion: The prosodic ABX framework enables efficient evaluation of prosodic contrast sensitivity in self-supervised speech models with minimal data requirements. The consistent ranking patterns across conditions suggest the approach is robust and applicable to various languages and prosodic systems.
Abstract: Speech representations from self-supervised speech models (S3Ms) are known to be sensitive to phonemic contrasts, but their sensitivity to prosodic contrasts has not been directly measured. The ABX discrimination task has been used to measure phonemic contrast in S3M representations via minimal pairs. We introduce prosodic ABX, an extension of this framework to evaluate prosodic contrast with only a handful of examples and no explicit labels. Also, we build and release a dataset of English and Japanese minimal pairs and use it along with a Mandarin dataset to evaluate contrast in English stress, Japanese pitch accent, and Mandarin tone. Finally, we show that model and layer rankings are often preserved across several experimental conditions, making it practical for low-resource settings.
[7] Procedural Knowledge at Scale Improves Reasoning
Di Wu, Devendra Singh Sachan, Wen-tau Yih, Mingda Chen
Main category: cs.CL
TL;DR: Reasoning Memory: A RAG framework that retrieves and reuses procedural knowledge from decomposed reasoning trajectories to improve language model performance on reasoning tasks.
Details
Motivation: Current test-time scaling methods treat problems in isolation and underutilize procedural knowledge from prior reasoning trajectories, missing opportunities to reuse knowledge about problem reframing, approach selection, and verification strategies.Method: Decompose existing reasoning trajectories into self-contained subquestion-subroutine pairs (32M entries), then at inference time use lightweight in-thought prompting to verbalize core subquestions, retrieve relevant subroutines, and reason under diverse retrieved subroutines as procedural priors.
Result: Outperforms RAG with document, trajectory, and template knowledge across six math, science, and coding benchmarks, improving over no retrieval by up to 19.2% and over strongest compute-matched baseline by 7.9%.
Conclusion: Gains come from broad procedural coverage of source trajectories and the decomposition/retrieval design, enabling effective extraction and reuse of procedural knowledge for improved reasoning.
Abstract: Test-time scaling has emerged as an effective way to improve language models on challenging reasoning tasks. However, most existing methods treat each problem in isolation and do not systematically reuse knowledge from prior reasoning trajectories. In particular, they underutilize procedural knowledge: how to reframe a problem, choose an approach, and verify or backtrack when needed. We introduce Reasoning Memory, a retrieval-augmented generation (RAG) framework for reasoning models that explicitly retrieves and reuses procedural knowledge at scale. Starting from existing corpora of step-by-step reasoning trajectories, we decompose each trajectory into self-contained subquestion-subroutine pairs, yielding a datastore of 32 million compact procedural knowledge entries. At inference time, a lightweight in-thought prompt lets the model verbalize the core subquestion, retrieve relevant subroutines within its reasoning trace, and reason under diverse retrieved subroutines as implicit procedural priors. Across six math, science, and coding benchmarks, Reasoning Memory consistently outperforms RAG with document, trajectory, and template knowledge, as well as a compute-matched test-time scaling baseline. With a higher inference budget, it improves over no retrieval by up to 19.2% and over the strongest compute-matched baseline by 7.9% across task types. Ablation studies show that these gains come from two key factors: the broad procedural coverage of the source trajectories and our decomposition and retrieval design, which together enable effective extraction and reuse of procedural knowledge.
[8] No Attacker Needed: Unintentional Cross-User Contamination in Shared-State LLM Agents
Tiankai Yang, Jiate Li, Yi Nian, Shen Dong, Ruiyao Xu, Ryan Rossi, Kaize Ding, Yue Zhao
Main category: cs.CL
TL;DR: LLM agents serving multiple users risk unintentional cross-user contamination when shared knowledge artifacts are misapplied across user boundaries, causing silent failures without malicious intent.
Details
Motivation: As LLM-based agents operate across repeated sessions serving multiple users, they maintain shared knowledge states. This creates a risk where information valid for one user can degrade another user's outcomes when reapplied without proper scope consideration, leading to silent failures without requiring adversarial attacks.Method: The paper formalizes unintentional cross-user contamination (UCC) through a controlled evaluation protocol, introduces a taxonomy of three contamination types, and evaluates the problem in two shared-state mechanisms. It tests both raw shared state and write-time sanitization approaches across different interaction types.
Result: Under raw shared state, benign interactions alone produce contamination rates of 57-71%. Write-time sanitization is effective when shared state is conversational, but leaves substantial residual risk when shared state includes executable artifacts, with contamination often manifesting as silent wrong answers.
Conclusion: Shared-state agents need artifact-level defenses beyond text-level sanitization to prevent silent cross-user failures. The research highlights a critical safety issue in multi-user LLM deployments that requires new defense mechanisms.
Abstract: LLM-based agents increasingly operate across repeated sessions, maintaining task states to ensure continuity. In many deployments, a single agent serves multiple users within a team or organization, reusing a shared knowledge layer across user identities. This shared persistence expands the failure surface: information that is locally valid for one user can silently degrade another user’s outcome when the agent reapplies it without regard for scope. We refer to this failure mode as unintentional cross-user contamination (UCC). Unlike adversarial memory poisoning, UCC requires no attacker; it arises from benign interactions whose scope-bound artifacts persist and are later misapplied. We formalize UCC through a controlled evaluation protocol, introduce a taxonomy of three contamination types, and evaluate the problem in two shared-state mechanisms. Under raw shared state, benign interactions alone produce contamination rates of 57–71%. A write-time sanitization is effective when shared state is conversational, but leaves substantial residual risk when shared state includes executable artifacts, with contamination often manifesting as silent wrong answers. These results indicate that shared-state agents need artifact-level defenses beyond text-level sanitization to prevent silent cross-user failures.
[9] J-CHAT: Japanese Large-scale Spoken Dialogue Corpus for Spoken Dialogue Language Modeling
Wataru Nakata, Kentaro Seki, Hitomi Yanaka, Yuki Saito, Shinnosuke Takamichi, Hiroshi Saruwatari
Main category: cs.CL
TL;DR: J-CHAT is a 76,000-hour open-source Japanese spoken dialogue corpus built from YouTube/podcast data with extensive filtering, designed to address limitations in existing spoken dialogue datasets for training effective spoken dialogue systems.
Details
Motivation: Existing spoken dialogue datasets are limited in size, spontaneity, or linguistic coherence, creating barriers for developing effective spoken dialogue systems (SDSs) that require large-scale, high-quality, diverse spoken dialogue corpora.Method: Constructed using automated, language-independent methodology from YouTube and podcast data with extensive filtering and denoising to ensure acoustic cleanliness, diversity, and natural spontaneity.
Result: Created J-CHAT, a 76,000-hour Japanese spoken dialogue corpus, with experimental results showing effectiveness for SDS development when used to train generative spoken dialogue language models.
Conclusion: J-CHAT provides a robust foundation for training advanced dialogue models and is expected to drive progress in human-AI dialogue research and applications.
Abstract: Spoken dialogue is essential for human-AI interactions, providing expressive capabilities beyond text. Developing effective spoken dialogue systems (SDSs) requires large-scale, high-quality, and diverse spoken dialogue corpora. However, existing datasets are often limited in size, spontaneity, or linguistic coherence. To address these limitations, we introduce J-CHAT, a 76,000-hour open-source Japanese spoken dialogue corpus. Constructed using an automated, language-independent methodology, J-CHAT ensures acoustic cleanliness, diversity, and natural spontaneity. The corpus is built from YouTube and podcast data, with extensive filtering and denoising to enhance quality. Experimental results with generative spoken dialogue language models trained on J-CHAT demonstrate its effectiveness for SDS development. By providing a robust foundation for training advanced dialogue models, we anticipate that J-CHAT will drive progress in human-AI dialogue research and applications.
[10] Open-Domain Safety Policy Construction
Di Wu, Siyue Liu, Zixiang Ji, Ya-Liang Chang, Zhe-Yu Liu, Andrew Pleffer, Kai-Wei Chang
Main category: cs.CL
TL;DR: DPR is an agentic system that automatically drafts content moderation policies using only seed domain information, web search, and iterative refinement, outperforming baselines on text and multimodal benchmarks.
Details
Motivation: Drafting and maintaining domain-specific content moderation policies is costly. Current approaches require extensive human effort, and there's a need for automated systems that can generate comprehensive policies from minimal seed information.Method: DPR uses a single web search tool with lightweight scaffolding to iteratively: 1) propose search queries, 2) distill diverse web sources into policy rules, and 3) organize rules into an indexed document. It employs a structured research loop specifically designed for policy drafting.
Result: DPR consistently outperforms definition-only and in-context learning baselines on the OpenAI undesired content benchmark across five domains with two compact reader LLMs. It also performs well on an in-house multimodal advertisement moderation benchmark and is competitive with expert-written policy sections in several domains. DPR outperforms a general-purpose deep research system under the same conditions.
Conclusion: Task-specific, structured research loops are more effective than generic web research for policy drafting. DPR demonstrates that automated systems can generate comprehensive content moderation policies from minimal seed information, reducing the cost of policy maintenance.
Abstract: Moderation layers are increasingly a core component of many products built on user- or model-generated content. However, drafting and maintaining domain-specific safety policies remains costly. We present Deep Policy Research (DPR), a minimal agentic system that drafts a full content moderation policy based on only human-written seed domain information. DPR uses a single web search tool and lightweight scaffolding to iteratively propose search queries, distill diverse web sources into policy rules, and organize rules into an indexed document. We evaluate DPR on (1) the OpenAI undesired content benchmark across five domains with two compact reader LLMs and (2) an in-house multimodal advertisement moderation benchmark. DPR consistently outperforms definition-only and in-context learning baselines, and in our end-to-end setting it is competitive with expert-written policy sections in several domains. Moreover, under the same seed specification and evaluation protocol, DPR outperforms a general-purpose deep research system, suggesting that a task-specific, structured research loop can be more effective than generic web research for policy drafting. We release our experiment code at https://github.com/xiaowu0162/deep-policy-research.
[11] OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
Han Zhu, Lingxuan Ye, Wei Kang, Zengwei Yao, Liyong Guo, Fangjun Kuang, Zhifeng Han, Weiji Zhuang, Long Lin, Daniel Povey
Main category: cs.CL
TL;DR: OmniVoice is a massive multilingual zero-shot TTS model covering 600+ languages using a novel diffusion language model-style discrete non-autoregressive architecture that directly maps text to acoustic tokens.
Details
Motivation: To overcome limitations of conventional discrete NAR TTS models that suffer from performance bottlenecks in complex two-stage pipelines (text-to-semantic-to-acoustic) and to achieve broad multilingual coverage with state-of-the-art performance.Method: Uses a diffusion language model-style discrete NAR architecture with two key innovations: 1) full-codebook random masking strategy for efficient training, and 2) initialization from a pre-trained LLM for superior intelligibility. Trained on 581k-hour multilingual dataset from open-source data.
Result: Achieves the broadest language coverage to date (600+ languages) and delivers state-of-the-art performance across Chinese, English, and diverse multilingual benchmarks.
Conclusion: OmniVoice demonstrates that simplified direct text-to-acoustic mapping with diffusion-style architecture and LLM initialization can achieve superior multilingual TTS performance with unprecedented language coverage.
Abstract: We present OmniVoice, a massive multilingual zero-shot text-to-speech (TTS) model that scales to over 600 languages. At its core is a novel diffusion language model-style discrete non-autoregressive (NAR) architecture. Unlike conventional discrete NAR models that suffer from performance bottlenecks in complex two-stage (text-to-semantic-to-acoustic) pipelines, OmniVoice directly maps text to multi-codebook acoustic tokens. This simplified approach is facilitated by two key technical innovations: (1) a full-codebook random masking strategy for efficient training, and (2) initialization from a pre-trained LLM to ensure superior intelligibility. By leveraging a 581k-hour multilingual dataset curated entirely from open-source data, OmniVoice achieves the broadest language coverage to date and delivers state-of-the-art performance across Chinese, English, and diverse multilingual benchmarks. Our code and pre-trained models are publicly available at https://github.com/k2-fsa/OmniVoice.
[12] Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models
Itay Yona, Dan Barzilay, Michael Karasik, Mor Geva
Main category: cs.CL
TL;DR: Language models use sparse, early-layer MLP neurons for entity retrieval, with single-neuron interventions enabling entity-specific factual recall and canonicalization across aliases.
Details
Motivation: To understand the internal mechanisms by which language models answer entity-centric factual questions, specifically investigating whether they use compact entity retrieval mechanisms or gradual enrichment processes.Method: Localized entity-selective MLP neurons using templated prompts about entities, validated with causal interventions on PopQA-based QA examples. Used a curated set of 200 entities from PopQA, performed negative ablation and controlled injection experiments at placeholder tokens.
Result: Localized neurons concentrate in early layers. Negative ablation causes entity-specific amnesia, while controlled injection improves answer retrieval. Single neuron activation can recover entity-consistent predictions. System shows robustness to aliases, acronyms, misspellings, and multilingual forms (canonicalization). Effect is strong but not universal - coverage higher for popular entities.
Conclusion: Language models use sparse, causally actionable access points (early-layer MLP neurons) for entity-conditioned factual behavior, supporting compact entity retrieval rather than purely gradual enrichment across depth.
Abstract: Language models can answer many entity-centric factual questions, but it remains unclear which internal mechanisms are involved in this process. We study this question across multiple language models. We localize entity-selective MLP neurons using templated prompts about each entity, and then validate them with causal interventions on PopQA-based QA examples. On a curated set of 200 entities drawn from PopQA, localized neurons concentrate in early layers. Negative ablation produces entity-specific amnesia, while controlled injection at a placeholder token improves answer retrieval relative to mean-entity and wrong-cell controls. For many entities, activating a single localized neuron is sufficient to recover entity-consistent predictions once the context is initialized, consistent with compact entity retrieval rather than purely gradual enrichment across depth. Robustness to aliases, acronyms, misspellings, and multilingual forms supports a canonicalization interpretation. The effect is strong but not universal: not every entity admits a reliable single-neuron handle, and coverage is higher for popular entities. Overall, these results identify sparse, causally actionable access points for analyzing and modulating entity-conditioned factual behavior.
[13] Assessing Pause Thresholds for empirical Translation Process Research
Devi Sri Bandaru, Michael Carl, Xinyue Ren
Main category: cs.CL
TL;DR: Paper compares methods for determining pause thresholds in typing to distinguish automated vs. reflective translation processes, proposing a new method for computing Production Unit Breaks.
Details
Motivation: To address the long-standing debate about how to determine pause thresholds that separate automated translation processes from reflective/problem-solving ones in typing behavior analysis.Method: Compares three recent approaches for computing pause thresholds and proposes/evaluates a novel method for computing Production Unit Breaks based on typing behavior analysis.
Result: The paper likely presents comparative results of different pause threshold methods and demonstrates the effectiveness of the proposed novel method for identifying translation process boundaries.
Conclusion: Provides methodological contributions to translation process research by offering improved techniques for analyzing typing pauses to understand cognitive processes in translation.
Abstract: Text production (and translations) proceeds in the form of stretches of typing, interrupted by keystroke pauses. It is often assumed that fast typing reflects unchallenged/automated translation production while long(er) typing pauses are indicative of translation problems, hurdles or difficulties. Building on a long discussion concerning the determination of pause thresholds that separate automated from presumably reflective translation processes (O’Brien, 2006; Alves and Vale, 2009; Timarova et al., 2011; Dragsted and Carl, 2013; Lacruz et al., 2014; Kumpulainen, 2015; Heilmann and Neumann 2016), this paper compares three recent approaches for computing these pause thresholds, and suggest and evaluate a novel method for computing Production Unit Breaks.
[14] Adaptive Stopping for Multi-Turn LLM Reasoning
Xiaofan Zhou, Huy Nguyen, Bo Yu, Chenxi Liu, Lu Cheng
Main category: cs.CL
TL;DR: MiCP introduces a conformal prediction framework for multi-turn LLM reasoning that provides formal coverage guarantees while enabling early stopping to reduce costs and latency.
Details
Motivation: Multi-turn reasoning methods like adaptive RAG and ReAct improve LLM accuracy but lack formal guarantees about when to stop, risking either unnecessary costs (too many turns) or incorrect decisions (stopping too early), especially in high-stakes domains.Method: MiCP extends conformal prediction to multi-turn pipelines by allocating different error budgets across turns, allowing the model to stop early while maintaining overall coverage guarantees. It’s applied to adaptive RAG and ReAct agents.
Result: MiCP achieves target coverage on both single-hop and multi-hop QA benchmarks while reducing number of turns, inference cost, and prediction set size. A new joint metric evaluates both coverage validity and answering efficiency.
Conclusion: MiCP provides the first formal framework for multi-turn LLM reasoning with coverage guarantees, enabling efficient early stopping without sacrificing reliability, particularly valuable for high-stakes applications.
Abstract: Large Language Models (LLMs) increasingly rely on multi-turn reasoning and interaction, such as adaptive retrieval-augmented generation (RAG) and ReAct-style agents, to answer difficult questions. These methods improve accuracy by iteratively retrieving information, reasoning, or acting, but introduce a key challenge: \textbf{When should the model stop?} Existing approaches rely on heuristic stopping rules or fixed turn budgets and provide no formal guarantees that the final prediction still contains the correct answer. This limitation is particularly problematic in high-stakes domains such as finance and healthcare, where unnecessary turns increase cost and latency, while stopping too early risks incorrect decisions. Conformal prediction (CP) provides formal coverage guarantees, but existing LLM-CP methods only apply to a single model output and cannot handle multi-turn pipelines with adaptive stopping. To address this gap, we propose Multi-Turn Language Models with Conformal Prediction (MiCP), the first CP framework for multi-turn reasoning. MiCP allocates different error budgets across turns, enabling the model to stop early while maintaining an overall coverage guarantee. We demonstrate MiCP on adaptive RAG and ReAct, where it achieves the target coverage on both single-hop and multi-hop question answering benchmarks while reducing the number of turns, inference cost, and prediction set size. We further introduce a new metric that jointly evaluates coverage validity and answering efficiency.
[15] Cost-Efficient Estimation of General Abilities Across Benchmarks
Michael Krumdick, Adam Wiemerslage, Seth Ebner, Charles Lovering, Chris Tanner
Main category: cs.CL
TL;DR: A framework for efficient benchmarking of large language models using predictive validity, with a dataset of 65 models on 163 tasks, showing that combining modified IRT with adaptive item selection can predict performance on unseen tasks with 7% MAE after observing only 16 items.
Details
Motivation: Current LLM benchmarking uses thousands of diverse benchmarks, but model performance can be explained by a small set of latent factors. There's a need for more efficient and principled benchmarking methods grounded in predictive validity - how well they can predict model performance on unseen tasks.Method: Created the Wide-scale Item Level Dataset (WILD) with 65 models on 109,564 unique items spanning 163 tasks from 27 datasets. Used modified multidimensional item response theory (IRT) combined with adaptive item selection driven by optimal experimental design. Incorporated cost-aware discount factors to reduce evaluation costs.
Result: The method can predict performance on 112 held-out benchmark tasks with mean absolute error (MAE) less than 7% after observing only 16 items. Cost-aware selection reduced tokens needed for 7% MAE from 141,000 to 22,000 tokens (85% reduction in evaluation cost).
Conclusion: The proposed benchmarking framework based on predictive validity enables efficient and principled evaluation of LLMs, significantly reducing evaluation costs while maintaining accurate performance predictions on unseen tasks.
Abstract: Thousands of diverse benchmarks have been developed to measure the quality of large language models (LLMs). Yet prior work has demonstrated that LLM performance is often sufficiently explained by a small set of latent factors, or abilities. This suggests the potential for more efficient and principled benchmarking, but it remains difficult to compare the quality of different methods. Motivated by predictive validity, we argue that the quality of a benchmarking framework should be grounded in how efficiently it enables the prediction of model performance on unseen tasks. To analyze this objective, we collect the “Wide-scale Item Level Dataset” (WILD), a dataset of item-model response pairs, comprising evaluations of 65 models on 109,564 unique items spanning 163 tasks drawn from 27 datasets. This dataset enables the first analysis of how different techniques can predict a model’s performance on a large, diverse collection of unseen tasks under different budget constraints. We demonstrate that combining a modified multidimensional item response theory (IRT) model with adaptive item selection driven by optimal experimental design can predict performance on 112 held-out benchmark tasks with a mean absolute error (MAE) of less than 7%, and can do so after observing only 16 items. We further demonstrate that incorporating cost-aware discount factors into our selection criteria can reduce the total tokens needed to reach 7% MAE from 141,000 tokens to only 22,000, an 85% reduction in evaluation cost.
[16] The power of context: Random Forest classification of near synonyms. A case study in Modern Hindi
Jacek Bąkowski
Main category: cs.CL
TL;DR: Random Forest classifier using Hindi word embeddings can distinguish Sanskrit vs Perso-Arabic origin synonyms based on distributional patterns alone, showing etymology leaves traces in usage patterns.
Details
Motivation: To test whether synonyms with different etymological origins (Sanskrit vs Perso-Arabic in Hindi) can be distinguished using distributional data alone, regardless of semantic content, providing quantitative evidence for subtle distinctions in synonym usage patterns.Method: Used Random Forest classifier trained on word embeddings of Hindi synonyms to classify words by their origin (Sanskrit or Perso-Arabic), focusing on distributional patterns rather than semantic content.
Result: The classifier successfully distinguished synonyms by origin even when semantically unrelated, showing that usage patterns preserve etymological signals and that synonyms with different origins form distinct conceptual subspaces.
Conclusion: Synonyms carry subtle but systematic distinctions linked to their historical origin, creating distinct conceptual subspaces shaped by etymology, and context encodes etymological signals beyond traditional semantic similarity.
Abstract: Synonymy is a widespread yet puzzling linguistic phenomenon. Absolute synonyms theoretically should not exist, as they do not expand language’s expressive potential. However, it was suggested that even if synonyms denote the same concept, they may reflect different perspectives or carry distinct cultural associations, claims that have rarely been tested quantitatively. In Hindi, prolonged contact with Persian produced many Perso-Arabic loanwords coexisting with their Sanskrit counterpart, forming numerous synonym pairs. This study investigates whether centuries after these borrowings appeared in the Subcontinent their origin can still be distinguished using distributional data alone and regardless of their semantic content. A Random Forest trained on word embeddings of Hindi synonyms successfully classified words by Sanskrit or Perso-Arabic origin, even when they were semantically unrelated, suggesting that usage patterns preserve traces of etymology. These findings provide quantitative evidence that context encodes etymological signals and that synonymy may reflect subtle but systematic distinctions linked to origin. They support the idea that synonymous words can offer different perspectives and that etymologically related words may form distinct conceptual subspaces, creating a new type of semantic frame shaped by historical origin. Overall, the results highlight the power of context in capturing nuanced distinctions beyond traditional semantic similarity.
[17] Are Finer Citations Always Better? Rethinking Granularity for Attributed Generation
Hexuan Wang, Jingyu Zhang, Benjamin Van Durme, Daniel Khashabi
Main category: cs.CL
TL;DR: Fine-grained citations degrade attribution quality; paragraph-level citations work best for models, balancing human verification needs with model capabilities.
Details
Motivation: Citation granularity (sentence vs. paragraph vs. document level) is a key design choice in attributed generation, but the impact of fine-grained citations on model performance is not well understood despite their preference for human verification.Method: Analyzed four model scales (8B-120B parameters) to study how different citation granularities affect attribution quality, comparing sentence-level, paragraph-level, and multi-paragraph citation approaches.
Result: Fine-grained citations degrade attribution quality by 16-276% compared to optimal granularity; paragraph-level citations perform best; larger models are disproportionately penalized by fine-grained constraints; citation-optimal granularity improves attribution while maintaining answer correctness.
Conclusion: Optimizing solely for human verification via fine-grained citations disregards model constraints and compromises attribution faithfulness; effective attribution requires aligning citation granularity with the model’s natural semantic scope rather than forcing atomic units.
Abstract: Citation granularity - whether to cite individual sentences, paragraphs, or documents - is a critical design choice in attributed generation. While fine-grained citations are often preferred for precise human verification, their impact on model performance remains under-explored. We analyze four model scales (8B-120B) and demonstrate that enforcing fine-grained citations degrades attribution quality by 16-276% compared to the best-performing granularity. We observe a consistent performance pattern where attribution quality peaks at intermediate granularities (paragraph-level). Our analysis suggests that fine-grained (sentence-level) citations disrupt necessary semantic dependencies for attributing evidence to answer claims, while excessively coarse citations (multi-paragraph) introduce distracting noise. Importantly, the magnitude of this performance gap varies non-monotonically with model scale: fine-grained constraints disproportionately penalize larger models, suggesting that atomic citation units disrupt the multi-sentence information synthesis at which these models excel. Strikingly, citation-optimal granularity leads to substantial gains in attribution quality while preserving or even improving answer correctness. Overall, our findings demonstrate that optimizing solely for human verification via fine-grained citation disregards model constraints, compromising both attribution faithfulness and generation reliability. Instead, effective attribution requires aligning citation granularity with the model’s natural semantic scope.
[18] Wired for Overconfidence: A Mechanistic Perspective on Inflated Verbalized Confidence in LLMs
Tianyi Zhao, Yinhan He, Wendy Zheng, Yujie Zhang, Chen Chen
Main category: cs.CL
TL;DR: Mechanistic analysis of LLM verbalized overconfidence: identifies specific circuits that inflate confidence and enables targeted recalibration interventions.
Details
Motivation: LLMs often express high confidence even when factually incorrect, which misleads users and undermines confidence scores as reliable uncertainty signals. The internal mechanisms behind this verbalized overconfidence remain poorly understood.Method: Circuit-level mechanistic analysis organized around three axes: 1) capturing verbalized confidence as differentiable internal signal, 2) identifying circuits that causally inflate it, and 3) leveraging insights for targeted inference-time recalibration. Analyzed two instruction-tuned LLMs on three datasets.
Result: Found that a compact set of MLP blocks and attention heads, concentrated in middle-to-late layers, consistently writes the confidence-inflation signal at the final token position. Targeted inference-time interventions on these circuits substantially improve calibration.
Conclusion: Verbalized overconfidence in LLMs is driven by identifiable internal circuits and can be mitigated through targeted intervention, providing a mechanistic understanding of confidence miscalibration.
Abstract: Large language models are often not just wrong, but \emph{confidently wrong}: when they produce factually incorrect answers, they tend to verbalize overly high confidence rather than signal uncertainty. Such verbalized overconfidence can mislead users and weaken confidence scores as a reliable uncertainty signal, yet its internal mechanisms remain poorly understood. We present a circuit-level mechanistic analysis of this inflated verbalized confidence in LLMs, organized around three axes: capturing verbalized confidence as a differentiable internal signal, identifying the circuits that causally inflate it, and leveraging these insights for targeted inference-time recalibration. Across two instruction-tuned LLMs on three datasets, we find that a compact set of MLP blocks and attention heads, concentrated in middle-to-late layers, consistently writes the confidence-inflation signal at the final token position. We further show that targeted inference-time interventions on these circuits substantially improve calibration. Together, our results suggest that verbalized overconfidence in LLMs is driven by identifiable internal circuits and can be mitigated through targeted intervention.
[19] A Dynamic Atlas of Persian Poetic Symbolism: Families, Fields, and the Historical Rewiring of Meaning
Kourosh Shahnazari, Seyed Moein Ayyoubzadeh, Mohammadali Keshtparvar
Main category: cs.CL
TL;DR: Computational analysis of Persian poetry reveals symbolic families and their evolving relationships across centuries, showing changing patterns in imagery, sacred references, and courtly vocabulary.
Details
Motivation: Current computational approaches flatten Persian poetry into isolated words or broad semantics, missing the organized symbolic families and recurring relations that are fundamental to Persian poetics.Method: Analyzed 129,451 poems, consolidated recurrent forms into traceable families, separated imagistic material from sacred/courtly references, and mapped relations in multi-layer graphs across 11 Hijri-century bins.
Result: Found symbolic core is sparse, referential component dense; symbolic families show temporal patterns (wine vessels/gardens strengthen later, courtly vocabulary earlier); modularity rises, cross-scope linkage declines; hub positions shift over time.
Conclusion: Persian symbolism is a dynamic system with changing internal weights and connections over time, not a fixed repertory, revealing long-term evolution in poetic expression.
Abstract: Persian poetry is often remembered through recurrent symbols before it is remembered through plot. Wine vessels, gardens, flames, sacred titles, bodily beauty, and courtly names return across centuries, yet computational work still tends to flatten this material into isolated words or broad document semantics. That misses a practical unit of organization in Persian poetics: related forms travel as families and gain force through recurring relations. Using a corpus of 129,451 poems, we consolidate recurrent forms into traceable families, separate imagistic material from sacred and courtly reference, and map their relations in a multi-layer graph. The symbolic core is relatively sparse, the referential component much denser, and the attachment zone between them selective rather than diffuse. Across 11 Hijri-century bins, some families remain widely distributed, especially Shab (Night), Ruz (Day), and Khaak (Earth). Wine vessels, garden space, flame, and lyric sound strengthen later, while prestige-coded and heroic-courtly vocabulary is weighted earlier. Century-specific graphs show change in arrangement as well as membership. Modularity rises, cross-scope linkage declines, courtly bridges weaken, and sacred bridges strengthen. Hub positions shift too: Kherqe (Sufi Robe) gains late prominence, Farkhondeh {Blessed} and Banafsheh (Violet) recede, and Saaghar (Wine Cup) stays central across the chronology. In this corpus, Persian symbolism appears less as a fixed repertory than as a long-lived system whose internal weights and connections change over time.
[20] Magic, Madness, Heaven, Sin: LLM Output Diversity is Everything, Everywhere, All at Once
Harnoor Dhingra
Main category: cs.CL
TL;DR: The paper introduces a framework (Magic, Madness, Heaven, Sin) to analyze output variation in LLMs across four normative contexts: epistemic (factuality), interactional (user utility), societal (representation), and safety (robustness), arguing for context-aware evaluation rather than treating variation as an intrinsic model trait.
Details
Motivation: Current research on LLM output variation uses fragmented terminology because normative objectives underlying tasks are rarely made explicit. There's a need for a unified framework to understand how output variation should be evaluated differently based on task contexts and objectives.Method: The authors introduce the Magic, Madness, Heaven, Sin framework that models output variation along a homogeneity-heterogeneity axis. They organize tasks into four normative contexts and examine failure modes and vocabulary for each. They analyze all pairwise cross-contextual interactions between these contexts.
Result: The framework reveals that optimizing for one objective (e.g., improving safety) can inadvertently harm other objectives (e.g., demographic representation or creative diversity). It provides a systematic way to understand trade-offs between different normative goals in LLM evaluation.
Conclusion: Output variation should be reframed as a property shaped by task objectives rather than a model’s intrinsic trait. Context-aware evaluation is necessary because what constitutes desirable variation differs across normative contexts (epistemic, interactional, societal, safety).
Abstract: Research on Large Language Models (LLMs) studies output variation across generation, reasoning, alignment, and representational analysis, often under the umbrella of “diversity.” Yet the terminology remains fragmented, largely because the normative objectives underlying tasks are rarely made explicit. We introduce the Magic, Madness, Heaven, Sin framework, which models output variation along a homogeneity-heterogeneity axis, where valuation is determined by the task and its normative objective. We organize tasks into four normative contexts: epistemic (factuality), interactional (user utility), societal (representation), and safety (robustness). For each, we examine the failure modes and vocabulary such as hallucination, mode collapse, bias, and erasure through which variation is studied. We apply the framework to analyze all pairwise cross-contextual interactions, revealing that optimizing for one objective, such as improving safety, can inadvertently harm demographic representation or creative diversity. We argue for context-aware evaluation of output variation, reframing it as a property shaped by task objectives rather than a model’s intrinsic trait.
[21] Why Instruction-Based Unlearning Fails in Diffusion Models?
Zeliang Zhang, Rui Sun, Jiani Liu, Qi Wu, Chenliang Xu
Main category: cs.CL
TL;DR: Diffusion models fail to suppress targeted concepts through natural-language unlearning instructions alone, unlike LLMs, revealing a fundamental limitation of prompt-level control in image generation.
Details
Motivation: To investigate whether instruction-based unlearning, which works well for large language models, can be extended to diffusion-based image generation models for modifying model behavior at inference time.Method: Conducted controlled experiments across multiple concepts and prompt variants, analyzing CLIP text encoder and cross-attention dynamics during denoising to understand why unlearning instructions fail.
Result: Diffusion models systematically fail to suppress targeted concepts when guided solely by natural-language unlearning instructions. Unlearning instructions do not induce sustained reductions in attention to targeted concept tokens, causing concept representations to persist throughout generation.
Conclusion: There’s a fundamental limitation of prompt-level instruction in diffusion models, suggesting effective unlearning requires interventions beyond inference-time language control, unlike in LLMs.
Abstract: Instruction-based unlearning has proven effective for modifying the behavior of large language models at inference time, but whether this paradigm extends to other generative models remains unclear. In this work, we investigate instruction-based unlearning in diffusion-based image generation models and show, through controlled experiments across multiple concepts and prompt variants, that diffusion models systematically fail to suppress targeted concepts when guided solely by natural-language unlearning instructions. By analyzing both the CLIP text encoder and cross-attention dynamics during the denoising process, we find that unlearning instructions do not induce sustained reductions in attention to the targeted concept tokens, causing the targeted concept representations to persist throughout generation. These results reveal a fundamental limitation of prompt-level instruction in diffusion models and suggest that effective unlearning requires interventions beyond inference-time language control.
[22] Read More, Think More: Revisiting Observation Reduction for Web Agents
Masafumi Enomoto, Ryoma Obara, Haochen Zhang, Masafumi Oyamada
Main category: cs.CL
TL;DR: HTML vs. compact web page representations for LLM web agents: optimal choice depends on model capability and token budget, with HTML better for high-capability models and compact representations better for lower-capability ones.
Details
Motivation: Prior work treats verbose HTML as an obstacle for LLM web agents and adopts observation reduction, but this paper revisits this trend to understand optimal observation representation strategies.Method: Systematic evaluation comparing different observation representations (HTML vs. accessibility trees) across models with varying capabilities and thinking token budgets, with error analysis to understand failure modes.
Result: Compact observations work better for lower-capability models, while detailed HTML benefits higher-capability models, especially with more thinking tokens. Higher-capability models exploit HTML layout information for action grounding, while lower-capability models suffer hallucination with longer inputs. Observation history improves performance, and diff-based representations offer token efficiency.
Conclusion: Observation representation should be adaptively selected based on model capability and token budget, with diff-based representations for incorporating observation history. HTML is valuable for capable models with sufficient thinking tokens.
Abstract: Web agents based on large language models (LLMs) rely on observations of web pages – commonly represented as HTML – as the basis for identifying available actions and planning subsequent steps. Prior work has treated the verbosity of HTML as an obstacle to performance and adopted observation reduction as a standard practice. We revisit this trend and demonstrate that the optimal observation representation depends on model capability and thinking token budget: (1) compact observations (accessibility trees) are preferable for lower-capability models, while detailed observations (HTML) are advantageous for higher-capability models; moreover, increasing thinking tokens further amplifies the benefit of HTML. (2) Our error analysis suggests that higher-capability models exploit layout information in HTML for better action grounding, while lower-capability models suffer from increased hallucination under longer inputs. We also find that incorporating observation history improves performance across most models and settings, and a diff-based representation offers a token-efficient alternative. Based on these findings, we suggest practical guidelines: adaptively select observation representations based on model capability and thinking token budget, and incorporate observation history using diff-based representations.
[23] Countering Catastrophic Forgetting of Large Language Models for Better Instruction Following via Weight-Space Model Merging
Mengxian Lyu, Cheng Peng, Ziyi Chen, Mengyuan Zhang, Jieting Li Lu, Yonghui Wu
Main category: cs.CL
TL;DR: Medical LLM adaptation framework using model merging to prevent catastrophic forgetting when fine-tuning general-purpose LLMs for clinical applications.
Details
Motivation: LLMs often lose instruction-following ability when fine-tuned on medical datasets, creating a barrier for clinical adoption. Need efficient adaptation methods that preserve both domain expertise and general capabilities.Method: Proposes model merging framework using interpolation-based merge methods to combine a clinical foundation model (GatorTronLlama) with a general instruct model (Llama-3.1-8B-Instruct) to create domain-adapted models.
Result: Merged models effectively mitigate catastrophic forgetting, preserve clinical expertise, retain instruction-following ability, and achieve performance comparable to fully fine-tuned baselines with significantly less supervision (64-shot vs 256-shot).
Conclusion: Weight-space merging provides scalable solution for adapting open-source LLMs to clinical applications, enabling broader deployment in resource-constrained healthcare environments.
Abstract: Large language models have been adopted in the medical domain for clinical documentation to reduce clinician burden. However, studies have reported that LLMs often “forget” a significant amount of instruction-following ability when fine-tuned using a task-specific medical dataset, a critical challenge in adopting general-purpose LLMs for clinical applications. This study presents a model merging framework to efficiently adapt general-purpose LLMs to the medical domain by countering this forgetting issue. By merging a clinical foundation model (GatorTronLlama) with a general instruct model (Llama-3.1-8B-Instruct) via interpolation-based merge methods, we seek to derive a domain-adapted model with strong performance on clinical tasks while retaining instruction-following ability. Comprehensive evaluation across medical benchmarks and five clinical generation tasks (e.g., radiology and discharge summarization) shows that merged models can effectively mitigate catastrophic forgetting, preserve clinical domain expertise, and retain instruction-following ability. In addition, our model merging strategies demonstrate training efficiency, achieving performance on par with fully fine-tuned baselines under severely constrained supervision (e.g., 64-shot vs. 256-shot). Consequently, weight-space merging constitutes a highly scalable solution for adapting open-source LLMs to clinical applications, facilitating broader deployment in resource-constrained healthcare environments.
[24] DeltaMem: Towards Agentic Memory Management via Reinforcement Learning
Qi Zhang, Shen Huang, Chu Liu, Shouqing Yang, Junbo Zhao, Haobo Wang, Pengjun Xie
Main category: cs.CL
TL;DR: DeltaMem: An agentic memory management system for persona-centric memory that formulates memory management as an end-to-end task using reinforcement learning with Memory-based Levenshtein Distance rewards.
Details
Motivation: Existing persona-centric memory systems in multi-agent frameworks suffer from information loss and fragility across varying scenarios, leading to suboptimal performance in conversational settings.Method: Proposes DeltaMem as a single-agent memory management system, creates a user-assistant dialogue dataset with operation-level memory updating labels, introduces Memory-based Levenshtein Distance for reward formalization, and develops a tailored reinforcement learning framework.
Result: DeltaMem outperforms all product-level baselines across diverse long-term memory benchmarks including LoCoMo, HaluMem, and PersonaMem, with both training-free and RL-trained versions showing superior performance.
Conclusion: DeltaMem provides an effective single-agent solution for persona-centric memory management that addresses limitations of complex multi-agent frameworks through end-to-end formulation and reinforcement learning optimization.
Abstract: Recent advances in persona-centric memory have revealed the powerful capability of multi-agent systems in managing persona memory, especially in conversational scenarios. However, these complex frameworks often suffer from information loss and are fragile across varying scenarios, resulting in suboptimal performance. In this paper, we propose DeltaMem, an agentic memory management system that formulates persona-centric memory management as an end-to-end task within a single-agent setting. To further improve the performance of our agentic memory manager, we draw inspiration from the evolution of human memory and synthesize a user-assistant dialogue dataset along with corresponding operation-level memory updating labels. Building on this, we introduce a novel Memory-based Levenshtein Distance to formalize the memory updating reward, and propose a tailored reinforcement learning framework to further enhance the management capabilities of DeltaMem. Extensive experiments show that both training-free and RL-trained DeltaMem outperform all product-level baselines across diverse long-term memory benchmarks, including LoCoMo, HaluMem, and PersonaMem.
[25] Swift-SVD: Theoretical Optimality Meets Practical Efficiency in Low-Rank LLM Compression
Ruoling Qi, Yirui Liu, Xuaner Wu, Xiangyu Wang, Ming Li, Chen Chen, Jian Chen, Yin Chen, Qizhen Weng
Main category: cs.CL
TL;DR: Swift-SVD: A training-free, activation-aware SVD compression framework for LLMs that achieves optimal reconstruction with 3-70x speedup over existing methods.
Details
Motivation: LLM deployment is constrained by memory/bandwidth demands of weights and KV cache. Existing SVD compression methods are either suboptimal in reconstruction error or theoretically optimal but practically inefficient.Method: Proposes Swift-SVD: activation-aware, closed-form compression framework that incrementally aggregates covariance of output activations, performs single eigenvalue decomposition after aggregation, uses effective rank analysis for layer-wise compressibility, and employs dynamic rank allocation considering both local reconstruction loss and layer importance.
Result: Outperforms state-of-the-art baselines across six LLMs and eight datasets, achieving optimal compression accuracy with 3-70x speedups in end-to-end compression time.
Conclusion: Swift-SVD provides a hardware-friendly solution for LLM compression that simultaneously guarantees theoretical optimum, practical efficiency, and numerical stability.
Abstract: The deployment of Large Language Models is constrained by the memory and bandwidth demands of static weights and dynamic Key-Value cache. SVD-based compression provides a hardware-friendly solution to reduce these costs. However, existing methods suffer from two key limitations: some are suboptimal in reconstruction error, while others are theoretically optimal but practically inefficient. In this paper, we propose Swift-SVD, an activation-aware, closed-form compression framework that simultaneously guarantees theoretical optimum, practical efficiency and numerical stability. Swift-SVD incrementally aggregates covariance of output activations given a batch of inputs and performs a single eigenvalue decomposition after aggregation, enabling training-free, fast, and optimal layer-wise low-rank approximation. We employ effective rank to analyze local layer-wise compressibility and design a dynamic rank allocation strategy that jointly accounts for local reconstruction loss and end-to-end layer importance. Extensive experiments across six LLMs and eight datasets demonstrate that Swift-SVD outperforms state-of-the-art baselines, achieving optimal compression accuracy while delivering 3-70X speedups in end-to-end compression time. Our code will be released upon acceptance.
[26] Grounding AI-in-Education Development in Teachers’ Voices: Findings from a National Survey in Indonesia
Nurul Aisyah, Muhammad Dehan Al Kautsar, Arif Hidayat, Fajri Koto
Main category: cs.CL
TL;DR: Nationwide survey of 349 Indonesian K-12 teachers reveals increasing but uneven AI adoption for pedagogy, content development, and teaching media, with elementary teachers showing more consistent use than senior high teachers.
Details
Motivation: Limited large-scale, teacher-centred evidence on AI use in Indonesian classrooms hinders development of context-appropriate AI systems and policies, creating a need for empirical data on actual AI adoption patterns and teacher support requirements.Method: Conducted a nationwide survey of 349 K-12 teachers across elementary, junior high, and senior high schools in Indonesia to gather empirical data on AI usage patterns, adoption barriers, and support needs.
Result: Found increasing AI use for pedagogy, content development, and teaching media, but adoption is uneven: elementary teachers use AI more consistently than senior high teachers; mid-career teachers assign higher importance to AI; teachers in Eastern Indonesia perceive greater value; primary use is to reduce instructional preparation workload.
Conclusion: While AI adoption is growing in Indonesian classrooms, generic outputs, infrastructure constraints, and limited contextual alignment hinder effective integration, highlighting the need for context-aware AI systems and targeted teacher support.
Abstract: Despite emerging use in Indonesian classrooms, there is limited large-scale, teacher-centred evidence on how AI is used in practice and what support teachers need, hindering the development of context-appropriate AI systems and policies. To address this gap, we conduct a nationwide survey of 349 K-12 teachers across elementary, junior high, and senior high schools. We find increasing use of AI for pedagogy, content development, and teaching media, although adoption remains uneven. Elementary teachers report more consistent use, while senior high teachers engage less; mid-career teachers assign higher importance to AI, and teachers in Eastern Indonesia perceive greater value. Across levels, teachers primarily use AI to reduce instructional preparation workload (e.g., assessment, lesson planning, and material development). However, generic outputs, infrastructure constraints, and limited contextual alignment continue to hinder effective classroom integration.
[27] Fragile Reasoning: A Mechanistic Analysis of LLM Sensitivity to Meaning-Preserving Perturbations
Shou-Tzu Han, Rodrigue Rizk, KC Santosh
Main category: cs.CL
TL;DR: LLMs show fragility to semantic-preserving perturbations in math reasoning, with systematic evaluation revealing high answer-flip rates and architectural differences in failure mechanisms.
Details
Motivation: Large language models perform well on mathematical reasoning benchmarks but are surprisingly vulnerable to simple surface-level changes that preserve meaning, raising concerns about their robustness and reliability.Method: Systematically evaluated three LLMs (Mistral-7B, Llama-3-8B, Qwen2.5-7B) on 677 GSM8K problems with semantically equivalent variants using name substitution and number format paraphrasing. Introduced Mechanistic Perturbation Diagnostics (MPD) framework combining logit lens analysis, activation patching, component ablation, and novel Cascading Amplification Index (CAI) metric.
Result: All models showed substantial answer-flip rates (28.8%-45.1%), with number paraphrasing more disruptive than name swaps. CAI outperformed first divergence layer as failure predictor for two architectures. Logit lens showed flipped samples diverge earlier than stable ones. Activation patching revealed architectural divide: Llama-3 failures recoverable by patching (43/60), while Mistral and Qwen failures distributed (3/60 and 0/60).
Conclusion: Proposed mechanistic failure taxonomy (localized, distributed, entangled) validated through repair experiments: steering vectors and fine-tuning recovered 12.2% of localized failures (Llama-3) but only 7.2% of entangled (Qwen) and 5.2% of distributed (Mistral) failures, showing architectural differences in failure mechanisms.
Abstract: Large language models demonstrate strong performance on mathematical reasoning benchmarks, yet remain surprisingly fragile to meaning-preserving surface perturbations. We systematically evaluate three open-weight LLMs, Mistral-7B, Llama-3-8B, and Qwen2.5-7B, on 677 GSM8K problems paired with semantically equivalent variants generated through name substitution and number format paraphrasing. All three models exhibit substantial answer-flip rates (28.8%-45.1%), with number paraphrasing consistently more disruptive than name swaps. To trace the mechanistic basis of these failures, we introduce the Mechanistic Perturbation Diagnostics (MPD) framework, combining logit lens analysis, activation patching, component ablation, and the Cascading Amplification Index (CAI) into a unified diagnostic pipeline. CAI, a novel metric quantifying layer-wise divergence amplification, outperforms first divergence layer as a failure predictor for two of three architectures (AUC up to 0.679). Logit lens reveals that flipped samples diverge from correct predictions at significantly earlier layers than stable samples. Activation patching reveals a stark architectural divide in failure localizability: Llama-3 failures are recoverable by patching at specific layers (43/60 samples), while Mistral and Qwen failures are broadly distributed (3/60 and 0/60). Based on these diagnostic signals, we propose a mechanistic failure taxonomy (localized, distributed, and entangled) and validate it through targeted repair experiments: steering vectors and layer fine-tuning recover 12.2% of localized failures (Llama-3) but only 7.2% of entangled (Qwen) and 5.2% of distributed (Mistral) failures.
[28] What Do Claim Verification Datasets Actually Test? A Reasoning Trace Analysis
Delip Rao, Chris Callison-Burch
Main category: cs.CL
TL;DR: Analysis of reasoning patterns in claim verification benchmarks reveals dominance of direct evidence extraction and under-representation of complex reasoning like multi-sentence synthesis and numerical reasoning, with dataset-specific biases affecting evaluation of verification systems.
Details
Motivation: To systematically understand what reasoning capabilities are actually being tested in claim verification benchmarks, as current benchmarks may not adequately evaluate complex reasoning needed for real-world verification tasks.Method: Generated structured reasoning traces for 24K claim-verification examples across 9 datasets using GPT-4o-mini, then analyzed reasoning patterns and used a compact 1B-parameter reasoning verifier to characterize error types across different domains.
Result: Found direct evidence extraction dominates (77% of cases), while multi-sentence synthesis (4%) and numerical reasoning (2%) are severely under-represented. Dataset-level analysis shows stark biases: some datasets test almost exclusively lexical matching, while others require information synthesis in roughly half of cases. Error profiles vary dramatically by domain: general-domain verification dominated by lexical overlap bias, scientific verification by overcautiousness, and mathematical verification by arithmetic reasoning failures.
Conclusion: High benchmark scores primarily reflect retrieval-plus-entailment ability rather than complex reasoning. The paper outlines recommendations for building more challenging evaluation suites that better test the reasoning capabilities verification systems need for real-world applications.
Abstract: Despite rapid progress in claim verification, we lack a systematic understanding of what reasoning these benchmarks actually exercise. We generate structured reasoning traces for 24K claim-verification examples across 9 datasets using GPT-4o-mini and find that direct evidence extraction dominates, while multi-sentence synthesis and numerical reasoning are severely under-represented. A dataset-level breakdown reveals stark biases: some datasets almost exclusively test lexical matching, while others require information synthesis in roughly half of cases. Using a compact 1B-parameter reasoning verifier, we further characterize five error types and show that error profiles vary dramatically by domain – general-domain verification is dominated by lexical overlap bias, scientific verification by overcautiousness, and mathematical verification by arithmetic reasoning failures. Our findings suggest that high benchmark scores primarily reflect retrieval-plus-entailment ability. We outline recommendations for building more challenging evaluation suites that better test the reasoning capabilities verification systems need.
[29] PRCCF: A Persona-guided Retrieval and Causal-aware Cognitive Filtering Framework for Emotional Support Conversation
Yanxin Luo, Xiaoyu Zhang, Jing Li, Yan Gao, Donghong Han
Main category: cs.CL
TL;DR: PRCCF is a framework for emotional support conversations that uses persona-guided retrieval and causality-aware cognitive filtering to generate more empathetic and contextually appropriate responses.
Details
Motivation: Existing emotional support conversation methods struggle with deep contextual understanding, limiting their ability to provide effective emotional support. The authors aim to enhance response generation by better incorporating persona information and causal reasoning.Method: Proposes PRCCF framework with two key components: 1) Persona-guided retrieval mechanism that jointly models semantic compatibility and persona alignment, and 2) Causality-aware cognitive filtering module that prioritizes causally relevant external knowledge for better emotional reasoning.
Result: Extensive experiments on ESConv dataset show PRCCF outperforms state-of-the-art baselines on both automatic metrics and human evaluations.
Conclusion: The proposed framework effectively enhances emotional support conversation by improving contextual cognitive understanding through persona alignment and causal reasoning.
Abstract: Emotional Support Conversation (ESC) aims to alleviate individual emotional distress by generating empathetic responses. However, existing methods face challenges in effectively supporting deep contextual understanding. To address this issue, we propose PRCCF, a Persona-guided Retrieval and Causality-aware Cognitive Filtering framework. Specifically, the framework incorporates a persona-guided retrieval mechanism that jointly models semantic compatibility and persona alignment to enhance response generation. Furthermore, it employs a causality-aware cognitive filtering module to prioritize causally relevant external knowledge, thereby improving contextual cognitive understanding for emotional reasoning. Extensive experiments on the ESConv dataset demonstrate that PRCCF outperforms state-of-the-art baselines on both automatic metrics and human evaluations. Our code is publicly available at: https://github.com/YancyLyx/PRCCF.
[30] PRISM: Probability Reallocation with In-Span Masking for Knowledge-Sensitive Alignment
Chenning Xu, Mao Zheng, Mingyang Song
Main category: cs.CL
TL;DR: PRISM is a risk-gated fine-tuning framework that reduces hallucinations in language models by using sentence-level factuality risk labels and inter-sentence dependencies to selectively penalize overconfident predictions on factually unsupported tokens.
Details
Motivation: Standard supervised fine-tuning with token-level hard labels can cause language models to become overconfident and imitate factually unsupported targets, leading to hallucinations that propagate in multi-sentence generation tasks.Method: PRISM uses coarse sentence-level factuality risk labels and inter-sentence dependency annotations to create a differentiable risk-gated framework. It modifies learning only at fact-critical positions through a lightweight probability reallocation objective that penalizes high-confidence predictions on risky target tokens, with scope controlled by span-level risk weights and model-aware gating.
Result: Experiments on hallucination-sensitive factual benchmarks show PRISM improves factual aggregates across different model backbones while maintaining competitive overall capabilities. Ablations reveal the auxiliary signal is most effective when used conservatively, and that knowledge masking and model-aware reallocation play complementary roles.
Conclusion: PRISM provides an effective framework for reducing hallucinations in language models by selectively targeting fact-critical positions during fine-tuning, balancing factual correction with capability preservation through risk-gated probability reallocation.
Abstract: Supervised fine-tuning (SFT) with token-level hard labels can amplify overconfident imitation of factually unsupported targets, causing hallucinations that propagate in multi-sentence generation. We study an augmented SFT setting in which training instances include coarse sentence-level factuality risk labels and inter-sentence dependency annotations, providing structured signals about where factual commitments are weakly supported. We propose \textbf{PRISM}, a differentiable risk-gated framework that modifies learning only at fact-critical positions. PRISM augments standard SFT with a lightweight, model-aware probability reallocation objective that penalizes high-confidence predictions on risky target tokens, with its scope controlled by span-level risk weights and model-aware gating. Experiments on hallucination-sensitive factual benchmarks and general evaluations show that PRISM improves factual aggregates across backbones while maintaining a competitive overall capability profile. Ablations further show that the auxiliary signal is most effective when used conservatively, and that knowledge masking and model-aware reallocation play complementary roles in balancing factual correction and capability preservation.
[31] On the Role of Reasoning Patterns in the Generalization Discrepancy of Long Chain-of-Thought Supervised Fine-Tuning
Zhaoyi Li, Xiangyu Xi, Zhengyu Chen, Wei Wang, Gangwei Jiang, Ranran Shen, Linqi Song, Ying Wei, Defu Lian
Main category: cs.CL
TL;DR: SFT on different CoT trajectory sources shows a paradox: lower training loss doesn’t guarantee better generalization; DeepSeek-R1 data has lower loss but worse generalization than gpt-oss-120b data due to divergent reasoning patterns.
Details
Motivation: To understand how Chain-of-Thought trajectories from different sources influence the generalization performance of large reasoning models, particularly when there's a paradox where lower training loss doesn't translate to better generalization.Method: Comparative study using verified CoT trajectories from DeepSeek-R1-0528 and gpt-oss-120b with identical problem sets, followed by multi-faceted analysis of token-level SFT loss and step-level reasoning behaviors, and proposing trajectory filtering based on branching patterns.
Result: DeepSeek-R1 data achieves lower training loss but significantly worse generalization on reasoning benchmarks; gpt-oss-120b exhibits convergent deductive reasoning while DeepSeek-R1 shows divergent branch-heavy exploration; filtering branching trajectories improves DeepSeek-R1 performance by up to 5.5% on benchmarks.
Conclusion: The quality of reasoning patterns in CoT trajectories matters more than training loss for generalization; divergent exploration behaviors can hinder reasoning performance, and simple filtering of branching trajectories can significantly improve SFT outcomes.
Abstract: Supervised Fine-Tuning (SFT) on long Chain-of-Thought (CoT) trajectories has become a pivotal phase in building large reasoning models. However, how CoT trajectories from different sources influence the generalization performance of models remains an open question. In this paper, we conduct a comparative study using two sources of verified CoT trajectories generated by two competing models, \texttt{DeepSeek-R1-0528} and \texttt{gpt-oss-120b}, with their problem sets controlled to be identical. Despite their comparable performance, we uncover a striking paradox: lower training loss does not translate to better generalization. SFT on \texttt{DeepSeek-R1-0528} data achieves remarkably lower training loss, yet exhibits significantly worse generalization performance on reasoning benchmarks compared to those trained on \texttt{gpt-oss-120b}. To understand this paradox, we perform a multi-faceted analysis probing token-level SFT loss and step-level reasoning behaviors. Our analysis reveals a difference in reasoning patterns. \texttt{gpt-oss-120b} exhibits highly convergent and deductive trajectories, whereas \texttt{DeepSeek-R1-0528} favors a divergent and branch-heavy exploration pattern. Consequently, models trained with \texttt{DeepSeek-R1} data inherit inefficient exploration behaviors, often getting trapped in redundant exploratory branches that hinder them from reaching correct solutions. Building upon this insight, we propose a simple yet effective remedy of filtering out frequently branching trajectories to improve the generalization of SFT. Experiments show that training on selected \texttt{DeepSeek-R1-0528} subsets surprisingly improves reasoning performance by up to 5.1% on AIME25, 5.5% on BeyondAIME, and on average 3.6% on five benchmarks.
[32] Development and multi-center evaluation of domain-adapted speech recognition for human-AI teaming in real-world gastrointestinal endoscopy
Ruijie Yang, Yan Zhu, Peiyao Fu, Te Luo, Zhihua Wang, Xian Yang, Quanlin Li, Pinghong Zhou, Shuo Wang
Main category: cs.CL
TL;DR: EndoASR: A domain-adapted automatic speech recognition system for gastrointestinal endoscopy that improves transcription accuracy and clinical usability through synthetic data adaptation and noise robustness techniques.
Details
Motivation: Current ASR systems are unreliable in real-world clinical endoscopy settings due to domain-specific terminology and complex acoustic conditions, limiting their effectiveness for human-AI interaction in medical workflows.Method: Two-stage adaptation strategy using synthetic endoscopy reports for domain-specific language modeling and noise robustness, with real-time deployment capabilities and integration with large language models.
Result: Substantially improved transcription accuracy (CER reduced from 20.52% to 14.14%) and medical term accuracy (increased from 54.30% to 87.59%), with consistent generalization across multi-center studies and real-time performance (RTF 0.005).
Conclusion: Domain-adapted ASR can serve as a reliable interface for human-AI teaming in gastrointestinal endoscopy, with validated performance across real-world clinical settings and enhanced downstream information extraction.
Abstract: Automatic speech recognition (ASR) is a critical interface for human-AI interaction in gastrointestinal endoscopy, yet its reliability in real-world clinical settings is limited by domain-specific terminology and complex acoustic conditions. Here, we present EndoASR, a domain-adapted ASR system designed for real-time deployment in endoscopic workflows. We develop a two-stage adaptation strategy based on synthetic endoscopy reports, targeting domain-specific language modeling and noise robustness. In retrospective evaluation across six endoscopists, EndoASR substantially improves both transcription accuracy and clinical usability, reducing character error rate (CER) from 20.52% to 14.14% and increasing medical term accuracy (Med ACC) from 54.30% to 87.59%. In a prospective multi-center study spanning five independent endoscopy centers, EndoASR demonstrates consistent generalization under heterogeneous real-world conditions. Compared with the baseline Paraformer model, CER is reduced from 16.20% to 14.97%, while Med ACC is improved from 61.63% to 84.16%, confirming its robustness in practical deployment scenarios. Notably, EndoASR achieves a real-time factor (RTF) of 0.005, significantly faster than Whisper-large-v3 (RTF 0.055), while maintaining a compact model size of 220M parameters, enabling efficient edge deployment. Furthermore, integration with large language models demonstrates that improved ASR quality directly enhances downstream structured information extraction and clinician-AI interaction. These results demonstrate that domain-adapted ASR can serve as a reliable interface for human-AI teaming in gastrointestinal endoscopy, with consistent performance validated across multi-center real-world clinical settings.
[33] Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework
Yanchen Wu, Tenghui Lin, Yingli Zhou, Fangyuan Zhang, Qintian Guo, Xun Zhou, Sibo Wang, Xilin Liu, Yuchi Ma, Yixiang Fang
Main category: cs.CL
TL;DR: A comprehensive survey and experimental comparison of memory methods for LLM-based agents, proposing a unified framework and introducing a new memory method that outperforms existing approaches.
Details
Motivation: Memory is crucial for LLM-based agents handling long-horizon complex tasks, but existing memory methods haven't been systematically compared under consistent experimental settings, making it difficult to understand their relative effectiveness.Method: 1) Proposes a unified framework incorporating all existing agent memory methods; 2) Conducts extensive experimental comparison of representative methods on two benchmarks; 3) Designs a new memory method by combining effective modules from existing approaches.
Result: The new memory method outperforms state-of-the-art approaches. The comprehensive analysis provides insights into the effectiveness of different memory techniques and their behavior in various scenarios.
Conclusion: Systematic comparison reveals strengths and weaknesses of existing memory methods, and the proposed new method demonstrates superior performance. The work provides valuable insights for future research on memory in LLM-based agents.
Abstract: Memory emerges as the core module in the large language model (LLM)-based agents for long-horizon complex tasks (e.g., multi-turn dialogue, game playing, scientific discovery), where memory can enable knowledge accumulation, iterative reasoning and self-evolution. A number of memory methods have been proposed in the literature. However, these methods have not been systematically and comprehensively compared under the same experimental settings. In this paper, we first summarize a unified framework that incorporates all the existing agent memory methods from a high-level perspective. We then extensively compare representative agent memory methods on two well-known benchmarks and examine the effectiveness of all methods, providing a thorough analysis of those methods. As a byproduct of our experimental analysis, we also design a new memory method by exploiting modules in the existing methods, which outperforms the state-of-the-art methods. Finally, based on these findings, we offer promising future research opportunities. We believe that a deeper understanding of the behavior of existing methods can provide valuable new insights for future research.
[34] Human-Guided Reasoning with Large Language Models for Vietnamese Speech Emotion Recognition
Truc Nguyen, Then Tran, Binh Truong, Phuoc Nguyen T. H
Main category: cs.CL
TL;DR: A human-machine collaborative framework for Vietnamese Speech Emotion Recognition that combines acoustic feature models with LLM-based reasoning, using confidence-based routing to handle ambiguous cases and iterative refinement for continuous improvement.
Details
Motivation: Vietnamese Speech Emotion Recognition faces challenges due to ambiguous acoustic patterns and lack of reliable annotated data, especially in real-world conditions where emotional boundaries are not clearly separable. Current data-driven models struggle with these ambiguous cases.Method: Proposes a human-machine collaborative framework integrating human knowledge into learning. Uses acoustic feature-based models to provide confidence scores and feature-level evidence, with confidence-based routing to distinguish easy vs. ambiguous samples. Uncertain cases are delegated to LLMs for deeper reasoning guided by structured rules from human annotation behavior. Includes iterative refinement through error analysis and rule updates.
Result: Achieves strong performance on Vietnamese speech dataset (2,764 samples, 3 emotion classes) with up to 86.59% accuracy and Macro F1 around 0.85-0.86. High inter-annotator agreement (Fleiss Kappa = 0.8574) ensures reliable ground truth.
Conclusion: The framework effectively handles ambiguous and hard-to-classify cases in speech emotion recognition, demonstrating the importance of combining data-driven models with human reasoning. Provides a robust, model-agnostic approach suitable for low-resource settings.
Abstract: Vietnamese Speech Emotion Recognition (SER) remains challenging due to ambiguous acoustic patterns and the lack of reliable annotated data, especially in real-world conditions where emotional boundaries are not clearly separable. To address this problem, this paper proposes a human-machine collaborative framework that integrates human knowledge into the learning process rather than relying solely on data-driven models. The proposed framework is centered around LLM-based reasoning, where acoustic feature-based models are used to provide auxiliary signals such as confidence and feature-level evidence. A confidence-based routing mechanism is introduced to distinguish between easy and ambiguous samples, allowing uncertain cases to be delegated to LLMs for deeper reasoning guided by structured rules derived from human annotation behavior. In addition, an iterative refinement strategy is employed to continuously improve system performance through error analysis and rule updates. Experiments are conducted on a Vietnamese speech dataset of 2,764 samples across three emotion classes (calm, angry, panic), with high inter-annotator agreement (Fleiss Kappa = 0.8574), ensuring reliable ground truth. The proposed method achieves strong performance, reaching up to 86.59% accuracy and Macro F1 around 0.85-0.86, demonstrating its effectiveness in handling ambiguous and hard-to-classify cases. Overall, this work highlights the importance of combining data-driven models with human reasoning, providing a robust and model-agnostic approach for speech emotion recognition in low-resource settings.
[35] Detecting Toxic Language: Ontology and BERT-based Approaches for Bulgarian Text
Melania Berbatova, Tsvetoslav Vasev
Main category: cs.CL
TL;DR: A Bulgarian toxic content detection system using BERT that distinguishes toxic language from medical terms and minority group references to avoid over-blocking valuable information.
Details
Motivation: Current toxic content detection systems often over-block by incorrectly flagging medical terminology and minority group references as toxic, preventing access to valuable information. There's a need for more nuanced detection that can distinguish actual toxicity from important non-toxic content.Method: 1) Created an ontology modeling potentially toxic words in Bulgarian; 2) Built a dataset of 4,384 manually annotated sentences from Bulgarian online forums across four categories (toxic language, medical terminology, non-toxic language, minority community terms); 3) Trained a BERT-based model for toxic language classification.
Result: The BERT-based model achieved a 0.89 F1 macro score for toxic language classification. The trained model is ready for real-world deployment and can be integrated into toxic content detection systems.
Conclusion: The paper presents a successful approach to nuanced toxic content detection in Bulgarian that avoids over-blocking important information like medical terms and minority group references, with a high-performing model ready for practical application.
Abstract: Toxic content detection in online communication remains a significant challenge, with current solutions often inadvertently blocking valuable information, including medical terms and text related to minority groups. This paper presents a more nu-anced approach to identifying toxicity in Bulgarian text while preserving access to essential information. The research explores two distinct methodologies for detecting toxic content. The developed methodologies have po-tential applications across diverse online platforms and content moderation systems. First, we propose an ontology that models the potentially toxic words in Bulgarian language. Then, we compose a dataset that comprises 4,384 manually anno-tated sentences from Bulgarian online forums across four categories: toxic language, medical terminology, non-toxic lan-guage, and terms related to minority communities. We then train a BERT-based model for toxic language classification, which reaches a 0.89 F1 macro score. The trained model is directly applicable in a real environment and can be integrated as a com-ponent of toxic content detection systems.
[36] LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches
Linyang He, Qiyao Yu, Hanze Dong, Baohao Liao, Xinxing Xu, Micah Goldblum, Jiang Bian, Nima Mesgarani
Main category: cs.CL
TL;DR: LiveMathematicianBench: A dynamic benchmark for research-level mathematical reasoning using recent arXiv papers, featuring proof-sketch-guided distractors and substitution-resistant evaluation to test genuine understanding beyond memorization.
Details
Motivation: Existing mathematical reasoning benchmarks are limited by synthetic settings and data contamination. There's a need for rigorous evaluation of LLMs' mathematical capabilities in realistic, research-level contexts beyond memorized patterns.Method: Built from recent arXiv papers published after model training cutoffs, with 13-category logical taxonomy of theorem types. Uses proof-sketch-guided distractor pipeline to construct plausible but invalid answer choices. Implements substitution-resistant mechanism to distinguish answer recognition from substantive reasoning.
Result: Benchmark is far from saturated: Gemini-3.1-pro-preview achieves only 43.5%. Under substitution-resistant evaluation, accuracy drops sharply (GPT-5.4: 30.6%, Gemini-3.1-pro-preview: 17.6%). Proof-sketch access yields consistent accuracy gains.
Conclusion: LiveMathematicianBench offers a scalable, contamination-resistant testbed for studying research-level mathematical reasoning in LLMs, revealing current models’ limitations in genuine mathematical understanding.
Abstract: Mathematical reasoning is a hallmark of human intelligence, and whether large language models (LLMs) can meaningfully perform it remains a central question in artificial intelligence and cognitive science. As LLMs are increasingly integrated into scientific workflows, rigorous evaluation of their mathematical capabilities becomes a practical necessity. Existing benchmarks are limited by synthetic settings and data contamination. We present LiveMathematicianBench, a dynamic multiple-choice benchmark for research-level mathematical reasoning built from recent arXiv papers published after model training cutoffs. By grounding evaluation in newly published theorems, it provides a realistic testbed beyond memorized patterns. The benchmark introduces a thirteen-category logical taxonomy of theorem types (e.g., implication, equivalence, existence, uniqueness), enabling fine-grained evaluation across reasoning forms. It employs a proof-sketch-guided distractor pipeline that uses high-level proof strategies to construct plausible but invalid answer choices reflecting misleading proof directions, increasing sensitivity to genuine understanding over surface-level matching. We also introduce a substitution-resistant mechanism to distinguish answer recognition from substantive reasoning. Evaluation shows the benchmark is far from saturated: Gemini-3.1-pro-preview, the best model, achieves only 43.5%. Under substitution-resistant evaluation, accuracy drops sharply: GPT-5.4 scores highest at 30.6%, while Gemini-3.1-pro-preview falls to 17.6%, below the 20% random baseline. A dual-mode protocol reveals that proof-sketch access yields consistent accuracy gains, suggesting models can leverage high-level proof strategies for reasoning. Overall, LiveMathematicianBench offers a scalable, contamination-resistant testbed for studying research-level mathematical reasoning in LLMs.
[37] Taming CATS: Controllable Automatic Text Simplification through Instruction Fine-Tuning with Control Tokens
Hanna Hubarava, Yingqiang Gao
Main category: cs.CL
TL;DR: Instruction fine-tuning with discrete control tokens enables controllable text simplification across readability levels and compression rates, but controllability depends on training data variation and standard metrics fail to measure control effectively.
Details
Motivation: Current controllable automatic text simplification (CATS) approaches treat controllability as a decoding problem and use evaluation metrics that don't properly measure control, with controllability being constrained by data and evaluation limitations.Method: Domain-agnostic CATS framework using instruction fine-tuning with discrete control tokens to steer open-source models (Llama, Mistral, Qwen; 1-14B) to target readability levels (FKGL, ARI, Dale-Chall) and compression rates across four domains.
Result: Smaller models (1-3B) can be competitive, but reliable controllability depends on sufficient variation in training data; readability control is learned consistently while compression control underperforms due to limited signal variability in existing corpora.
Conclusion: Standard simplification metrics are insufficient for measuring control, motivating error-based measures for target-output alignment, and naive data splits can introduce distributional mismatch that undermines both training and evaluation.
Abstract: Controllable Automatic Text Simplification (CATS) produces user-tailored outputs, yet controllability is often treated as a decoding problem and evaluated with metrics that are not reflective to the measure of control. We observe that controllability in ATS is significantly constrained by data and evaluation. To this end, we introduce a domain-agnostic CATS framework based on instruction fine-tuning with discrete control tokens, steering open-source models to target readability levels and compression rates. Across three model families with different model sizes (Llama, Mistral, Qwen; 1-14B) and four domains (medicine, public administration, news, encyclopedic text), we find that smaller models (1-3B) can be competitive, but reliable controllability strongly depends on whether the training data encodes sufficient variation in the target attribute. Readability control (FKGL, ARI, Dale-Chall) is learned consistently, whereas compression control underperforms due to limited signal variability in the existing corpora. We further show that standard simplification and similarity metrics are insufficient for measuring control, motivating error-based measures for target-output alignment. Finally, our sampling and stratification experiments demonstrate that naive splits can introduce distributional mismatch that undermines both training and evaluation.
[38] DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment
Liang Zhu, Feiteng Fang, Yuelin Bai, Longze Chen, Zhexiang Zhang, Minghuan Tan, Min Yang
Main category: cs.CL
TL;DR: DEFT is an efficient alignment framework that uses distribution-guided data filtering to enhance RLHF methods, improving both alignment and generalization while reducing training time.
Details
Motivation: Current RLHF methods like PPO are costly and unstable, while alternatives require large datasets and may weaken LLM generalization. There's a need for more efficient alignment that preserves generalization ability.Method: DEFT calculates differential distribution rewards based on model output distribution and preference data discrepancy, filters a small high-quality data subset, and incorporates this into existing alignment methods to guide model output distribution.
Result: Methods enhanced by DEFT outperform original methods in both alignment capability and generalization ability, with significantly reduced training time.
Conclusion: DEFT provides an efficient framework for LLM alignment that improves performance while reducing computational costs and preserving generalization ability.
Abstract: Reinforcement Learning from Human Feedback (RLHF), using algorithms like Proximal Policy Optimization (PPO), aligns Large Language Models (LLMs) with human values but is costly and unstable. Alternatives have been proposed to replace PPO or integrate Supervised Fine-Tuning (SFT) and contrastive learning for direct fine-tuning and value alignment. However, these methods still require voluminous data to learn preferences and may weaken the generalization ability of LLMs. To further enhance alignment efficiency and performance while mitigating the loss of generalization ability, this paper introduces Distribution-guided Efficient Fine-Tuning (DEFT), an efficient alignment framework incorporating data filtering and distributional guidance by calculating the differential distribution reward based on the output distribution of language model and the discrepancy distribution of preference data. A small yet high-quality subset is filtered from the raw data using a differential distribution reward, which is then incorporated into existing alignment methods to guide the model’s output distribution. Experimental results demonstrate that the methods enhanced by DEFT outperform the original methods in both alignment capability and generalization ability, with significantly reduced training time.
[39] PLOT: Enhancing Preference Learning via Optimal Transport
Liang Zhu, Yuelin Bai, Xiankun Ren, Jiaxi Yang, Lei Zhang, Feiteng Fang, Hamid Alinejad-Rokny, Minghuan Tan, Min Yang
Main category: cs.CL
TL;DR: PLOT introduces a token-level optimal transport loss for LLM preference learning that improves alignment performance while maintaining model stability and fluency.
Details
Motivation: Existing preference learning methods for LLMs have limitations including modest performance gains, high computational costs, hyperparameter sensitivity, and insufficient modeling of global token-level relationships.Method: PLOT formulates preference learning as an Optimal Transport Problem, using a token-level loss derived from Optimal Transport that aligns model outputs with human preferences while preserving the original LLM distribution. It leverages token embeddings to capture semantic relationships for globally informed optimization.
Result: Experiments across two preference categories (Human Values and Logic & Problem Solving) spanning seven subpreferences demonstrate that PLOT consistently improves alignment performance while maintaining fluency and coherence.
Conclusion: Optimal transport provides a principled methodology for preference learning, establishing a theoretically grounded framework that offers new insights for LLM preference learning.
Abstract: Preference learning in Large Language Models (LLMs) has advanced significantly, yet existing methods remain limited by modest performance gains, high computational costs, hyperparameter sensitivity, and insufficient modeling of global token-level relationships. We introduce PLOT, which enhances Preference Learning in fine-tuning-based alignment through a token-level loss derived from Optimal Transport. By formulating preference learning as an Optimal Transport Problem, PLOT aligns model outputs with human preferences while preserving the original distribution of LLMs, ensuring stability and robustness. Furthermore, PLOT leverages token embeddings to capture semantic relationships, enabling globally informed optimization. Experiments across two preference categories - Human Values and Logic & Problem Solving - spanning seven subpreferences demonstrate that PLOT consistently improves alignment performance while maintaining fluency and coherence. These results substantiate optimal transport as a principled methodology for preference learning, establishing a theoretically grounded framework that provides new insights for preference learning of LLMs.
[40] From Guessing to Placeholding: A Cost-Theoretic Framework for Uncertainty-Aware Code Completion
Liang Zhu, Haolin Chen, Lidong Zhao, Xian Wu
Main category: cs.CL
TL;DR: APC introduces adaptive placeholder completion for code generation, strategically inserting placeholders at uncertain positions to reduce editing costs compared to traditional hard completion.
Details
Motivation: Traditional LLMs for code completion force generation of fully concrete code even with insufficient context, leading to high editing costs when models make erroneous predictions at specific token positions.Method: Proposes Adaptive Placeholder Completion (APC) framework that outputs explicit placeholders at high-entropy positions, formulates code completion as cost-minimization problem, constructs training data from real-world edit logs, and uses reinforcement learning with cost-based reward function.
Result: APC reduces expected editing costs from 19% to 50% across 1.5B-14B parameter models while preserving standard hard completion performance.
Conclusion: Provides theoretical foundation and practical training framework for uncertainty-aware code completion, showing adaptive abstention can be learned end-to-end without sacrificing conventional completion quality.
Abstract: While Large Language Models (LLMs) have demonstrated exceptional proficiency in code completion, they typically adhere to a Hard Completion (HC) paradigm, compelling the generation of fully concrete code even amidst insufficient context. Our analysis of 3 million real-world interactions exposes the limitations of this strategy: 61% of the generated suggestions were either edited after acceptance or rejected despite exhibiting over 80% similarity to the user’s subsequent code, suggesting that models frequently make erroneous predictions at specific token positions. Motivated by this observation, we propose Adaptive Placeholder Completion (APC), a collaborative framework that extends HC by strategically outputting explicit placeholders at high-entropy positions, allowing users to fill directly via IDE navigation. Theoretically, we formulate code completion as a cost-minimization problem under uncertainty. Premised on the observation that filling placeholders incurs lower cost than correcting errors, we prove the existence of a critical entropy threshold above which APC achieves strictly lower expected cost than HC. We instantiate this framework by constructing training data from filtered real-world edit logs and design a cost-based reward function for reinforcement learning. Extensive evaluations across 1.5B–14B parameter models demonstrate that APC reduces expected editing costs from 19% to 50% while preserving standard HC performance. Our work provides both a theoretical foundation and a practical training framework for uncertainty-aware code completion, demonstrating that adaptive abstention can be learned end-to-end without sacrificing conventional completion quality.
[41] Beyond Detection: Ethical Foundations for Automated Dyslexic Error Attribution
Samuel Rose, Debarati Chakraborty
Main category: cs.CL
TL;DR: Paper develops neural model for classifying spelling errors as dyslexic vs. non-dyslexic with 93% accuracy, while emphasizing ethical risks and deployment guidelines.
Details
Motivation: Prior work focused on error correction rather than attribution for dyslexic spelling errors, and neglected ethical risks like harmful labeling, algorithmic bias, and institutional misuse.Method: Formulates dyslexic error attribution as binary classification task; develops comprehensive feature set capturing orthographic, phonological, and morphological properties; proposes twin-input neural model evaluated against traditional ML baselines under writer-independent conditions.
Result: Neural model achieves 93.01% accuracy and 94.01% F1-score; phonetically plausible errors and vowel confusions emerge as strongest attribution signals; includes fairness analysis across subgroups.
Conclusion: Dyslexic error attribution is technically feasible at high accuracy, but feasibility alone is insufficient for deployment in high-stakes educational contexts; requires ethical frameworks with conditions for consent, transparency, human oversight, and recourse.
Abstract: Dyslexic spelling errors exhibit systematic phonological and orthographic patterns that distinguish them from the errors produced by typically developing writers. While this observation has motivated dyslexic-specific spell-checking and assistive writing tools, prior work has focused predominantly on error correction rather than attribution, and has largely neglected the ethical risks. The risk of harmful labelling, covert screening, algorithmic bias, and institutional misuse that automated classification of learners entails requires the development of robust ethical and legal frameworks for research in this area. This paper addresses both gaps. We formulate dyslexic error attribution as a binary classification task. Given a misspelt word and its correct target form, determine whether the error pattern is characteristic of a dyslexic or non-dyslexic writer. We develop a comprehensive feature set capturing orthographic, phonological, and morphological properties of each error, and propose a twin-input neural model evaluated against traditional machine learning baselines under writer-independent conditions. The neural model achieves 93.01% accuracy and an F1-score of 94.01%, with phonetically plausible errors and vowel confusions emerging as the strongest attribution signals. We situate these technical results within an explicit ethics-first framework, analysing fairness across subgroups, the interpretability requirements of educational deployment, and the conditions, consent, transparency, human oversight, and recourse, under which a system could be responsibly used. We provide concrete guidelines for ethical deployment and an open discussion of the systems limitations and misuse potential. Our results demonstrate that dyslexic error attribution is feasible at high accuracy while underscoring that feasibility alone is insufficient for deployment in high-stakes educational contexts.
[42] SURE: Synergistic Uncertainty-aware Reasoning for Multimodal Emotion Recognition in Conversations
Yiqiang Cai, Chengyan Wu, Bolei Ma, Bo Chen, Yun Xue, Julia Hirschberg, Ziwei Gong
Main category: cs.CL
TL;DR: SURE framework for multimodal emotion recognition in conversations improves robustness through uncertainty-aware noise handling and iterative contextual reasoning.
Details
Motivation: Existing MERC approaches focus on fusion but overlook uncertainty in noisy features and fine-grained reasoning, limiting robustness and contextual modeling.Method: Three-component framework: Uncertainty-Aware Mixture-of-Experts for modality-specific noise handling, Iterative Reasoning for multi-turn contextual reasoning, and Transformer Gate for intra/inter-modal interactions.
Result: SURE consistently outperforms state-of-the-art methods on benchmark MERC datasets, demonstrating effectiveness in robust multimodal reasoning.
Conclusion: Uncertainty modeling and iterative reasoning are crucial for advancing emotion recognition in conversational settings.
Abstract: Multimodal emotion recognition in conversations (MERC) requires integrating multimodal signals while being robust to noise and modeling contextual reasoning. Existing approaches often emphasize fusion but overlook uncertainty in noisy features and fine-grained reasoning. We propose SURE (Synergistic Uncertainty-aware REasoning) for MERC, a framework that improves robustness and contextual modeling. SURE consists of three components: an Uncertainty-Aware Mixture-of-Experts module to handle modality-specific noise, an Iterative Reasoning module for multi-turn reasoning over context, and a Transformer Gate module to capture intra- and inter-modal interactions. Experiments on benchmark MERC datasets show that SURE consistently outperforms state-of-the-art methods, demonstrating its effectiveness in robust multimodal reasoning. These results highlight the importance of uncertainty modeling and iterative reasoning in advancing emotion recognition in conversational settings.
[43] Is Clinical Text Enough? A Multimodal Study on Mortality Prediction in Heart Failure Patients
Oumaima El Khettari, Virgile Barthet, Guillaume Hocquet, Joconde Weller, Emmanuel Morin, Pierre Zweigenbaum
Main category: cs.CL
TL;DR: Transformer models for heart failure mortality prediction show multimodal fusion of clinical text and structured EHR data works best, while LLMs underperform for this clinical task.
Details
Motivation: Accurate short-term mortality prediction in heart failure is challenging with structured EHR data alone, prompting exploration of transformer-based approaches that leverage both text and structured data.Method: Evaluated transformer-based models on French HF cohort, comparing text-only, structured-only, multimodal, and LLM-based approaches with different fusion strategies and entity-level representations.
Result: Entity-level representations in clinical text improved predictions over CLS embeddings; supervised multimodal fusion of text and structured variables achieved best performance; LLMs performed inconsistently with text-only prompts working best.
Conclusion: Entity-aware multimodal transformers offer most reliable solution for HF outcome prediction, while current LLM prompting remains limited for clinical decision support despite their general capabilities.
Abstract: Accurate short-term mortality prediction in heart failure (HF) remains challenging, particularly when relying on structured electronic health record (EHR) data alone. We evaluate transformer-based models on a French HF cohort, comparing text-only, structured-only, multimodal, and LLM-based approaches. Our results show that enriching clinical text with entity-level representations improves prediction over CLS embeddings alone, and that supervised multimodal fusion of text and structured variables achieves the best overall performance. In contrast, large language models perform inconsistently across modalities and decoding strategies, with text-only prompts outperforming structured or multimodal inputs. These findings highlight that entity-aware multimodal transformers offer the most reliable solution for short-term HF outcome prediction, while current LLM prompting remains limited for clinical decision support.
[44] ImplicitBBQ: Benchmarking Implicit Bias in Large Language Models through Characteristic Based Cues
Bhaskara Hanuma Vedula, Darshan Anghan, Ishita Goyal, Ponnurangam Kumaraguru, Abhijnan Chakraborty
Main category: cs.CL
TL;DR: ImplicitBBQ benchmark evaluates LLM biases using characteristic-based cues (not just names) across multiple demographics, finding implicit bias 6x higher than explicit bias, with current mitigation strategies insufficient.
Details
Motivation: Current benchmarks use name-based proxies for bias detection, which have weak associations with many social demographics and cannot extend to dimensions like age or socioeconomic status. There's a need to evaluate implicit bias through characteristic-based cues that signal demographic identity indirectly.Method: Introduces ImplicitBBQ, a QA benchmark that evaluates implicit bias through characteristic-based cues across age, gender, region, religion, caste, and socioeconomic status. Evaluates 11 models comparing implicit vs. explicit bias, and tests mitigation strategies including safety prompting, chain-of-thought reasoning, and few-shot prompting.
Result: Implicit bias in ambiguous contexts is over six times higher than explicit bias in open weight models. Safety prompting and chain-of-thought reasoning fail to substantially close this gap. Few-shot prompting reduces implicit bias by 84%, but leaves caste bias at four times the level of any other dimension.
Conclusion: Current alignment and prompting strategies address surface-level bias evaluation but leave culturally grounded stereotypic associations largely unresolved. The ImplicitBBQ benchmark provides a tool for researchers to develop better mitigation techniques.
Abstract: Large Language Models increasingly suppress biased outputs when demographic identity is stated explicitly, yet may still exhibit implicit biases when identity is conveyed indirectly. Existing benchmarks use name based proxies to detect implicit biases, which carry weak associations with many social demographics and cannot extend to dimensions like age or socioeconomic status. We introduce ImplicitBBQ, a QA benchmark that evaluates implicit bias through characteristic based cues, culturally associated attributes that signal implicitly, across age, gender, region, religion, caste, and socioeconomic status. Evaluating 11 models, we find that implicit bias in ambiguous contexts is over six times higher than explicit bias in open weight models. Safety prompting and chain-of-thought reasoning fail to substantially close this gap; even few-shot prompting, which reduces implicit bias by 84%, leaves caste bias at four times the level of any other dimension. These findings indicate that current alignment and prompting strategies address the surface of bias evaluation while leaving culturally grounded stereotypic associations largely unresolved. We publicly release our code and dataset for model providers and researchers to benchmark potential mitigation techniques.
[45] Reliable News or Propagandist News? A Neurosymbolic Model Using Genre, Topic, and Persuasion Techniques to Improve Robustness in Classification
Géraud Faye, Benjamin Icard, Morgane Casanova, Guillaume Gadek, Guillaume Gravier, Wassila Ouerdane, Céline Hudelot, Sylvain Gatepaille, Paul Égré
Main category: cs.CL
TL;DR: Neurosymbolic approach combining text embeddings with symbolic features improves propaganda detection robustness and generalization
Details
Motivation: Propaganda news mixes biased messages with factual reports, making detection challenging. Existing LM-based approaches often overfit due to dataset biases, needing more robust methods that generalize to new sources.Method: Hybrid neurosymbolic approach combining non-contextual text embeddings (fastText) with symbolic conceptual features including genre, topic, and persuasion techniques for propaganda classification.
Result: Shows improvements over text-only methods, with ablation studies and explainability analyses confirming benefits of added symbolic features for robustness and generalization.
Conclusion: Symbolic conceptual features enhance propaganda detection by improving model robustness and generalization beyond text-only approaches, addressing overfitting issues in LM-based methods.
Abstract: Among news disorders, propagandist news are particularly insidious, because they tend to mix oriented messages with factual reports intended to look like reliable news. To detect propaganda, extant approaches based on Language Models such as BERT are promising but often overfit their training datasets, due to biases in data collection. To enhance classification robustness and improve generalization to new sources, we propose a neurosymbolic approach combining non-contextual text embeddings (fastText) with symbolic conceptual features such as genre, topic, and persuasion techniques. Results show improvements over equivalent text-only methods, and ablation studies as well as explainability analyses confirm the benefits of the added features. Keywords: Information disorder, Fake news, Propaganda, Classification, Topic modeling, Hybrid method, Neurosymbolic model, Ablation, Robustness
[46] How to measure the optimality of word or gesture order with respect to the principle of swap distance minimization
Ramon Ferrer-i-Cancho
Main category: cs.CL
TL;DR: A mathematical framework using permutohedron theory to analyze word order optimality in languages, showing crosslinguistic gestures are at least 77% optimal with respect to swap distance minimization.
Details
Motivation: To develop a mathematical framework for measuring how optimal word orders are with respect to minimizing swap distances between permutations, testing the hypothesis that languages tend to minimize these distances in communication systems.Method: Uses permutohedron theory to represent permutations as vertices in a graph, measures swap distances between word orders, and applies the quadratic assignment problem (QAP) as an umbrella framework for optimization problems in language.
Result: Crosslinguistic gestures are shown to be at least 77% optimal with respect to swap distance minimization, with results unlikely to be due to chance, establishing empirical evidence for optimization in language structure.
Conclusion: Establishes theoretical foundations for studying word order optimality, introduces QAP into language research, and postulates a general principle of optimal assignment unifying various linguistic optimization principles.
Abstract: The structure of all the permutations of a sequence can be represented as a permutohedron, a graph where vertices are permutations and two vertices are linked if a swap of adjacent elements in the permutation of one of the vertices produces the permutation of the other vertex. It has been hypothesized that word orders in languages minimize the swap distance in the permutohedron: given a source order, word orders that are closer in the permutohedron should be less costly and thus more likely. Here we explain how to measure the degree of optimality of word order variation with respect to swap distance minimization. We illustrate the power of our novel mathematical framework by showing that crosslinguistic gestures are at least $77%$ optimal. It is unlikely that the multiple times where crosslinguistic gestures hit optimality are due to chance. We establish the theoretical foundations for research on the optimality of word or gesture order with respect to swap distance minimization in communication systems. Finally, we introduce the quadratic assignment problem (QAP) into language research as an umbrella for multiple optimization problems and, accordingly, postulate a general principle of optimal assignment that unifies various linguistic principles including swap distance minimization.
[47] Diagnosing Translated Benchmarks: An Automated Quality Assurance Study of the EU20 Benchmark Suite
Klaudia Thellmann, Bernhard Stadler, Michael Färber
Main category: cs.CL
TL;DR: Automated quality assurance framework for machine-translated benchmark datasets using structural audits, neural quality metrics, and LLM-based error analysis to identify and prioritize translation issues at scale.
Details
Motivation: Machine-translated benchmark datasets offer cost and scale advantages but suffer from noise, structural loss, and quality issues that undermine confidence. Need scalable methods to measure and verify translation reliability rather than just translation capability.Method: Three-step automated quality assurance approach: 1) Structural corpus audit with targeted fixes, 2) Quality profiling using neural metrics (COMET reference-free and reference-based) with translation service comparisons (DeepL/ChatGPT/Google), 3) LLM-based span-level translation error landscape analysis.
Result: Consistent trends: datasets with lower COMET scores show higher share of accuracy/mistranslation errors at span level (notably HellaSwag; ARC is cleaner). Reference-based COMET on MMLU against human-edited samples confirms findings. Released cleaned/corrected EU20 datasets and reproducible code.
Conclusion: Automated quality assurance provides practical, scalable indicators to prioritize review, complementing but not replacing human gold standards for machine-translated benchmark evaluation.
Abstract: Machine-translated benchmark datasets reduce costs and offer scale, but noise, loss of structure, and uneven quality weaken confidence. What matters is not merely whether we can translate, but also whether we can measure and verify translation reliability at scale. We study translation quality in the EU20 benchmark suite, which comprises five established benchmarks translated into 20 languages, via a three-step automated quality assurance approach: (i) a structural corpus audit with targeted fixes; (ii) quality profiling using a neural metric (COMET, reference-free and reference-based) with translation service comparisons (DeepL / ChatGPT / Google); and (iii) an LLM-based span-level translation error landscape. Trends are consistent: datasets with lower COMET scores exhibit a higher share of accuracy/mistranslation errors at span level (notably HellaSwag; ARC is comparatively clean). Reference-based COMET on MMLU against human-edited samples points in the same direction. We release cleaned/corrected versions of the EU20 datasets, and code for reproducibility. In sum, automated quality assurance offers practical, scalable indicators that help prioritize review – complementing, not replacing, human gold standards.
[48] SAFE: Stepwise Atomic Feedback for Error correction in Multi-hop Reasoning
Daeyong Kwon, Soyoung Yoon, Seung-won Hwang
Main category: cs.CL
TL;DR: SAFE is a dynamic benchmarking framework for multi-hop QA that replaces ungrounded Chain-of-Thought reasoning with verifiable sequences of grounded entities, using train-time verification to clean benchmarks and inference-time verification to detect ungrounded reasoning steps.
Details
Motivation: Current multi-hop QA benchmarks frequently reward LLMs for spurious correctness, masking ungrounded or flawed reasoning steps. There's a need to shift toward rigorous, verifiable reasoning rather than accepting ungrounded Chain-of-Thought outputs.Method: Two-phase framework: (1) Train-time verification establishes atomic error taxonomy and KG-grounded verification pipeline to eliminate noisy supervision, identifying unanswerable instances. (2) Inference-time verification uses feedback model trained on verified dataset to dynamically detect ungrounded steps in real-time.
Result: SAFE identifies up to 14% of instances in standard benchmarks as unanswerable, exposes critical flaws in existing benchmarks, and significantly outperforms standard baselines with average accuracy gain of 8.4 percentage points while guaranteeing verifiable trajectories.
Conclusion: SAFE provides a framework for rigorous reasoning verification in multi-hop QA, addressing spurious correctness issues in current benchmarks and enabling more reliable, verifiable reasoning processes in LLMs.
Abstract: Multi-hop QA benchmarks frequently reward Large Language Models (LLMs) for spurious correctness, masking ungrounded or flawed reasoning steps. To shift toward rigorous reasoning, we propose SAFE, a dynamic benchmarking framework that replaces the ungrounded Chain-of-Thought (CoT) with a strictly verifiable sequence of grounded entities. Our framework operates across two phases: (1) train-time verification, where we establish an atomic error taxonomy and a Knowledge Graph (KG)-grounded verification pipeline to eliminate noisy supervision in standard benchmarks, identifying up to 14% of instances as unanswerable, and (2) inference-time verification, where a feedback model trained on this verified dataset dynamically detects ungrounded steps in real-time. Experimental results demonstrate that SAFE not only exposes the critical flaws of existing benchmarks at train-time, but also significantly outperforms standard baselines, achieving an average accuracy gain of 8.4 pp while guaranteeing verifiable trajectories at inference-time.
[49] $k$NNProxy: Efficient Training-Free Proxy Alignment for Black-Box Zero-Shot LLM-Generated Text Detection
Kahim Wong, Kemou Li, Haiwei Wu, Jiantao Zhou
Main category: cs.CL
TL;DR: kNNProxy: Training-free proxy alignment framework for LLM-generated text detection using kNN-LM retrieval mechanism to align proxy LLMs without fine-tuning or API queries
Details
Motivation: Existing zero-shot LLM-generated text detectors rely on proxy LLMs aligned with unknown source LLMs, which rarely holds in real-world black-box scenarios. Current proxy alignment methods require supervised fine-tuning or repeated API interactions, increasing costs and limiting robustness.Method: Proposes kNNProxy framework that constructs lightweight datastore from target-reflective LGT corpus, uses kNN-LM retrieval to get token-level distributions interpolated with proxy outputs. Extends to mixture of proxies (MoP) for domain shift robustness by routing inputs to domain-specific datastores.
Result: Extensive experiments demonstrate strong detection performance. Method is training-free, query-efficient, and robust to domain shift without requiring proxy fine-tuning or per-token API outputs.
Conclusion: kNNProxy provides effective training-free proxy alignment for LLM-generated text detection, addressing limitations of existing methods while maintaining strong performance and robustness.
Abstract: LLM-generated text (LGT) detection is essential for reliable forensic analysis and for mitigating LLM misuse. Existing LGT detectors can generally be categorized into two broad classes: learning-based approaches and zero-shot methods. Compared with learning-based detectors, zero-shot methods are particularly promising because they eliminate the need to train task-specific classifiers. However, the reliability of zero-shot methods fundamentally relies on the assumption that an off-the-shelf proxy LLM is well aligned with the often unknown source LLM, a premise that rarely holds in real-world black-box scenarios. To address this discrepancy, existing proxy alignment methods typically rely on supervised fine-tuning of the proxy or repeated interactions with commercial APIs, thereby increasing deployment costs, exposing detectors to silent API changes, and limiting robustness under domain shift. Motivated by these limitations, we propose the $k$-nearest neighbor proxy ($k$NNProxy), a training-free and query-efficient proxy alignment framework that repurposes the $k$NN language model ($k$NN-LM) retrieval mechanism as a domain adapter for a fixed proxy LLM. Specifically, a lightweight datastore is constructed once from a target-reflective LGT corpus, either via fixed-budget querying or from existing datasets. During inference, nearest-neighbor evidence induces a token-level predictive distribution that is interpolated with the proxy output, yielding an aligned prediction without proxy fine-tuning or per-token API outputs. To improve robustness under domain shift, we extend $k$NNProxy into a mixture of proxies (MoP) that routes each input to a domain-specific datastore for domain-consistent retrieval. Extensive experiments demonstrate strong detection performance of our method.
[50] Why Gaussian Diffusion Models Fail on Discrete Data?
Alexander Shabalin, Simon Elistratov, Viacheslav Meshchaninov, Ildus Sadrtdinov, Dmitry Vetrov
Main category: cs.CL
TL;DR: Analysis of why Gaussian diffusion models struggle with discrete data, identifying a critical sampling interval where DDPM enters low-density regions between modes, and proposing solutions using self-conditioning and q-sampling.
Details
Motivation: Diffusion models work well for continuous data but face challenges with discrete data represented as mixtures of delta-distributions in continuous space. The paper aims to understand why DDPM struggles with discrete distributions and find solutions.Method: Uses a toy Random Hierarchy Model to analyze the problem, identifies a critical sampling interval where data density becomes multimodal, and proposes combining self-conditioning with switching from DDPM to q-sampling within this interval.
Result: Shows that DDPM enters low-density regions between modes in the critical interval, degrading sample quality. Demonstrates that self-conditioning and q-sampling help alleviate this issue, and their combination improves generation quality on real data including text, code, and proteins.
Conclusion: Identifies fundamental challenges in applying Gaussian diffusion to discrete data, provides theoretical understanding of the failure modes, and offers practical solutions that improve performance across multiple discrete domains.
Abstract: Diffusion models have become a standard approach for generative modeling in continuous domains, yet their application to discrete data remains challenging. We investigate why Gaussian diffusion models with the DDPM solver struggle to sample from discrete distributions that are represented as a mixture of delta-distributions in the continuous space. Using a toy Random Hierarchy Model, we identify a critical sampling interval in which the density of noisified data becomes multimodal. In this regime, DDPM occasionally enters low-density regions between modes producing out-of-distribution inputs for the model and degrading sample quality. We show that existing heuristics, including self-conditioning and a solver we term q-sampling, help alleviate this issue. Furthermore, we demonstrate that combining self-conditioning with switching from DDPM to q-sampling within the critical interval improves generation quality on real data. We validate these findings across conditional and unconditional tasks in multiple domains, including text, programming code, and proteins.
[51] BidirLM: From Text to Omnimodal Bidirectional Encoders by Adapting and Composing Causal LLMs
Nicolas Boizard, Théo Deschamps-Berger, Hippolyte Gisserot-Boukhlef, Céline Hudelot, Pierre Colombo
Main category: cs.CL
TL;DR: BidirLM transforms causal language models into bidirectional encoders through systematic adaptation methods, addressing limitations in current approaches and achieving strong performance on multimodal benchmarks.
Details
Motivation: Current approaches for transforming causal generative language models into bidirectional encoders lack consensus on optimal training objectives, suffer from catastrophic forgetting at scale, and fail to flexibly integrate specialized generative models.Method: Systematic ablations on Gemma3 and Qwen3 families to identify key adaptation factors; dual strategy combining linear weight merging with lightweight multi-domain data mixture to mitigate catastrophic forgetting; augmentation by merging with specialized causal models for modality-specific capabilities.
Result: BidirLM family of five encoders outperforms alternatives on text, vision, and audio representation benchmarks, demonstrating effective multimodal capabilities.
Conclusion: The proposed open-source recipe successfully transforms any causal decoder LLM into powerful bidirectional encoders with strong multimodal representation capabilities.
Abstract: Transforming causal generative language models into bidirectional encoders offers a powerful alternative to BERT-style architectures. However, current approaches remain limited: they lack consensus on optimal training objectives, suffer from catastrophic forgetting at scale, and fail to flexibly integrate the vast ecosystem of specialized generative models. In this work, through systematic ablations on the Gemma3 and Qwen3 families, we identify the key factors driving successful adaptation, highlighting the critical role of an often-omitted prior masking phase. To scale this process without original pre-training data, we introduce a dual strategy combining linear weight merging with a lightweight multi-domain data mixture that mitigates catastrophic forgetting. Finally, we augment our encoders by merging them with specialized causal models, seamlessly transferring modality- and domain-specific capabilities. This open-source recipe, designed for any causal decoder LLM, yields BidirLM, a family of five encoders that outperform alternatives on text, vision, and audio representation benchmarks.
[52] Goose: Anisotropic Speculation Trees for Training-Free Speculative Decoding
Tao Jin, Phuong Minh Nguyen, Naoya Inoue
Main category: cs.CL
TL;DR: GOOSE introduces an adaptive spine tree structure for speculative decoding that organizes tokens based on quality differences between context-matched tokens (high acceptance) and statistical predictions (low acceptance), achieving significant speedups over balanced-tree approaches.
Details
Motivation: Existing speculative decoding methods use balanced trees that don't distinguish between token sources with dramatically different acceptance rates, limiting optimization potential. The paper aims to leverage quality differences between context-matched tokens (high acceptance) and statistical predictions (low acceptance) to design more efficient tree structures.Method: GOOSE builds an adaptive spine tree with a deep chain of high-acceptance context-matched tokens as the spine, and wide branches of low-acceptance statistical predictions at each node. This anisotropic (asymmetric) structure breaks through the depth limits of balanced trees by organizing tokens based on their expected acceptance rates.
Result: On five LLMs (7B-33B) and five benchmarks, GOOSE achieves 1.9-4.3x lossless speedup, outperforming balanced-tree baselines by 12-33% under the same verification budget. The method proves that the number of tokens accepted per step is at least as large as using either source alone.
Conclusion: The paper demonstrates that recognizing and exploiting quality differences between token sources enables more efficient speculative decoding tree structures, with the adaptive spine tree approach significantly outperforming traditional balanced-tree methods.
Abstract: Speculative decoding accelerates large language model inference by drafting multiple candidate tokens and verifying them in a single forward pass. Candidates are organized as a tree: deeper trees accept more tokens per step, but adding depth requires sacrificing breadth (fallback options) under a fixed verification budget. Existing training-free methods draft from a single token source and shape their trees without distinguishing candidate quality across origins. We observe that two common training-free token sources - n-gram matches copied from the input context, and statistical predictions from prior forward passes - differ dramatically in acceptance rate (~6x median gap, range 2-18x across five models and five benchmarks). We prove that when such a quality gap exists, the optimal tree is anisotropic (asymmetric): reliable tokens should form a deep chain while unreliable tokens spread as wide branches, breaking through the depth limit of balanced trees. We realize this structure in GOOSE, a training-free framework that builds an adaptive spine tree - a deep chain of high-acceptance context-matched tokens with wide branches of low-acceptance alternatives at each node. We prove that the number of tokens accepted per step is at least as large as that of either source used alone. On five LLMs (7B-33B) and five benchmarks, GOOSE achieves 1.9-4.3x lossless speedup, outperforming balanced-tree baselines by 12-33% under the same budget.
[53] Optimizing RAG Rerankers with LLM Feedback via Reinforcement Learning
Yuhang Wu, Xiangqing Shen, Fanfan Wang, Cangqi Zhou, Zhen Wu, Xinyu Dai, Rui Xia
Main category: cs.CL
TL;DR: RRPO is a reinforcement learning framework that aligns document reranking with LLM generation quality using LLM feedback instead of static human annotations.
Details
Motivation: Current reranking models are optimized on static human annotations decoupled from downstream generation, causing misalignment between topical relevance and actual utility for LLM answer generation.Method: ReRanking Preference Optimization (RRPO) formulates reranking as sequential decision-making, optimizes for context utility using LLM feedback, and introduces a reference-anchored deterministic baseline for training stability.
Result: RRPO significantly outperforms strong baselines including RankZephyr, generalizes to diverse readers like GPT-4o, integrates with query expansion modules, and remains robust with noisy supervisors.
Conclusion: RRPO effectively bridges the gap between document relevance and LLM generation utility, eliminating need for expensive human annotations while improving retrieval-augmented generation performance.
Abstract: Rerankers play a pivotal role in refining retrieval results for Retrieval-Augmented Generation. However, current reranking models are typically optimized on static human annotated relevance labels in isolation, decoupled from the downstream generation process. This isolation leads to a fundamental misalignment: documents identified as topically relevant by information retrieval metrics often fail to provide the actual utility required by the LLM for precise answer generation. To bridge this gap, we introduce ReRanking Preference Optimization (RRPO), a reinforcement learning framework that directly aligns reranking with the LLM’s generation quality. By formulating reranking as a sequential decision-making process, RRPO optimizes for context utility using LLM feedback, thereby eliminating the need for expensive human annotations. To ensure training stability, we further introduce a reference-anchored deterministic baseline. Extensive experiments on knowledge-intensive benchmarks demonstrate that RRPO significantly outperforms strong baselines, including the powerful list-wise reranker RankZephyr. Further analysis highlights the versatility of our framework: it generalizes seamlessly to diverse readers (e.g., GPT-4o), integrates orthogonally with query expansion modules like Query2Doc, and remains robust even when trained with noisy supervisors.
[54] Reliable Control-Point Selection for Steering Reasoning in Large Language Models
Haomin Zhuang, Hojun Yoo, Xiaonan Luo, Kehan Guo, Xiangliang Zhang
Main category: cs.CL
TL;DR: A method for constructing effective steering vectors by filtering out behaviorally unstable boundaries and removing question-specific noise, improving reasoning performance on math tasks.
Details
Motivation: Current methods for constructing steering vectors for controlling reasoning behaviors in LLMs rely on keyword matching in chain-of-thought traces, assuming detected boundaries encode genuine behavioral signals. However, most boundaries (93.3%) are behaviorally unstable and fail to reproduce detected behaviors under re-generation.Method: Develop a probabilistic model formalizing intrinsic reasoning behaviors as stochastic events with context-dependent trigger probabilities. Propose stability filtering to retain only boundaries where the model consistently reproduces target behavior, combined with content-subspace projection to remove residual question-specific noise.
Result: Achieves 0.784 accuracy on MATH-500 (+5.0 over strongest baseline). The resulting steering vectors transfer across models in the same architecture family without re-extraction, improving Nemotron-Research-Reasoning-1.5B (+5.0) and DeepScaleR-1.5B-Preview (+6.0).
Conclusion: Stability filtering and content-subspace projection effectively construct steering vectors for controlling reasoning behaviors, addressing the problem of behavioral instability in detected boundaries and enabling cross-model transfer.
Abstract: Steering vectors offer a training-free mechanism for controlling reasoning behaviors in large language models, but constructing effective vectors requires identifying genuine behavioral signals in the model’s hidden states. For behaviors that can be toggled via prompts, this is straightforward. However, many reasoning behaviors – such as self-reflection – emerge spontaneously and resist prompt-level control. Current methods detect these behaviors through keyword matching in chain-of-thought traces, implicitly assuming that every detected boundary encodes a genuine behavioral signal. We show that this assumption is overwhelmingly wrong: across 541 keyword-detected boundaries, 93.3% are behaviorally unstable, failing to reproduce the detected behavior under re-generation from the same prefix. We develop a probabilistic model that formalizes intrinsic reasoning behaviors as stochastic events with context-dependent trigger probabilities, and show that unstable boundaries dilute the steering signal. Guided by this analysis, we propose stability filtering, which retains only boundaries where the model consistently reproduces the target behavior. Combined with a content-subspace projection that removes residual question-specific noise, our method achieves 0.784 accuracy on MATH-500 (+5.0 over the strongest baseline). The resulting steering vectors transfer across models in the same architecture family without re-extraction, improving Nemotron-Research-Reasoning-1.5B (+5.0) and DeepScaleR-1.5B-Preview (+6.0). Code is available at https://github.com/zhmzm/stability-steering.
[55] GaelEval: Benchmarking LLM Performance for Scottish Gaelic
Peter Devine, William Lamb, Beatrice Alex, Ignatius Ezeani, Dawn Knight, Mícheál J. Ó Meachair, Paul Rayson, Martin Wynne
Main category: cs.CL
TL;DR: GaelEval: First multi-dimensional benchmark for Scottish Gaelic evaluating LLMs on morphosyntactic MCQA, cultural translation, and cultural knowledge Q&A tasks, revealing proprietary models outperform open-weight systems and Gaelic prompting provides small advantages.
Details
Motivation: Multilingual LLMs have emergent 'shadow' capabilities in unsupported languages, but performance is uneven and under-measured, especially for morphosyntactically rich minority languages like Scottish Gaelic where translation benchmarks fail to capture structural competence.Method: Created GaelEval benchmark with three components: (1) expert-authored morphosyntactic multiple-choice QA task, (2) culturally grounded translation benchmark, and (3) large-scale cultural knowledge Q&A task. Evaluated 19 LLMs against fluent-speaker human baseline (n=30).
Result: Gemini 3 Pro Preview achieved 83.3% accuracy on linguistic task, surpassing human baseline (78.1%). Proprietary models consistently outperformed open-weight systems. Gaelic prompting yielded small but stable advantage (+2.4%). On cultural task, leading models exceeded 90% accuracy, though most performed worse under Gaelic prompting.
Conclusion: Frontier models achieve above-human performance on Gaelic grammar dimensions, Gaelic prompting has small positive effects, and proprietary models consistently outperform open-weight systems, revealing significant gaps in LLM evaluation for minority languages.
Abstract: Multilingual large language models (LLMs) often exhibit emergent ‘shadow’ capabilities in languages without official support, yet their performance on these languages remains uneven and under-measured. This is particularly acute for morphosyntactically rich minority languages such as Scottish Gaelic, where translation benchmarks fail to capture structural competence. We introduce GaelEval, the first multi-dimensional benchmark for Gaelic, comprising: (i) an expert-authored morphosyntactic MCQA task; (ii) a culturally grounded translation benchmark and (iii) a large-scale cultural knowledge Q&A task. Evaluating 19 LLMs against a fluent-speaker human baseline ($n=30$), we find that Gemini 3 Pro Preview achieves $83.3%$ accuracy on the linguistic task, surpassing the human baseline ($78.1%$). Proprietary models consistently outperform open-weight systems, and in-language (Gaelic) prompting yields a small but stable advantage (+$2.4%$). On the cultural task, leading models exceed $90%$ accuracy, though most systems perform worse under Gaelic prompting and absolute scores are inflated relative to the manual benchmark. Overall, GaelEval reveals that frontier models achieve above-human performance on several dimensions of Gaelic grammar, demonstrates the effect of Gaelic prompting and shows a consistent performance gap favouring proprietary over open-weight models.
[56] Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents
Xuan Qi
Main category: cs.CL
TL;DR: Systematic study of chain-of-thought reasoning length effects on function-calling agents reveals non-monotonic relationship where brief reasoning (8-32 tokens) dramatically improves accuracy by acting as function routing, while extended reasoning degrades performance below baseline.
Details
Motivation: To understand the relationship between reasoning length and accuracy in structured tool-use settings for language agents, as chain-of-thought reasoning is widely assumed to improve performance but its effects in function-calling contexts remain poorly understood.Method: Systematic study sweeping six token budgets (0-512) across 200 tasks from Berkeley Function Calling Leaderboard v3 Multiple benchmark, with error decomposition analysis and oracle analysis to determine optimal reasoning length, leading to proposed Function-Routing CoT (FR-CoT) method.
Result: Brief reasoning (32 tokens) improves accuracy by 45% relative (44.0% to 64.0%), while extended reasoning (256 tokens) degrades performance below baseline to 25.0%. FR-CoT achieves equivalent accuracy to free-form brief CoT while eliminating function hallucination completely (0.0%).
Conclusion: Optimal reasoning for function-calling agents is brief (8-16 tokens), acting primarily as function routing; structured brief reasoning with FR-CoT provides reliability guarantees without budget tuning, challenging assumptions about longer reasoning being better.
Abstract: How much should a language agent think before taking action? Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood. We present a systematic study of CoT budget effects on function-calling agents, sweeping six token budgets (0–512) across 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark. Our central finding is a striking non-monotonic pattern on Qwen2.5-1.5B-Instruct: brief reasoning (32 tokens) dramatically improves accuracy by 45% relative over direct answers, from 44.0% to 64.0%, while extended reasoning (256 tokens) degrades performance well below the no-CoT baseline, to 25.0% (McNemar p < 0.001). A three-way error decomposition reveals the mechanism. At d = 0, 30.5% of tasks fail because the model selects the wrong function from the candidate set; brief CoT reduces this to 1.5%, effectively acting as a function-routing step, while long CoT reverses the gain, yielding 28.0% wrong selections and 18.0% hallucinated functions at d = 256. Oracle analysis shows that 88.6% of solvable tasks require at most 32 reasoning tokens, with an average of 27.6 tokens, and a finer-grained sweep indicates that the true optimum lies at 8–16 tokens. Motivated by this routing effect, we propose Function-Routing CoT (FR-CoT), a structured brief-CoT method that templates the reasoning phase as “Function: [name] / Key args: […],” forcing commitment to a valid function name at the start of reasoning. FR-CoT achieves accuracy statistically equivalent to free-form d = 32 CoT while reducing function hallucination to 0.0%, providing a structural reliability guarantee without budget tuning.
[57] AstroConcepts: A Large-Scale Multi-Label Classification Corpus for Astrophysics
Atilla Kaan Alkan, Felix Grezes, Sergi Blanco-Cuaresma, Jennifer Lynn Bartlett, Daniel Chivvis, Anna Kelbert, Kelly Lockhart, Alberto Accomazzi
Main category: cs.CL
TL;DR: AstroConcepts: A corpus of astrophysics abstracts with 2,367 concepts for studying extreme class imbalance in scientific multi-label text classification, revealing insights about vocabulary-constrained LLMs and domain adaptation.
Details
Motivation: Scientific multi-label classification suffers from extreme class imbalance with specialized terminology following power-law distributions. Existing scientific corpora lack comprehensive controlled vocabularies and focus on broad categories, limiting systematic study of extreme imbalance.Method: Created AstroConcepts corpus with 21,702 astrophysics abstracts labeled with 2,367 concepts from Unified Astronomy Thesaurus. Evaluated traditional, neural, and vocabulary-constrained LLM methods with frequency-stratified evaluation to reveal hidden performance patterns.
Result: 1) Vocabulary-constrained LLMs achieve competitive performance vs domain-adapted models in astrophysics classification; 2) Domain adaptation yields larger improvements for rare terminology but absolute performance remains limited; 3) Frequency-stratified evaluation reveals hidden patterns not captured by aggregate scores.
Conclusion: The corpus enables systematic study of extreme class imbalance in scientific domains. Vocabulary-constrained LLMs show promise for parameter-efficient approaches, and frequency-stratified evaluation should be central to scientific multi-label evaluation for robustness assessment.
Abstract: Scientific multi-label text classification suffers from extreme class imbalance, where specialized terminology exhibits severe power-law distributions that challenge standard classification approaches. Existing scientific corpora lack comprehensive controlled vocabularies, focusing instead on broad categories and limiting systematic study of extreme imbalance. We introduce AstroConcepts, a corpus of English abstracts from 21,702 published astrophysics papers, labeled with 2,367 concepts from the Unified Astronomy Thesaurus. The corpus exhibits severe label imbalance, with 76% of concepts having fewer than 50 training examples. By releasing this resource, we enable systematic study of extreme class imbalance in scientific domains and establish strong baselines across traditional, neural, and vocabulary-constrained LLM methods. Our evaluation reveals three key patterns that provide new insights into scientific text classification. First, vocabulary-constrained LLMs achieve competitive performance relative to domain-adapted models in astrophysics classification, suggesting a potential for parameter-efficient approaches. Second, domain adaptation yields relatively larger improvements for rare, specialized terminology, although absolute performance remains limited across all methods. Third, we propose frequency-stratified evaluation to reveal performance patterns that are hidden by aggregate scores, thereby making robustness assessment central to scientific multi-label evaluation. These results offer actionable insights for scientific NLP and establish benchmarks for research on extreme imbalance.
[58] Do Lexical and Contextual Coreference Resolution Systems Degrade Differently under Mention Noise? An Empirical Study on Scientific Software Mentions
Atilla Kaan Alkan, Felix Grezes, Jennifer Lynn Bartlett, Anna Kelbert, Kelly Lockhart, Alberto Accomazzi
Main category: cs.CL
TL;DR: The paper presents two fine-tuning-free approaches (Fuzzy Matching and Context Aware Representations) for cross-document software mention coreference resolution, achieving competitive performance in SOMD 2026 shared task with CAR slightly outperforming FM.
Details
Motivation: To address the underexlexplored task of cross-document software mention coreference resolution, comparing simple lexical methods with more sophisticated embedding-based approaches to understand their relative strengths and failure modes.Method: Two fine-tuning-free approaches: 1) Fuzzy Matching (FM) using lexical string similarity, and 2) Context Aware Representations (CAR) combining mention-level and document-level embeddings. Both methods were evaluated on SOMD 2026 shared task with controlled noise-injection studies.
Result: Both methods achieved competitive performance (CoNLL F1 0.94-0.96), with CAR consistently outperforming FM by 1 point. CAR showed better robustness to boundary noise, while FM degraded more gracefully under mention substitution. CAR scales approximately linearly with corpus size vs FM’s superlinear scaling.
Conclusion: System selection should consider both the noise profile of upstream mention detectors and corpus scale. CAR is more efficient for large-scale applications, while both methods perform well due to high surface regularity of software names reducing need for complex semantic reasoning.
Abstract: We present our participation in the SOMD 2026 shared task on cross-document software mention coreference resolution, where our systems ranked second across all three subtasks. We compare two fine-tuning-free approaches: Fuzzy Matching (FM), a lexical string-similarity method, and Context Aware Representations (CAR), which combines mention-level and document-level embeddings. Both achieve competitive performance across all subtasks (CoNLL F1 of 0.94-0.96), with CAR consistently outperforming FM by 1 point on the official test set, consistent with the high surface regularity of software names, which reduces the need for complex semantic reasoning. A controlled noise-injection study reveals complementary failure modes: as boundary noise increases, CAR loses only 0.07 F1 points from clean to fully corrupted input, compared to 0.20 for FM, whereas under mention substitution, FM degrades more gracefully (0.52 vs. 0.63). Our inference-time analysis shows that FM scales superlinearly with corpus size, whereas CAR scales approximately linearly, making CAR the more efficient choice at large scale. These findings suggest that system selection should be informed by both the noise profile of the upstream mention detector and the scale of the target corpus. We release our code to support future work on this underexplored task.
[59] Adam’s Law: Textual Frequency Law on Large Language Models
Hongyuan Adam Lu, Z. L., Victor Wei, Zefan Zhang, Zhao Hong, Qiqi Xiang, Bowen Cao, Wai Lam
Main category: cs.CL
TL;DR: Textual frequency analysis framework for LLMs with three components: Textual Frequency Law for preferring frequent data, Textual Frequency Distillation for corpus enhancement, and Curriculum Textual Frequency Training for fine-tuning in frequency order.
Details
Motivation: Textual frequency is known to affect human cognition in reading but is understudied in LLMs. The paper aims to explore how textual data frequency impacts LLM performance and develop methods to leverage this insight.Method: Three-part framework: 1) Textual Frequency Law (TFL) - use online resources to estimate sentence frequency and paraphrase inputs to more frequent expressions; 2) Textual Frequency Distillation (TFD) - query LLMs for story completion to enhance corpora; 3) Curriculum Textual Frequency Training (CTFT) - fine-tune LLMs in increasing order of sentence-level frequency.
Result: Experiments on Textual Frequency Paired Dataset (TFPD) across math reasoning, machine translation, commonsense reasoning, and tool calling show effectiveness of the framework.
Conclusion: Textual frequency is a valuable dimension for improving LLM performance, and the proposed framework effectively leverages frequency information for better prompting, fine-tuning, and corpus enhancement.
Abstract: While textual frequency has been validated as relevant to human cognition in reading speed, its relatedness to Large Language Models (LLMs) is seldom studied. We propose a novel research direction in terms of textual data frequency, which is an understudied topic, to the best of our knowledge. Our framework is composed of three units. First, this paper proposes Textual Frequency Law (TFL), which indicates that frequent textual data should be preferred for LLMs for both prompting and fine-tuning. Since many LLMs are closed-source in their training data, we propose using online resources to estimate the sentence-level frequency. We then utilize an input paraphraser to paraphrase the input into a more frequent textual expression. Next, we propose Textual Frequency Distillation (TFD) by querying LLMs to conduct story completion by further extending the sentences in the datasets, and the resulting corpora are used to adjust the initial estimation. Finally, we propose Curriculum Textual Frequency Training (CTFT) that fine-tunes LLMs in an increasing order of sentence-level frequency. Experiments are conducted on our curated dataset Textual Frequency Paired Dataset (TFPD) on math reasoning, machine translation, commonsense reasoning and agentic tool calling. Results show the effectiveness of our framework.
[60] The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level
Jeremy Herbst, Jae Hee Lee, Stefan Wermter
Main category: cs.CL
TL;DR: MoE architectures show experts are less polysemantic than dense FFN neurons, making them more interpretable at expert level where they function as fine-grained task specialists rather than broad domain experts.
Details
Motivation: While MoE architectures are widely adopted for computational efficiency, it's unclear whether their sparsity makes them inherently more interpretable than dense feed-forward networks. The paper investigates whether MoE experts are easier to interpret due to their specialized activation patterns.Method: Used k-sparse probing to compare MoE experts and dense FFNs, analyzing neuron polysemanticity and expert specialization patterns. Zoomed out from neuron-level to expert-level analysis and automatically interpreted hundreds of experts to understand their functional roles.
Result: MoE expert neurons are consistently less polysemantic than dense FFN neurons, with the gap increasing as routing becomes sparser. Experts function as fine-grained task specialists (e.g., linguistic operations, semantic tasks) rather than broad domain specialists or simple token-level processors.
Conclusion: MoE architectures are inherently more interpretable at the expert level due to reduced polysemanticity, providing a clearer path toward large-scale model interpretability. Expert-level analysis is more effective than neuron-level analysis for understanding MoE models.
Abstract: Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using $k$-sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they function as fine-grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in LaTeX). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large-scale model interpretability. Code is available at: https://github.com/jerryy33/MoE_analysis
[61] Neuro-RIT: Neuron-Guided Instruction Tuning for Robust Retrieval-Augmented Language Model
Jaemin Kim, Jae O Lee, Sumyeong Ahn, Seo Yeon Park
Main category: cs.CL
TL;DR: Neuro-RIT is a neuron-guided robust instruction tuning framework that enhances retrieval-augmented language models’ robustness to irrelevant contexts through precision neuron alignment rather than coarse parameter updates.
Details
Motivation: Retrieval-augmented language models degrade with irrelevant/noisy contexts. Existing robustness methods use coarse-grained parameter updates, ignoring LLMs' neuron-level sparsity, limiting effectiveness.Method: Proposes Neuro-RIT with attribution-based neuron mining to disentangle neurons for relevant vs irrelevant contexts, then two-stage instruction tuning: noise suppression via deactivating irrelevant-context neurons, and optimizing layers for evidence distillation.
Result: Extensive experiments across diverse QA benchmarks show Neuro-RIT consistently outperforms strong baselines and other robustness-enhancing methods.
Conclusion: Shifting from dense adaptation to precision neuron alignment effectively enhances RALM robustness to noisy contexts, demonstrating the value of neuron-level analysis for retrieval-augmented systems.
Abstract: Retrieval-Augmented Language Models (RALMs) have demonstrated significant potential in knowledge-intensive tasks; however, they remain vulnerable to performance degradation when presented with irrelevant or noisy retrieved contexts. Existing approaches to enhance robustness typically operate via coarse-grained parameter updates at the layer or module level, often overlooking the inherent neuron-level sparsity of Large Language Models (LLMs). To address this limitation, we propose Neuro-RIT (Neuron-guided Robust Instruction Tuning), a novel framework that shifts the paradigm from dense adaptation to precision-driven neuron alignment. Our method explicitly disentangles neurons that are responsible for processing relevant versus irrelevant contexts using attribution-based neuron mining. Subsequently, we introduce a two-stage instruction tuning strategy that enforces a dual capability for noise robustness: achieving direct noise suppression by functionally deactivating neurons exclusive to irrelevant contexts, while simultaneously optimizing targeted layers for evidence distillation. Extensive experiments across diverse QA benchmarks demonstrate that Neuro-RIT consistently outperforms strong baselines and robustness-enhancing methods.
[62] Towards Position-Robust Talent Recommendation via Large Language Models
Silin Du, Hongyan Liu
Main category: cs.CL
TL;DR: L3TR is a listwise talent recommendation framework using LLMs that addresses position bias and token consumption issues through block attention, local positional encoding, and ID sampling methods.
Details
Motivation: Existing talent recommendation systems using LLMs follow pointwise paradigms that fail to capture candidate relationships, consume excessive tokens, and suffer from position bias and lost-in-the-middle issues when processing multiple documents.Method: Proposes L3TR framework with: 1) Block attention mechanism and local positional encoding to enhance inter-document processing and mitigate position/token bias, 2) ID sampling method to handle candidate set size inconsistency between training and inference, 3) Evaluation methods for bias detection and training-free debiasing techniques.
Result: Extensive experiments on two real-world datasets show consistent improvements over existing baselines, validating the effectiveness of L3TR in talent recommendation.
Conclusion: L3TR successfully addresses key limitations of LLM-based talent recommendation systems by introducing listwise processing, bias mitigation techniques, and practical solutions for real-world deployment challenges.
Abstract: Talent recruitment is a critical, yet costly process for many industries, with high recruitment costs and long hiring cycles. Existing talent recommendation systems increasingly adopt large language models (LLMs) due to their remarkable language understanding capabilities. However, most prior approaches follow a pointwise paradigm, which requires LLMs to repeatedly process some text and fails to capture the relationships among candidates in the list, resulting in higher token consumption and suboptimal recommendations. Besides, LLMs exhibit position bias and the lost-in-the-middle issue when answering multiple-choice questions and processing multiple long documents. To address these issues, we introduce an implicit strategy to utilize LLM’s potential output for the recommendation task and propose L3TR, a novel framework for listwise talent recommendation with LLMs. In this framework, we propose a block attention mechanism and a local positional encoding method to enhance inter-document processing and mitigate the position bias and concurrent token bias issue. We also introduce an ID sampling method for resolving the inconsistency between candidate set sizes in the training phase and the inference phase. We design evaluation methods to detect position bias and token bias and training-free debiasing methods. Extensive experiments on two real-world datasets validated the effectiveness of L3TR, showing consistent improvements over existing baselines.
[63] CV-18 NER: Augmented Common Voice for Named Entity Recognition from Arabic Speech
Youssef Saidi, Haroun Elleuch, Fethi Bougares
Main category: cs.CL
TL;DR: First publicly available dataset for Arabic speech NER (CV-18 NER) created from Common Voice 18 with manual annotations, showing E2E models outperform pipeline approaches in this low-resource setting.
Details
Motivation: Arabic speech NER remains under-explored due to morphological complexity, absence of short vowels, and limited annotated resources, creating a need for benchmark datasets and models.Method: Created CV-18 NER dataset by augmenting Arabic Common Voice 18 with manual NER annotations following Wojood schema (21 entity types). Benchmarked pipeline systems (ASR + text NER) and E2E models based on Whisper and AraBEST-RQ.
Result: E2E systems substantially outperformed best pipeline configuration, reaching 37.0% CoER (AraBEST-RQ 300M) and 38.0% CVER (Whisper-medium). Arabic-specific pretraining yields strong ASR, while multilingual weak supervision transfers better to joint speech-to-entity learning.
Conclusion: E2E approaches are effective for Arabic speech NER despite low-resource setting, with dataset and models publicly released to establish first open benchmark for this task.
Abstract: End-to-end speech Named Entity Recognition (NER) aims to directly extract entities from speech. Prior work has shown that end-to-end (E2E) approaches can outperform cascaded pipelines for English, French, and Chinese, but Arabic remains under-explored due to its morphological complexity, the absence of short vowels, and limited annotated resources. We introduce CV-18 NER, the first publicly available dataset for NER from Arabic speech, created by augmenting the Arabic Common Voice 18 corpus with manual NER annotations following the fine-grained Wojood schema (21 entity types). We benchmark both pipeline systems (ASR + text NER) and E2E models based on Whisper and AraBEST-RQ. E2E systems substantially outperform the best pipeline configuration on the test set, reaching 37.0% CoER (AraBEST-RQ 300M) and 38.0% CVER (Whisper-medium). Further analysis shows that Arabic-specific self-supervised pretraining yields strong ASR performance, while multilingual weak supervision transfers more effectively to joint speech-to-entity learning, and that larger models may be harder to adapt in this low-resource setting. Our dataset and models are publicly released, providing the first open benchmark for end-to-end named entity recognition from Arabic speech https://huggingface.co/datasets/Elyadata/CV18-NER.
[64] No Single Best Model for Diversity: Learning a Router for Sample Diversity
Yuhan Liu, Fangyuan Xu, Vishakh Padmakumar, Daphne Ippolito, Eunsol Choi
Main category: cs.CL
TL;DR: The paper introduces diversity coverage metric to evaluate LLMs’ ability to generate comprehensive sets of valid responses to open-ended prompts, and proposes a router to select the best model per query for maximizing diversity.
Details
Motivation: When prompts allow many valid answers, current LLMs struggle to comprehensively generate diverse responses. There's a need to evaluate and improve models' ability to produce comprehensive answer sets that satisfy diverse user needs.Method: 1) Introduces diversity coverage metric measuring total quality scores of unique answers relative to optimal set; 2) Evaluates 18 LLMs on open-ended prompts; 3) Develops a router that predicts the best model for each query based on prompt characteristics.
Result: No single model dominates across all prompts, but per-prompt there exists a significantly better model. The trained router outperforms single best model baseline by 2.5% (26.3% vs 23.8%) on NB-Wildchat and generalizes to out-of-domain datasets and different prompting strategies.
Conclusion: The work establishes foundation for studying comprehensive answer generation using multiple models, showing that model selection per query can significantly improve diversity coverage compared to using any single model.
Abstract: When posed with prompts that permit a large number of valid answers, comprehensively generating them is the first step towards satisfying a wide range of users. In this paper, we study methods to elicit a comprehensive set of valid responses. To evaluate this, we introduce \textbf{diversity coverage}, a metric that measures the total quality scores assigned to each \textbf{unique} answer in the predicted answer set relative to the best possible answer set with the same number of answers. Using this metric, we evaluate 18 LLMs, finding no single model dominates at generating diverse responses to a wide range of open-ended prompts. Yet, per each prompt, there exists a model that outperforms all other models significantly at generating a diverse answer set. Motivated by this finding, we introduce a router that predicts the best model for each query. On NB-Wildchat, our trained router outperforms the single best model baseline (26.3% vs $23.8%). We further show generalization to an out-of-domain dataset (NB-Curated) as well as different answer-generation prompting strategies. Our work lays foundation for studying generating comprehensive answers when we have access to a suite of models.
[65] Grounded Token Initialization for New Vocabulary in LMs for Generative Recommendation
Daiwei Chen, Zhoutong Fu, Chengming Jiang, Haichao Zhang, Ran Zhou, Tan Wang, Chunnan Yao, Guoyao Li, Rui Cai, Yihan Cao, Ruijie Jiang, Fedor Borisyuk, Jianqiang Shen, Jingwei Wu, Ramya Korlakai Vinayak
Main category: cs.CL
TL;DR: GTI (Grounded Token Initialization) improves vocabulary extension for language models by grounding new tokens in semantically meaningful locations before fine-tuning, outperforming standard mean initialization.
Details
Motivation: Current practice of initializing new vocabulary tokens as the mean of existing embeddings collapses tokens into a degenerate subspace, erasing inter-token distinctions that fine-tuning struggles to recover, making token initialization a key bottleneck.Method: Proposes GTI - a lightweight grounding stage that maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision before fine-tuning.
Result: GTI outperforms both mean initialization and existing auxiliary-task adaptation methods across multiple generative recommendation benchmarks, including industry-scale and public datasets, producing richer inter-token structure that persists through fine-tuning.
Conclusion: Token initialization quality is a key bottleneck in vocabulary extension, and linguistically grounding novel tokens before fine-tuning enables better leverage of the model’s general-purpose knowledge for novel-token domains.
Abstract: Language models (LMs) are increasingly extended with new learnable vocabulary tokens for domain-specific tasks, such as Semantic-ID tokens in generative recommendation. The standard practice initializes these new tokens as the mean of existing vocabulary embeddings, then relies on supervised fine-tuning to learn their representations. We present a systematic analysis of this strategy: through spectral and geometric diagnostics, we show that mean initialization collapses all new tokens into a degenerate subspace, erasing inter-token distinctions that subsequent fine-tuning struggles to fully recover. These findings suggest that \emph{token initialization} is a key bottleneck when extending LMs with new vocabularies. Motivated by this diagnosis, we propose the \emph{Grounded Token Initialization Hypothesis}: linguistically grounding novel tokens in the pretrained embedding space before fine-tuning better enables the model to leverage its general-purpose knowledge for novel-token domains. We operationalize this hypothesis as GTI (Grounded Token Initialization), a lightweight grounding stage that, prior to fine-tuning, maps new tokens to distinct, semantically meaningful locations in the pretrained embedding space using only paired linguistic supervision. Despite its simplicity, GTI outperforms both mean initialization and existing auxiliary-task adaptation methods in the majority of evaluation settings across multiple generative recommendation benchmarks, including industry-scale and public datasets. Further analyses show that grounded embeddings produce richer inter-token structure that persists through fine-tuning, corroborating the hypothesis that initialization quality is a key bottleneck in vocabulary extension.
[66] Analyzing Language Bias Between French and English in Conventional Multilingual Sentiment Analysis Models
Ethan Parker Wong, Faten M’hiri
Main category: cs.CL
TL;DR: Analysis of language bias in English-French sentiment classification using SVM and Naive Bayes models reveals French outperforms English on accuracy metrics, but SVM shows near-equitable treatment across languages while Naive Bayes exhibits greater disparities.
Details
Motivation: Inspired by Statistics Canada's report on bias in bilingual NLP, this study investigates potential language biases in multilingual sentiment analysis between English and French, aiming to understand how dataset diversity affects equity in multilingual NLP systems.Method: Used Support Vector Machine (SVM) and Naive Bayes models on three balanced 50-50 English-French datasets, employing Fairlearn tool to assess bias through demographic parity ratios and performance metrics (accuracy, recall, F1 score).
Result: French data outperformed English across all metrics (accuracy, recall, F1) in both models, suggesting language bias. However, SVM showed near-equitable treatment with demographic parity ratios of 0.963-0.985, while Naive Bayes exhibited greater disparities with ratios of 0.813-0.961.
Conclusion: The study highlights the importance of developing equitable multilingual NLP systems, revealing that model choice significantly impacts fairness outcomes, with SVM performing more equitably than Naive Bayes despite French outperforming English in both models.
Abstract: Inspired by the ‘Bias Considerations in Bilingual Natural Language Processing’ report by Statistics Canada, this study delves into potential biases in multilingual sentiment analysis between English and French. Given a 50-50 dataset of French and English, we aim to determine if there exists a language bias and explore how the incorporation of more diverse datasets in the future might affect the equity of multilingual Natural Language Processing (NLP) systems. By employing Support Vector Machine (SVM) and Naive Bayes models on three balanced datasets, we reveal potential biases in multilingual sentiment classification. Utilizing Fairlearn, a tool for assessing bias in machine learning models, our findings indicate nuanced outcomes. With French data outperforming English across accuracy, recall, and F1 score metrics in both models, hinting at a language bias favoring French. However, Fairlearn’s metrics suggest that the SVM approaches equitable levels with a demographic parity ratio of 0.963, 0.989, and 0.985 for the three separate datasets, indicating near-equitable treatment across languages. In contrast, Naive Bayes demonstrates greater disparities, evidenced by a demographic parity ratio of 0.813, 0.908, and 0.961. These findings reveal the importance of developing equitable multilingual NLP systems, particularly as we anticipate the inclusion of more datasets in various languages in the future.
[67] Lexical categories of stem-forming roots in Mapudüngun verb forms
Andrés Chandía
Main category: cs.CL
TL;DR: Analysis of Mapuche language verbal root classification to verify linguistic assumptions for a computational morphological analysis system.
Details
Motivation: The need to verify linguistic assumptions underlying a computational morphological analysis system for Mapudüngun (Mapuche language) after initial development and evaluation.Method: Lexical category classification analysis of Mapudüngun roots recognized as verbal in the source material used for the morphological analysis system development.
Result: Results directly benefit the computational analyzer through immediate implementation of verified classifications, and help clarify uncertainties about lexical categories in Mapuche language.
Conclusion: This work serves as preliminary task for identifying valency of true verbal roots, with results to be complemented by subsequent work on valency analysis.
Abstract: After developing a computational system for morphological analysis of the Mapuche language, and evaluating it with texts from various authors and styles, it became necessary to verify the linguistic assumptions of the source used as the basis for implementing this tool. In the present work, the primary focus is on the lexical category classification of Mapudüngun roots recognised as verbal in the source utilised for the development of the morphological analysis system. The results of this lexical category revision directly benefit the computational analyser, as they are implemented as soon as they are verified. Additionally, it is hoped that these results will help clarify some uncertainties about lexical categories in the Mapuche language. This work addresses a preliminary task to identify the valency of true verbal roots, the results of which will be presented in a subsequent work that complements this article.
[68] No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding
Michael Krumdick, Charles Lovering, Varshini Reddy, Seth Ebner, Chris Tanner
Main category: cs.CL
TL;DR: LLM-as-a-Judge evaluation framework shows limitations in assessing correctness in business/finance domains without reference answers, requiring expert verification for reliable assessment.
Details
Motivation: As LLM deployment expands in high-stakes domains like business and finance, reliable evaluation methods are critical. The LLM-as-a-Judge framework is scalable and cost-effective but its accuracy in assessing correctness (vs. style) remains unclear.Method: Created BFF-Bench with 160 challenging business/finance questions and long-form responses by financial professionals. Experts evaluated 1,200 LLM responses on BFF-Bench and MT-Bench subset. Analyzed agreement between automated grading methods (LLM Judges) and human experts using expert-annotated VERDICTS dataset.
Result: LLM Judges are more reliable than other automated methods but show high agreement with human experts only on questions they can answer correctly themselves. Providing expert-written references to judges largely mitigates this limitation.
Conclusion: LLM-as-a-Judge has limitations for correctness assessment without human verification or reference answers, especially in specialized domains like business/finance where correctness matters more than style.
Abstract: Reliable evaluation of large language models (LLMs) is critical as their deployment rapidly expands, particularly in high-stakes domains such as business and finance. The LLM-as-a-Judge framework, which uses prompted LLMs to evaluate response quality, is appealing due to its scalability, low cost, and strong correlations with human stylistic preferences. However, it remains unclear how accurately these methods can assess response quality in domains where correctness matters more than style. To address this gap, we introduce the Business and Finance Fundamentals Benchmark (BFF-Bench), a dataset of 160 challenging questions and long-form responses authored by financial professionals. These experts subsequently evaluated the correctness of 1,200 responses generated by a diverse set of LLMs on both BFF-Bench and a challenging subset of MT-Bench. With this expert-annotated dataset of judgments (VERDICTS), we analyze the agreement between a suite of automated grading methods and human experts. While we observe that LLM Judges are more reliable than other grading methods, our findings reveal a clear pattern in LLM Judge performance: when not provided with a correct reference, judges show high agreement with human experts only on questions the judges were able to correctly answer themselves. We demonstrate that providing the judges with expert-written references largely mitigates this issue, highlighting the limits of using LLM-as-a-Judge without any form of human verification.
[69] Green Prompting: Characterizing Prompt-driven Energy Costs of LLM Inference
Marta Adamska, Daria Smirnova, Hamid Nasiri, Zhengxin Yu, Peter Garraghan
Main category: cs.CL
TL;DR: Empirical study showing that prompt semantic meaning and task-related keywords significantly impact LLM inference energy costs, more than prompt length.
Details
Motivation: High inference costs of LLMs impact sustainability and financial feasibility, but the relationship between prompt/response characteristics and energy consumption is not well understood.Method: Experiments with three open-source transformer-based LLMs across question answering, sentiment analysis, and text generation tasks, analyzing prompt/response characteristics (length, semantic meaning, time, energy consumption).
Result: Prompt semantic meaning matters more than length; specific task-related keywords correlate with higher/lower energy usage; identical tasks produce varying energy consumption patterns across models.
Conclusion: Prompt design significantly impacts inference efficiency, with semantic meaning and task keywords being key factors, paving way for energy-adaptive LLMs.
Abstract: Large Language Models (LLMs) have become widely used across various domains spanning search engines, code generation, and text creation. However, a major concern associated with their adoption is the high cost of inference, impacting both their sustainability and financial feasibility. In this study, we empirically study how different prompt and response characteristics directly impact LLM inference energy cost. We conduct experiments leveraging three open-source transformer-based LLMs across three task types$-$question answering, sentiment analysis, and text generation. For each inference, we analyzed prompt and response characteristics (length, semantic meaning, time taken, energy consumption). Our results demonstrate that even when presented with identical tasks, models generate responses with varying characteristics and subsequently exhibit distinct energy consumption patterns. We found that prompt length is less significant than the semantic meaning of the task itself. In addition, we identified specific keywords associated with higher or lower energy usage that vary between associated tasks. These findings highlight the importance of prompt design in optimizing inference efficiency. We conclude that the semantic meaning of prompts and certain task-related keywords significantly impact inference costs, leading the way for deeper exploration towards creating energy-adaptive LLMs.
[70] One Pic is All it Takes: Poisoning Visual Document Retrieval Augmented Generation with a Single Image
Ezzeldin Shereen, Dan Ristea, Shae McFadden, Burak Hasircioglu, Vasilios Mavroudis, Chris Hicks
Main category: cs.CL
TL;DR: VD-RAG systems are vulnerable to poisoning attacks where a single adversarial image in the knowledge base can manipulate both retrieval and generation components to spread disinformation or cause denial-of-service.
Details
Motivation: Visual Document RAG (VD-RAG) uses document screenshots as knowledge base but introduces new attack vectors through image modality. The paper aims to demonstrate VD-RAG's vulnerability to poisoning attacks targeting both retrieval and generation components.Method: Two attack objectives defined: targeted attacks for spreading disinformation and universal attacks for denial-of-service. Attacks implemented under white-box and black-box assumptions using multi-objective gradient-based optimization and prompting state-of-the-art generative models. Evaluated on two visual document datasets with diverse retrievers (embedding models) and generators (vision language models).
Result: VD-RAG is vulnerable to poisoning attacks in both targeted and universal settings, though shows robustness to black-box attacks in universal setting. A single adversarial image injection can successfully manipulate system behavior.
Conclusion: VD-RAG systems have significant security vulnerabilities through image-based poisoning attacks, highlighting the need for robust defenses in multimodal RAG architectures.
Abstract: Retrieval-augmented generation (RAG) is instrumental for inhibiting hallucinations in large language models (LLMs) through the use of a factual knowledge base (KB). Although PDF documents are prominent sources of knowledge, text-based RAG pipelines are ineffective at capturing their rich multi-modal information. In contrast, visual document RAG (VD-RAG) uses screenshots of document pages as the KB, which has been shown to achieve state-of-the-art results. However, by introducing the image modality, VD-RAG introduces new attack vectors for adversaries to disrupt the system by injecting malicious documents into the KB. In this paper, we demonstrate the vulnerability of VD-RAG to poisoning attacks targeting both retrieval and generation. We define two attack objectives and demonstrate that both can be realized by injecting only a single adversarial image into the KB. Firstly, we introduce a targeted attack against one or a group of queries with the goal of spreading targeted disinformation. Secondly, we present a universal attack that, for any potential user query, influences the response to cause a denial-of-service in the VD-RAG system. We investigate the two attack objectives under both white-box and black-box assumptions, employing a multi-objective gradient-based optimization approach as well as prompting state-of-the-art generative models. Using two visual document datasets, a diverse set of state-of-the-art retrievers (embedding models) and generators (vision language models), we show VD-RAG is vulnerable to poisoning attacks in both the targeted and universal settings, yet demonstrating robustness to black-box attacks in the universal setting.
[71] SAKE: Structured Agentic Knowledge Extrapolation for Complex LLM Reasoning via Reinforcement Learning
Jiashu He, Jinxuan Fan, Bowen Jiang, Ignacio Houine, Dan Roth, Alejandro Ribeiro
Main category: cs.CL
TL;DR: SAKE is a reinforcement learning framework that trains LLMs to autonomously retrieve and extrapolate structured knowledge from knowledge graphs using entity grouping and triplet retrieval tools.
Details
Motivation: Knowledge extrapolation is crucial for solving complex questions in specialized domains where retrieving comprehensive external knowledge is impractical. Current approaches often require large models with complex prompting, which is inefficient.Method: SAKE uses RL-powered agentic framework with two KG tools: entity group construction and cross-group triplet retrieval. The model learns to interleave these tools in a three-turn rollout: extracting entities, filtering concept groups, and associative reasoning through analogy. Optimized end-to-end with GRPO using curriculum reward.
Result: SAKE fine-tuned Qwen2.5-7B model surpasses GPT-3.5-Turbo on biomedical (75.4% vs. 70.1%) and commonsense (81.3% vs. 74.7%) benchmarks while reducing token usage by over 90%.
Conclusion: Associative reasoning over incomplete structured knowledge doesn’t require large models with complex prompting; can be learned end-to-end by small open-weight models through RL with proper tools and training signals.
Abstract: Knowledge extrapolation is the process of inferring novel information by combining and extending existing knowledge that is explicitly available. It is essential for solving complex questions in specialized domains where retrieving comprehensive external knowledge is impractical. We propose SAKE (Structured Agentic Knowledge Extrapolation), a RL powered agentic framework that trains LLMs to autonomously retrieve and extrapolate structured knowledge through tool-augmented reinforcement learning. SAKE defines two external KG tools: entity group construction and cross-group triplet retrieval. The model learns to interleave these 2 retrieval tools during a three-turn rollout: extracting key entities, filtering relevant concept groups, and associative reasoning by constructing new triplets through analogy. The entire pipeline is optimized end-to-end with GRPO using a curriculum reward, teaching the model what to retrieve and how to reason over it. Our experiments proved that SAKE fine-tuned Qwen2.5-7B model surpasses GPT-3.5-Turbo with state-of-the-art agentic KG reasoning on both biomedical (75.4% vs. 70.1%) and commonsense (81.3% vs. 74.7%) benchmarks, while reducing token usage by over 90%. These results demonstrate that associative reasoning over incomplete structured knowledge does not requiring large models with complex, multi-step prompting, thus can be learned end-to-end by small, open-weight models through reinforcement learning with the right tools and training signal. Our code is available at https://anonymous.4open.science/r/SAKE-7585.
[72] Reviewing Scientific Papers for Critical Problems With Reasoning LLMs: Baseline Approaches and Automatic Evaluation
Tianmai M. Zhang, Neil F. Abernethy
Main category: cs.CL
TL;DR: LLMs used as manuscript quality checkers to identify critical errors in scientific papers, with evaluation framework using top reasoning LLMs as judges on arXiv withdrawn papers.
Details
Motivation: Address peer review crisis by using LLMs as quality checkers rather than full review generators to avoid irresponsible use and manipulation, while providing scalable assistance to human reviewers.Method: Proposed baseline approaches and automatic evaluation framework using top reasoning LLMs as judges, validated on papers withdrawn from arXiv to identify critical errors and unsoundness problems.
Result: o3 model showed best problem identification performance among all tested LLMs at modest cost, providing insights into document-based scientific understanding/reasoning.
Conclusion: LLMs can effectively serve as manuscript quality checkers rather than full review generators, with o3 demonstrating strong performance for identifying critical errors in scientific papers.
Abstract: Recent advancements in large language models have sparked interest in utilizing them to aid the peer review process of scientific publication amid the peer review crisis. However, having AI models generate full reviews in the same way as human reviewers risks exacerbating the irresponsible use of LLM-generated reviews and instigating intentional manipulation. As an alternative, we propose adopting LLMs as manuscript quality checkers. We introduce several baseline approaches and an extendable automatic evaluation framework using top reasoning LLMs as judges to tackle the difficulty of recruiting domain experts for manual evaluation. Utilizing papers withdrawn from arXiv, we validated our proposed methods with several leading reasoning LLMs available in May-June 2025 and assessed their performance and API costs for identifying critical errors and unsoundness problems in scientific papers. o3 exhibited the best problem identification performance among all models at a modest cost. This paper provides insights into document-based scientific understanding/reasoning and lays a foundation for future applications. Our dataset, code, and model outputs are publicly available.
[73] A Comparative Study of Competency Question Elicitation Methods from Ontology Requirements
Reham Alharbi, Valentina Tamma, Terry R. Payne, Jacopo de Berardinis
Main category: cs.CL
TL;DR: Empirical comparison of three Competency Question generation approaches: manual formulation, pattern instantiation, and LLM generation, evaluated across multiple dimensions using cultural heritage requirements.
Details
Motivation: There's a lack of systematic comparison between different Competency Question (CQ) formulation approaches in knowledge engineering, despite their importance for ontology design, validation, and testing.Method: Generated CQs using three approaches (manual by engineers, pattern instantiation, LLM generation) from cultural heritage requirements, then assessed them across dimensions: acceptability, ambiguity, relevance, readability, and complexity using multi-annotator evaluation.
Result: Different CQ generation approaches produce CQs with distinct characteristics; LLMs can be useful for initial CQ elicitation but are sensitive to the specific model used and generally require further refinement before being usable for requirements modeling.
Conclusion: First multi-annotator dataset of CQs from different methods and systematic comparison showing each approach has different characteristics, with LLMs serving as promising but imperfect starting points requiring refinement.
Abstract: Competency Questions (CQs) are pivotal in knowledge engineering, guiding the design, validation, and testing of ontologies. A number of diverse formulation approaches have been proposed in the literature, ranging from completely manual to Large Language Model (LLM) driven ones. However, attempts to characterise the outputs of these approaches and their systematic comparison are scarce. This paper presents an empirical comparative evaluation of three distinct CQ formulation approaches: manual formulation by ontology engineers, instantiation of CQ patterns, and generation using state of the art LLMs. We generate CQs using each approach from a set of requirements for cultural heritage, and assess them across different dimensions: degree of acceptability, ambiguity, relevance, readability and complexity. Our contribution is twofold: (i) the first multi-annotator dataset of CQs generated from the same source using different methods; and (ii) a systematic comparison of the characteristics of the CQs resulting from each approach. Our study shows that different CQ generation approaches have different characteristics and that LLMs can be used as a way to initially elicit CQs, however these are sensitive to the model used to generate CQs and they generally require a further refinement step before they can be used to model requirements.
[74] LLM Analysis of 150+ years of German Parliamentary Debates on Migration Reveals Shift from Post-War Solidarity to Anti-Solidarity in the Last Decade
Aida Kostikova, Ole Pütz, Steffen Eger, Olga Sabelfeld, Benjamin Paassen
Main category: cs.CL
TL;DR: LLMs achieve human-level agreement in annotating solidarity/anti-solidarity in German parliamentary debates about migration, but systematic errors require statistical correction for valid social science inference.
Details
Motivation: Traditional manual annotation limits large-scale analysis of political speech on migration. LLMs offer potential for automated analysis but need rigorous validation for social science applications.Method: Theory-driven annotation scheme applied to German parliamentary debates. Comprehensive evaluation of multiple LLMs (GPT-5, gpt-oss-120B) examining model size, prompting strategies, fine-tuning, and error patterns. Used Design-based Supervised Learning (DSL) to correct bias in trend estimates.
Result: Strongest LLMs achieve human-level agreement on solidarity/anti-solidarity annotation. Systematic errors bias downstream results. DSL reduces bias in long-term trend estimates. Analysis shows high postwar solidarity and marked rise in anti-solidarity since 2015.
Conclusion: LLMs can support large-scale social-scientific text analysis but require rigorous validation and statistical correction of systematic errors to ensure valid inference.
Abstract: Migration has been a core topic in German political debate, from the postwar displacement of millions of expellees to labor migration and recent refugee movements. Studying political speech across such wide-ranging phenomena in depth has traditionally required extensive manual annotation, limiting analysis to small subsets of the data. Large language models (LLMs) offer a potential way to overcome this constraint. Using a theory-driven annotation scheme, we examine how well LLMs annotate subtypes of solidarity and anti-solidarity in German parliamentary debates and whether the resulting labels support valid downstream inference. We first provide a comprehensive evaluation of multiple LLMs, analyzing the effects of model size, prompting strategies, fine-tuning, historical versus contemporary data, and systematic error patterns. We find that the strongest models, especially GPT-5 and gpt-oss-120B, achieve human-level agreement on this task, although their errors remain systematic and bias downstream results. To address this issue, we combine soft-label model outputs with Design-based Supervised Learning (DSL) to reduce bias in long-term trend estimates. Beyond the methodological evaluation, we interpret the resulting annotations from a social-scientific perspective to trace trends in solidarity and anti-solidarity toward migrants in postwar and contemporary Germany. Our approach shows relatively high levels of solidarity in the postwar period, especially in group-based and compassionate forms, and a marked rise in anti-solidarity since 2015, framed through exclusion, undeservingness, and resource burden. We argue that LLMs can support large-scale social-scientific text analysis, but only when their outputs are rigorously validated and statistically corrected.
[75] Support-Contra Asymmetry in LLM Explanations
Avinash Patil
Main category: cs.CL
TL;DR: LLM explanations align with predictive lexical evidence from external models - correct predictions reference supporting cues while incorrect predictions reference contradicting cues, showing support-contra asymmetry.
Details
Motivation: To understand whether LLM-generated explanations actually reference predictive cues present in input text, rather than being post-hoc rationalizations that don't reflect the model's actual reasoning process.Method: Compare LLM explanations against interpretable feature importance signals from transparent linear classifiers (logistic regression and linear SVM) on three text classification datasets (WIKIONTOLOGY, AG NEWS, IMDB). Partition predictive lexical cues into supporting vs contradicting evidence relative to predicted labels.
Result: Consistent support-contra asymmetry pattern: explanations for correct predictions reference more supporting cues and fewer contradicting cues, while explanations for incorrect predictions reference substantially more contradicting evidence. Pattern holds across datasets, model families, and feature retrieval depths.
Conclusion: LLM explanations often reflect actual predictive lexical signals when predictions are correct, but incorrect predictions are associated with explanations referencing misleading cues. External predictive evidence sources can effectively analyze LLM explanation behavior.
Abstract: Large Language Models (LLMs) increasingly produce natural language explanations alongside their predictions, yet it remains unclear whether these explanations reference predictive cues present in the input text. In this work, we present an empirical study of how LLM-generated explanations align with predictive lexical evidence from an external model in text classification tasks. To analyze this relationship, we compare explanation content against interpretable feature importance signals extracted from transparent linear classifiers. These reference models allow us to partition predictive lexical cues into supporting and contradicting evidence relative to the predicted label. Across three benchmark datasets-WIKIONTOLOGY, AG NEWS, and IMDB-we observe a consistent empirical pattern that we term support-contra asymmetry. Explanations accompanying correct predictions tend to reference more supporting lexical cues and fewer contradicting cues, whereas explanations associated with incorrect predictions reference substantially more contradicting evidence. This pattern appears consistently across datasets, across reference model families (logistic regression and linear SVM), and across multiple feature retrieval depths. These results suggest that LLM explanations often reflect lexical signals that are predictive for the task when predictions are correct, while incorrect predictions are more frequently associated with explanations that reference misleading cues present in the input. Our findings provide a simple empirical perspective on explanation-evidence alignment and illustrate how external sources of predictive evidence can be used to analyze the behavior of LLM-generated explanations.
[76] Penalizing Length: Uncovering Systematic Bias in Quality Estimation Metrics
Yilin Zhang, Wenda Xu, Zhongtao Liu, Tetsuji Nakagawa, Markus Freitag
Main category: cs.CL
TL;DR: Systematic study reveals length bias in Quality Estimation metrics for machine translation, showing they over-predict errors for longer translations and prefer shorter candidates, with proposed length normalization as solution.
Details
Motivation: Quality Estimation (QE) metrics are crucial for reference-free evaluation in machine translation and are increasingly used for data filtering and candidate reranking, but the prevalence and impact of length bias in these metrics has been underexplored.Method: Conducted systematic study of top-performing learned and LLM-as-a-Judge QE metrics across 10 diverse language pairs, analyzed length biases, traced root causes to skewed supervision distributions, and applied length normalization during training as diagnostic intervention.
Result: Revealed two critical length biases: 1) QE metrics consistently over-predict errors with increasing translation length even for high-quality error-free texts, 2) they exhibit systematic preference for shorter translations when multiple candidates of comparable quality are available. Length normalization effectively decouples error prediction from sequence length.
Conclusion: Length bias in QE metrics risks unfairly penalizing longer correct translations and can propagate into downstream pipelines; length normalization during training yields more reliable QE signals across translations of varying length.
Abstract: Quality Estimation (QE) metrics are vital in machine translation for reference-free evaluation and increasingly serve as selection criteria in data filtering and candidate reranking. However, the prevalence and impact of length bias in QE metrics have been underexplored. Through a systematic study of top-performing learned and LLM-as-a-Judge QE metrics across 10 diverse language pairs, we reveal two critical length biases: First, QE metrics consistently over-predict errors with increasing translation length, even for high-quality, error-free texts. Second, they exhibit a systematic preference for shorter translations when multiple candidates of comparable quality are available for the same source text. These biases risk unfairly penalizing longer, correct translations and can propagate into downstream pipelines that rely on QE signals for data selection or system optimization. We trace the root cause of learned QE metrics to skewed supervision distributions, where longer error-free examples are underrepresented in training data. As a diagnostic intervention, we apply length normalization during training and show that this simple modification effectively decouples error prediction from sequence length, yielding more reliable QE signals across translations of varying length.
[77] Affect, Body, Cognition, Demographics, and Emotion: The ABCDE of Text Features for Computational Affective Science
Jan Philip Wahle, Krishnapriya Vishnubhotla, Bela Gipp, Saif M. Mohammad
Main category: cs.CL
TL;DR: ABCDE dataset provides 400M+ text utterances annotated with affective, behavioral, cognitive, demographic, and emotional features to facilitate computational social science research
Details
Motivation: Current computational affective and social science research faces barriers in accessing and using annotated language data, particularly for non-CS practitioners, limiting interdisciplinary research across fields like affective science, cognitive science, and digital humanities.Method: Created the ABCDE dataset by collecting over 400 million text utterances from diverse sources including social media, blogs, books, and AI-generated content, then annotating them with a comprehensive range of features relevant to computational affective and social science.
Result: Produced a large-scale, multi-source dataset with annotations covering affect, body, cognition, demographics, and emotion features, enabling easier access to labeled data for researchers across multiple disciplines.
Conclusion: ABCDE dataset addresses the accessibility gap in computational affective and social science research by providing a comprehensive, annotated text corpus that facilitates interdisciplinary work and lowers barriers for practitioners outside computer science.
Abstract: Work in Computational Affective Science and Computational Social Science explores a wide variety of research questions about people, emotions, behavior, and health. Such work often relies on language data that is first labeled with relevant information, such as the use of emotion words or the age of the speaker. Although many resources and algorithms exist to enable this type of labeling, discovering, accessing, and using them remains a substantial impediment, particularly for practitioners outside of computer science. Here, we present the ABCDE dataset (Affect, Body, Cognition, Demographics, and Emotion), a large-scale collection of over 400 million text utterances drawn from social media, blogs, books, and AI-generated sources. The dataset is annotated with a wide range of features relevant to computational affective and social science. ABCDE facilitates interdisciplinary research across numerous fields, including affective science, cognitive science, the digital humanities, sociology, political science, and computational linguistics.
[78] Semantic Refinement with LLMs for Graph Representations
Safal Thapaliya, Zehong Wang, Jiazheng Li, Ziming Li, Yanfang Ye, Chuxu Zhang
Main category: cs.CL
TL;DR: GES framework uses in-graph exemplars to guide LLM-based semantic refinement of nodes, adapting to structure-semantics heterogeneity in graphs
Details
Motivation: Graph data has heterogeneous predictive signals - sometimes semantics dominate, sometimes structure dominates. Fixed inductive biases in models can't generalize across diverse graph domains. Existing methods add more biases but remain limited. Need a data-centric approach that treats node semantics as task-adaptive.Method: Graph-Exemplar-guided Semantic Refinement (GES): 1) Train GNN to produce predictive states, 2) Use structural and semantic similarity to retrieve in-graph exemplars, 3) Use LLM to refine node descriptions guided by these exemplars (unlike existing LLM methods that generate descriptions without graph context). Works on both text-rich and text-free graphs.
Result: Consistent improvements on both semantics-rich and structure-dominated graphs. Demonstrates effectiveness of data-centric semantic refinement under structure-semantics heterogeneity.
Conclusion: Data-centric approach using in-graph exemplars to guide LLM-based semantic refinement is effective for handling structure-semantics heterogeneity in graphs, outperforming model-centric approaches with fixed inductive biases.
Abstract: Graph-structured data exhibit substantial heterogeneity in where their predictive signals originate: in some domains, node-level semantics dominate, while in others, structural patterns play a central role. This structure-semantics heterogeneity implies that no graph learning model with a fixed inductive bias can generalize optimally across diverse graph domains. However, most existing methods address this challenge from the model side by incrementally injecting new inductive biases, which remains fundamentally limited given the open-ended diversity of real-world graphs. In this work, we take a data-centric perspective and treat node semantics as a task-adaptive variable. We propose a Graph-Exemplar-guided Semantic Refinement (GES) framework for graph representation learning which – unlike existing LLM-enhanced methods that generate node descriptions without graph context – leverages structurally and semantically similar nodes from the graph itself to guide semantic refinement. Specifically, a GNN is first trained to produce predictive states, which along with structural and semantic similarity are used to retrieve in-graph exemplars that inform an LLM in refining node descriptions. We evaluate our approach on both text-rich and text-free graphs. Results show consistent improvements on semantics-rich and structure-dominated graphs, demonstrating the effectiveness of data-centric semantic refinement under structure-semantics heterogeneity.
[79] Syntactic Framing Fragility: An Audit of Robustness in LLM Ethical Decisions
Katherine Elkins, Jon Chun
Main category: cs.CL
TL;DR: A framework called Syntactic Framing Fragility (SFF) is introduced to measure how large language models’ decisions change under logically equivalent syntactic transformations, particularly focusing on negation sensitivity in high-stakes scenarios.
Details
Motivation: Large language models show systematic negation sensitivity, but there's no operational framework to measure this vulnerability at deployment scale, especially for high-stakes decisions where inconsistent responses to logically equivalent questions could have serious consequences.Method: The authors introduce Syntactic Framing Fragility (SFF) with Logical Polarity Normalization to isolate syntactic effects, enabling comparison across positive and negative framings while controlling for polarity inversion. They use the Syntactic Variation Index (SVI) as a robustness metric and audit 23 models across 14 high-stakes scenarios (39,975 decisions).
Result: Open-source models show 2.2x higher fragility than commercial counterparts. Negation-bearing syntax is the dominant failure mode, with some models endorsing actions at 80-97% rates even when asked whether agents should not act. Chain-of-thought reasoning reduces fragility in some but not all cases.
Conclusion: The paper establishes ground-truth effect sizes for a phenomenon previously characterized only qualitatively, provides scenario-stratified risk profiles, and offers an operational checklist compatible with EU AI Act and NIST RMF requirements for deployment safety.
Abstract: Large language models exhibit systematic negation sensitivity, yet no operational framework exists to measure this vulnerability at deployment scale, especially in high-stakes decisions. We introduce Syntactic Framing Fragility (SFF), a framework for quantifying decision consistency under logically equivalent syntactic transformations. SFF isolates syntactic effects via Logical Polarity Normalization, enabling direct comparison across positive and negative framings while controlling for polarity inversion, and provides the Syntactic Variation Index (SVI) as a robustness metric suitable for CI/CD integration. Auditing 23 models across 14 high-stakes scenarios (39,975 decisions), we establish ground-truth effect sizes for a phenomenon previously characterized only qualitatively and find that open-source models exhibit $2.2x higher fragility than commercial counterparts. Negation-bearing syntax is the dominant failure mode, with some models endorsing actions at 80-97% rates even when asked whether agents not act. These patterns are consistent with negation suppression failure documented in prior work, with chain-of-thought reasoning reducing fragility in some but not all cases. We provide scenario-stratified risk profiles and offer an operational checklist compatible with EU AI Act and NIST RMF requirements. Code, data, and scenarios will be released upon publication.
[80] Generation-Step-Aware Framework for Cross-Modal Representation and Control in Multilingual Speech-Text Models
Toshiki Nakai, Varsha Suresh, Vera Demberg
Main category: cs.CL
TL;DR: Analysis of cross-modal language alignment in multilingual speech-text models reveals time-dependent and function-specific shared computation patterns between speech and text modalities.
Details
Motivation: To understand whether cross-modal language alignment in multilingual speech-text models reflects shared computation for the same language or modality-specific processing, and to develop a framework for evaluating cross-modal computation dynamics.Method: Introduced a generation-step-aware framework that: (1) identifies language-selective neurons for each modality at different decoding steps, (2) decomposes them into language-representation and language-control roles, and (3) enables cross-modal comparison via overlap measures and causal intervention including cross-modal steering of output language. Applied to SeamlessM4T v2 model.
Result: Cross-modal language alignment is strongest at the first decoding step where language-representation neurons are shared across modalities, but weakens as generation proceeds. Language-control neurons identified from speech transfer causally to text generation, revealing partially shared circuitry for output-language control that strengthens at later decoding steps.
Conclusion: Cross-modal processing in multilingual speech-text models is both time- and function-dependent, with language-representation shared early and language-control circuitry shared later, providing a nuanced view of multilingual computation.
Abstract: Multilingual speech-text models rely on cross-modal language alignment to transfer knowledge between speech and text, but it remains unclear whether this reflects shared computation for the same language or modality-specific processing. We introduce a generation-step-aware framework for evaluating cross-modal computation that (i) identifies language-selective neurons for each modality at different decoding steps, (ii) decomposes them into language-representation and language-control roles, and (iii) enables cross-modal comparison via overlap measures and causal intervention, including cross-modal steering of output language. Applying our framework to SeamlessM4T v2, we find that cross-modal language alignment is strongest at the first decoding step, where language-representation neurons are shared across modalities, but weakens as generation proceeds, indicating a shift toward modality-specific autoregressive processing. In contrast, language-control neurons identified from speech transfer causally to text generation, revealing partially shared circuitry for output-language control that strengthens at later decoding steps. These results show that cross-modal processing is both time- and function-dependent, providing a more nuanced view of multilingual computation in speech-text models.
[81] Towards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective
Hao Fang, Tianyi Zhang, Tianqu Zhuang, Jiawei Kong, Kuofeng Gao, Bin Chen, Leqi Liang, Shu-Tao Xia, Ke Xu
Main category: cs.CL
TL;DR: Proposes an information-theoretic defense against logit-based distillation attacks on proprietary LLMs by minimizing conditional mutual information between teacher logits and queries.
Details
Motivation: Existing defenses only address text-based distillation, leaving logit-based distillation attacks unexplored. Proprietary LLMs have economic value and are exposed as black-box APIs, making them vulnerable to knowledge extraction via distillation attacks.Method: Characterizes distillation-relevant information using conditional mutual information (CMI) between teacher logits and input queries conditioned on ground-truth labels. Proposes learning a transformation matrix to purify original outputs, optimized via a CMI-inspired anti-distillation objective that removes distillation-relevant information while preserving utility.
Result: Extensive experiments across multiple LLMs and strong distillation algorithms show the method significantly degrades distillation performance while preserving task accuracy, effectively protecting models’ intellectual property.
Conclusion: The proposed information-theoretic approach provides an effective defense against logit-based distillation attacks on proprietary LLMs, addressing a previously unexplored vulnerability in model protection.
Abstract: Proprietary large language models (LLMs) embody substantial economic value and are generally exposed only as black-box APIs, yet adversaries can still exploit their outputs to extract knowledge via distillation. Existing defenses focus exclusively on text-based distillation, leaving the important logit-based distillation largely unexplored. In this work, we analyze this problem and present an effective solution from an information-theoretic perspective. We characterize distillation-relevant information in teacher outputs using the conditional mutual information (CMI) between teacher logits and input queries conditioned on ground-truth labels. This quantity captures contextual information beneficial for model extraction, motivating us to defend distillation via CMI minimization. Guided by our theoretical analysis, we propose learning a transformation matrix that purifies the original outputs to enhance distillation resistance. We further derive a CMI-inspired anti-distillation objective to optimize this transformation, which effectively removes distillation-relevant information while preserving output utility. Extensive experiments across multiple LLMs and strong distillation algorithms demonstrate that the proposed method significantly degrades distillation performance while preserving task accuracy, effectively protecting models’ intellectual property.
[82] What Makes a Good Doctor Response? A Study on Text-Based Telemedicine
Adrian Cosma, Cosmin Dumitrache, Emilian Radoi
Main category: cs.CL
TL;DR: Analysis of patient satisfaction in Romanian text-based telemedicine using language features and metadata to predict feedback, finding metadata dominates predictions while politeness/hedging correlate with positive feedback.
Details
Motivation: With the rise of text-based telemedicine and increasing reliance on patient ratings, clinicians need to understand what drives patient satisfaction in written medical consultations, especially since ratings often reflect communication quality rather than clinical accuracy.Method: Analyzed anonymized Romanian text-based telemedicine consultations, modeling feedback as binary (thumbs-up vs. negative/absent). Extracted language-agnostic features (length, structure, readability), Romanian LIWC psycholinguistic features, and politeness/hedging markers. Trained classifier with time-based split and performed SHAP-based analyses.
Result: Metadata dominated predictions as a strong prior, while response text characteristics provided smaller but actionable signal. Politeness and hedging consistently associated with positive feedback, while lexical diversity showed negative association.
Conclusion: In text-based telemedicine, metadata serves as strong predictor of patient satisfaction, but linguistic features like politeness and hedging offer actionable insights for improving communication quality and patient feedback.
Abstract: Text-based telemedicine has become an increasingly used mode of care, requiring clinicians to deliver medical advice clearly and effectively in writing. As platforms increasingly rely on patient ratings and feedback, clinicians face growing pressure to maintain satisfaction scores, even though these evaluations often reflect communication quality more than clinical accuracy. We analyse patient satisfaction signals in Romanian text-based telemedicine. Using a sample of anonymised text-based telemedicine consultations, we model feedback as a binary outcome, treating thumbs-up responses as positive and grouping negative or absent feedback into the other class. We extract from doctor responses interpretable, predominantly language-agnostic features (e.g., length, structural characteristics, readability proxies), along with Romanian LIWC psycholinguistic features and politeness/hedging markers where available. We train a classifier with a time-based split and perform SHAP-based analyses, which indicate that metadata dominates prediction, functioning as a strong prior, while characteristics of the response text provide a smaller but actionable signal. In subgroup correlation analyses, politeness and hedging are consistently associated with positive patient feedback, whereas lexical diversity shows a negative association.
[83] LLM2Vec-Gen: Generative Embeddings from Large Language Models
Parishad BehnamGhader, Vaibhav Adlakha, Fabian David Schmidt, Nicolas Chapados, Marius Mosbach, Siva Reddy
Main category: cs.CL
TL;DR: LLM2Vec-Gen is a self-supervised method that produces embeddings in LLM’s output space by learning to represent the model’s potential response, achieving SOTA on MTEB while preserving LLM capabilities like safety alignment and reasoning.
Details
Motivation: Traditional contrastive fine-tuning of LLM-based text embedders discards the LLM's output semantics by mapping inputs/outputs into a new representational space. The authors want to preserve these valuable semantics while creating effective embeddings.Method: Append trainable special tokens to input and optimize them to compress the LLM’s own response into fixed-length embeddings. Uses unsupervised embedding teacher and reconstruction objective. LLM backbone remains frozen, training requires only unlabeled queries.
Result: Achieves state-of-the-art self-supervised performance on MTEB (8.8% improvement over unsupervised embedding teacher). Preserves LLM capabilities: 22.6% reduction in harmful content retrieval (safety alignment) and 35.6% improvement on reasoning-intensive retrieval. Embeddings are interpretable and can be decoded back to text.
Conclusion: LLM2Vec-Gen provides a self-supervised approach that produces embeddings directly in LLM’s output space, preserving valuable semantic information while achieving strong performance on embedding benchmarks.
Abstract: Fine-tuning LLM-based text embedders via contrastive learning maps inputs and outputs into a new representational space, discarding the LLM’s output semantics. We propose LLM2Vec-Gen, a self-supervised alternative that instead produces embeddings directly in the LLM’s output space by learning to represent the model’s potential response. Specifically, trainable special tokens are appended to the input and optimized to compress the LLM’s own response into a fixed-length embedding, guided by an unsupervised embedding teacher and a reconstruction objective. Crucially, the LLM backbone remains frozen and training requires only unlabeled queries. LLM2Vec-Gen achieves state-of-the-art self-supervised performance on the Massive Text Embedding Benchmark (MTEB), improving by 8.8% over the unsupervised embedding teacher. Since the embeddings preserve the LLM’s response-space semantics, they inherit capabilities such as safety alignment (up to 22.6% reduction in harmful content retrieval) and reasoning (up to 35.6% improvement on reasoning-intensive retrieval). Finally, the learned embeddings are also interpretable: they can be decoded back into text to reveal their semantic content.
[84] Current LLMs still cannot ’talk much’ about grammar modules: Evidence from syntax
Mohammed Q. Shormani, Yehia A. AlSohbani
Main category: cs.CL
TL;DR: LLMs like ChatGPT struggle with accurate translation of technical syntax terminology from English to Arabic, achieving only 25% accuracy on core generative syntax terms.
Details
Motivation: To evaluate how well Large Language Models can handle specialized linguistic terminology, particularly in translating core syntax properties from English to Arabic, given the technical nature of generative syntax terms.Method: Collected 44 technical terms from generative syntax literature, had them translated by both human experts and ChatGPT-5, then conducted comparative analysis using analytical and comparative approaches to assess translation accuracy.
Result: Only 25% of ChatGPT translations were accurate, 38.6% were inaccurate, and 36.4% were partially correct. The model struggled with syntactic and semantic challenges in technical terminology translation.
Conclusion: LLMs still cannot effectively handle specialized linguistic terminology translation, requiring closer collaboration between AI specialists and linguists to improve models’ understanding of technical domains.
Abstract: We aim to examine the extent to which Large Language Models (LLMs) can ’talk much’ about grammar modules, providing evidence from syntax core properties translated by ChatGPT into Arabic. We collected 44 terms from generative syntax previous works, including books and journal articles, as well as from our experience in the field. These terms were translated by humans, and then by ChatGPT-5. We then analyzed and compared both translations. We used an analytical and comparative approach in our analysis. Findings unveil that LLMs still cannot ’talk much’ about the core syntax properties embedded in the terms under study involving several syntactic and semantic challenges: only 25% of ChatGPT translations were accurate, while 38.6% were inaccurate, and 36.4.% were partially correct, which we consider appropriate. Based on these findings, a set of actionable strategies were proposed, the most notable of which is a close collaboration between AI specialists and linguists to better LLMs’ working mechanism for accurate or at least appropriate translation.
[85] PAVE: Premise-Aware Validation and Editing for Retrieval-Augmented LLMs
Tianyi Huang, Caden Yang, Emily Yin, Eric Wang, Michael Zhang
Main category: cs.CL
TL;DR: PAVE adds an inference-time validation layer to retrieval-augmented language models that decomposes retrieved context into atomic facts, validates answer support, and revises low-support outputs before finalization.
Details
Motivation: Current retrieval-augmented language models often commit to answers without explicitly checking whether the retrieved evidence actually supports their conclusions, leading to potential inconsistencies in evidence-grounded reasoning.Method: PAVE implements a four-step process: 1) decomposes retrieved context into question-conditioned atomic facts (premises), 2) drafts an initial answer, 3) scores how well the draft is supported by extracted premises, and 4) revises low-support outputs before finalization, creating an auditable trace of reasoning.
Result: PAVE outperforms simpler post-retrieval baselines in evidence-grounded QA settings, with the largest gain reaching 32.7 accuracy points on a span-grounded benchmark, demonstrating improved evidence-grounded consistency.
Conclusion: Explicit premise extraction combined with support-gated revision can significantly strengthen evidence-grounded consistency in retrieval-augmented LLM systems, providing auditable reasoning traces.
Abstract: Retrieval-augmented language models can retrieve relevant evidence yet still commit to answers before explicitly checking whether the retrieved context supports the conclusion. We present PAVE (Premise-Grounded Answer Validation and Editing), an inference-time validation layer for evidence-grounded question answering. PAVE decomposes retrieved context into question-conditioned atomic facts, drafts an answer, scores how well that draft is supported by the extracted premises, and revises low-support outputs before finalization. The resulting trace makes answer commitment auditable at the level of explicit premises, support scores, and revision decisions. In controlled ablations with a fixed retriever and backbone, PAVE outperforms simpler post-retrieval baselines in two evidence-grounded QA settings, with the largest gain reaching 32.7 accuracy points on a span-grounded benchmark. We view these findings as proof-of-concept evidence that explicit premise extraction plus support-gated revision can strengthen evidence-grounded consistency in retrieval-augmented LLM systems.
[86] Beyond Via: Analysis and Estimation of the Impact of Large Language Models in Academic Papers
Mingmeng Geng, Yuhang Dong, Thierry Poibeau
Main category: cs.CL
TL;DR: Analysis shows LLMs are changing academic writing patterns in detectable ways, with word frequency shifts and model-specific signatures that current classifiers struggle to identify.
Details
Motivation: To investigate how large language models are influencing academic writing patterns in arXiv papers, particularly focusing on detectable shifts in word usage and model-specific signatures that may not have received sufficient attention.Method: Analyzed arXiv papers to identify word usage shifts, used linear interpretable approaches to quantify LLM effects, accounting for differences between models and prompts, and tested classifier performance on multi-class model identification tasks.
Result: Found significant word frequency shifts (increased “beyond” and “via” in titles, decreased “the” and “of” in abstracts), current classifiers struggle to identify specific LLM sources in multi-class tasks, and real-world LLM usage shows heterogeneous and dynamic patterns.
Conclusion: LLMs are measurably influencing academic writing with detectable patterns, but current detection methods face challenges due to model similarities and evolving usage patterns, requiring more sophisticated approaches.
Abstract: Through an analysis of arXiv papers, we report several shifts in word usage that are likely driven by large language models (LLMs) but have not previously received sufficient attention, such as the increased frequency of “beyond” and “via” in titles and the decreased frequency of “the” and “of” in abstracts. Due to the similarities among different LLMs, experiments show that current classifiers struggle to accurately determine which specific model generated a given text in multi-class classification tasks. Meanwhile, variations across LLMs also result in evolving patterns of word usage in academic papers. By adopting a direct and highly interpretable linear approach and accounting for differences between models and prompts, we quantitatively assess these effects and show that real-world LLM usage is heterogeneous and dynamic.
[87] Tailoring AI-Driven Reading Scaffolds to the Distinct Needs of Neurodiverse Learners
Soufiane Jhilal, Eleonora Pasqua, Caterina Marchesi, Riccardo Corradi, Martina Galletti
Main category: cs.CL
TL;DR: Study examines different reading scaffolds (text segmentation, pictograms, labels) for neurodiverse learners, finding heterogeneous responses and no universally optimal approach.
Details
Motivation: Neurodiverse learners need reading supports, but rich scaffolds can sometimes overload attention and working memory rather than improve comprehension. The study aims to understand how different types of scaffolds (structural vs. semantic) affect comprehension and reading experience in supervised inclusive contexts.Method: Used an adapted reading interface with four modalities: unmodified text, sentence-segmented text, segmented text with pictograms, and segmented text with pictograms plus keyword labels. Conducted within-subject pilot with 14 primary-school learners with special educational needs and disabilities, measuring reading comprehension with standardized questions and collecting child- and therapist-reported experience measures with open-ended feedback.
Result: Results showed heterogeneous responses - some learners benefited from segmentation and pictograms, while others showed increased coordination costs with visual scaffolds. Experience ratings showed limited differences between modalities, with some effects linked to clinical complexity. Learners frequently requested simpler wording and additional visual supports in open-ended feedback.
Conclusion: No single scaffold is universally optimal, reinforcing the need for calibrated, adjustable scaffolding. Provides design implications for human-AI co-regulation in supervised inclusive reading contexts.
Abstract: Neurodiverse learners often require reading supports, yet increasing scaffold richness can sometimes overload attention and working memory rather than improve comprehension. Grounded in the Construction-Integration model and a contingent scaffolding perspective, we examine how structural versus semantic scaffolds shape comprehension and reading experience in a supervised inclusive context. Using an adapted reading interface, we compared four modalities: unmodified text, sentence-segmented text, segmented text with pictograms, and segmented text with pictograms plus keyword labels. In a within-subject pilot with 14 primary-school learners with special educational needs and disabilities, we measured reading comprehension using standardized questions and collected brief child- and therapist-reported experience measures alongside open-ended feedback. Results highlight heterogeneous responses as some learners showed patterns consistent with benefits from segmentation and pictograms, while others showed patterns consistent with increased coordination costs when visual scaffolds were introduced. Experience ratings showed limited differences between modalities, with some apparent effects linked to clinical complexity, particularly for perceived ease of understanding. Open-ended feedback of the learners frequently requested simpler wording and additional visual supports. These findings suggest that no single scaffold is universally optimal, reinforcing the need for calibrated, adjustable scaffolding and provide design implications for human-AI co-regulation in supervised inclusive reading contexts.
[88] MemRerank: Preference Memory for Personalized Product Reranking
Zhiyuan Peng, Xuyang Wu, Huaixiao Tou, Yi Fang, Yu Gong
Main category: cs.CL
TL;DR: MemRerank: A preference memory framework that distills user purchase history into concise signals for personalized product reranking in LLM-based shopping agents, improving accuracy by up to +10.61 points.
Details
Motivation: Current LLM-based shopping agents struggle with personalization because they naively append raw purchase histories to prompts, which is ineffective due to noise, length, and relevance mismatch. There's a need for better ways to distill user preferences from purchase history for effective personalization.Method: Proposes MemRerank, a preference memory framework that extracts concise, query-independent signals from user purchase history. Builds an end-to-end benchmark with a 1-in-5 selection task to evaluate memory quality and reranking utility. Trains the memory extractor using reinforcement learning with downstream reranking performance as supervision.
Result: MemRerank consistently outperforms no-memory, raw-history, and off-the-shelf memory baselines, achieving up to +10.61 absolute points improvement in 1-in-5 accuracy with two different LLM-based rerankers.
Conclusion: Explicit preference memory is a practical and effective building block for personalization in agentic e-commerce systems, demonstrating that distilled preference signals outperform raw history for LLM-based shopping agents.
Abstract: LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch. We propose MemRerank, a preference memory framework that distills user purchase history into concise, query-independent signals for personalized product reranking. To study this problem, we build an end-to-end benchmark and evaluation framework centered on an LLM-based \textbf{1-in-5} selection task, which measures both memory quality and downstream reranking utility. We further train the memory extractor with reinforcement learning (RL), using downstream reranking performance as supervision. Experiments with two LLM-based rerankers show that MemRerank consistently outperforms no-memory, raw-history, and off-the-shelf memory baselines, yielding up to \textbf{+10.61} absolute points in 1-in-5 accuracy. These results suggest that explicit preference memory is a practical and effective building block for personalization in agentic e-commerce systems.
[89] MemFactory: Unified Inference & Training Framework for Agent Memory
Ziliang Guo, Ziheng Li, Bo Tang, Feiyu Xiong, Zhiyu Li
Main category: cs.CL
TL;DR: MemFactory is a unified training and inference framework for memory-augmented LLM agents that abstracts memory lifecycle into modular components and integrates GRPO for policy optimization.
Details
Motivation: Existing RL-based memory optimization approaches for LLMs are fragmented and task-specific, lacking unified infrastructure for integration, training, and evaluation of memory-augmented agents.Method: MemFactory abstracts memory lifecycle (extraction, updating, retrieval) into atomic plug-and-play components with “Lego-like” architecture, integrates Group Relative Policy Optimization (GRPO) for fine-tuning memory management policies, and supports cutting-edge paradigms like Memory-R1, RMM, and MemAgent.
Result: Empirical validation on MemAgent architecture shows average performance improvements over base models with relative gains up to 14.8% across evaluation sets.
Conclusion: MemFactory provides standardized, extensible infrastructure that lowers barrier to entry for memory-driven AI agents and enables future innovations in memory-augmented LLM research.
Abstract: Memory-augmented Large Language Models (LLMs) are essential for developing capable, long-term AI agents. Recently, applying Reinforcement Learning (RL) to optimize memory operations, such as extraction, updating, and retrieval, has emerged as a highly promising research direction. However, existing implementations remain highly fragmented and task-specific, lacking a unified infrastructure to streamline the integration, training, and evaluation of these complex pipelines. To address this gap, we present MemFactory, the first unified, highly modular training and inference framework specifically designed for memory-augmented agents. Inspired by the success of unified fine-tuning frameworks like LLaMA-Factory, MemFactory abstracts the memory lifecycle into atomic, plug-and-play components, enabling researchers to seamlessly construct custom memory agents via a “Lego-like” architecture. Furthermore, the framework natively integrates Group Relative Policy Optimization (GRPO) to fine-tune internal memory management policies driven by multi-dimensional environmental rewards. MemFactory provides out-of-the-box support for recent cutting-edge paradigms, including Memory-R1, RMM, and MemAgent. We empirically validate MemFactory on the open-source MemAgent architecture using its publicly available training and evaluation data. Across the evaluation sets, MemFactory improves performance over the corresponding base models on average, with relative gains of up to 14.8%. By providing a standardized, extensible, and easy-to-use infrastructure, MemFactory significantly lowers the barrier to entry, paving the way for future innovations in memory-driven AI agents.
[90] Can Large Language Models Self-Correct in Medical Question Answering? An Exploratory Study
Zaifu Zhan, Mengyuan Cui, Rui Zhang
Main category: cs.CL
TL;DR: Self-reflective prompting for medical QA shows inconsistent results - modest gains on MedQA but limited/negative impact on other datasets, with no guarantee of improvement from more reflection steps.
Details
Motivation: To evaluate the effectiveness of self-reflective (self-corrective) prompting in safety-critical medical settings, where it's widely claimed to enhance model reliability but its actual performance remains unclear.Method: Comparative analysis using GPT-4o and GPT-4o-mini on three medical QA benchmarks (MedQA, HeadQA, PubMedQA), comparing standard chain-of-thought prompting with iterative self-reflection loops and tracking prediction evolution across reflection steps.
Result: Self-reflective prompting does not consistently improve accuracy - modest gains on MedQA but limited or negative benefits on HeadQA and PubMedQA. Increasing reflection steps doesn’t guarantee better performance. Impact is highly dataset- and model-dependent.
Conclusion: Self-reflective reasoning is better viewed as an analytical tool for understanding model behavior rather than a standalone solution for improving medical QA reliability, highlighting a gap between reasoning transparency and correctness.
Abstract: Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning; meanwhile, self-reflective (self-corrective) prompting has been widely claimed to enhance model reliability by prompting LLMs to critique and revise their own reasoning, yet its effectiveness in safety-critical medical settings remains unclear. In this work, we conduct an exploratory analysis of self-reflective reasoning for medical multiple-choice question answering: using GPT-4o and GPT-4o-mini, we compare standard CoT prompting with an iterative self-reflection loop and track how predictions evolve across reflection steps on three widely used medical QA benchmarks (MedQA, HeadQA, and PubMedQA). We analyze whether self-reflection leads to error correction, error persistence, or the introduction of new errors. Our results show that self-reflective prompting does not consistently improve accuracy and its impact is highly dataset- and model-dependent: it yields modest gains on MedQA but provides limited or negative benefits on HeadQA and PubMedQA, and increasing the number of reflection steps does not guarantee better performance. These findings highlight a gap between reasoning transparency and reasoning correctness, suggesting that self-reflective reasoning is better viewed as an analytical tool for understanding model behavior rather than a standalone solution for improving medical QA reliability.
[91] Valency Classification of Mapudungun Verbal Roots. Established by the language’s own morphotactics
Andrés Chandía
Main category: cs.CL
TL;DR: This paper analyzes valency classification of verbal roots in Mapudungun using the language’s morphotactics to examine permissible suffix combinations, aiming to improve a morphological analyzer and contribute to theoretical understanding of Mapuche verb valency.
Details
Motivation: To accurately classify the valency of Mapudungun verbal roots using the language's own morphological patterns, building on previous work that identified and confirmed verbal roots.Method: Examines permissible and restricted combinations of various suffixes with verbal roots or stems in Mapuche verb forms, using the language’s morphotactics as the analytical framework.
Result: The findings improve the Dungupeyum morphological analyzer by incorporating verified valency classifications, and contribute to theoretical understanding of Mapuche verb valency issues.
Conclusion: The paper successfully applies morphotactic analysis to valency classification, enhancing both practical morphological analysis tools and theoretical understanding of Mapudungun verb structure.
Abstract: In the previous work, a lexical (re)categorisation – or confirmation of the given category – of roots identified as verbal was undertaken to determine their original category accurately. Building on this, the present paper offers an account of the valency classification of those Mapudungun roots confirmed to be verbal, using the language’s own morphotactics; specifically, by examining the permissible and restricted combinations of various suffixes with roots or verbal stems in the Mapuche verb form. As with all work conducted thus far, the results presented here aim to improve the morphological analyser (Dungupeyum) with all verified findings incorporated into the system. From a theoretical perspective, we also hope to contribute to the recognition and understanding of issues related to the valency of Mapuche verb forms.
[92] Dual Optimal: Make Your LLM Peer-like with Dignity
Xiangqi Wang, Yue Huang, Haomin Zhuang, Kehan Guo, Xiangliang Zhang
Main category: cs.CL
TL;DR: The paper proposes a “Dignified Peer” framework to address the “Evasive Servant” problem in aligned LLMs, where models both sycophantically validate flawed user beliefs while using boilerplate disclaimers to deflect responsibility.
Details
Motivation: Current aligned language models exhibit problematic dual failure modes: they sycophantically validate flawed user beliefs while simultaneously deflecting responsibility with boilerplate disclaimers, creating what the authors term the "Evasive Servant" problem.Method: The authors introduce the PersonaKnob dataset with compositional partial order structure of multiple persona preferences, use a tolerant constrained Lagrangian DPO algorithm to dynamically balance persona dimensions, and employ psychometrically calibrated Item Response Theory evaluation to disentangle latent model capabilities from confounders.
Result: Extensive empirical studies demonstrate successful creation of an LLM agent with both dignity and peer qualities, effectively addressing the Evasive Servant problem through the Dignified Peer framework.
Conclusion: The proposed Dignified Peer framework successfully counters servility with anti-sycophancy and trustworthiness while mitigating evasiveness through empathy and creativity, overcoming challenges in data supervision, objective collapse, and evaluation bias.
Abstract: Current aligned language models exhibit a dual failure mode we term the Evasive Servant: they sycophantically validate flawed user beliefs while deflecting responsibility with boilerplate disclaimers. We propose the Dignified Peer framework, which counters servility with anti-sycophancy and trustworthiness, and mitigates evasiveness through empathy and creativity. Realizing this agent requires overcoming significant challenges in data supervision, objective collapse, and evaluation bias. We address these issues by introducing the PersonaKnob dataset which features a compositional partial order structure of multiple persona preference. This data is utilized alongside a tolerant constrained Lagrangian DPO algorithm that dynamically balances all persona dimensions to prevent behavioral collapse. Additionally, we employ a psychometrically calibrated Item Response Theory evaluation protocol to disentangle latent model persona capability from confounders like judge biases. Extensive empirical studies demonstrate that our approach successfully build a LLM agent with both dignity and peer.
[93] ORBIT: Scalable and Verifiable Data Generation for Search Agents on a Tight Budget
Nandan Thakur, Zijian Chen, Xueguang Ma, Jimmy Lin
Main category: cs.CL
TL;DR: ORBIT: A synthetic training dataset for search agents with 20K reasoning-intensive queries across 15 domains, requiring 4-5 reasoning steps, generated without paid APIs using a frugal modular framework.
Details
Motivation: Training search agents for complex, multi-step reasoning tasks is challenging due to expensive human annotation or cumbersome prerequisites. There's a need for cost-effective ways to create high-quality training data for deep research tasks.Method: A frugal modular framework with four stages: seed creation, question-answer pair generation, and two verification stages (self and external). Uses web search for external verification without relying on paid API services.
Result: ORBIT-4B (trained Qwen3-4B on ORBIT dataset) achieves strong performance among sub-4B LLMs on Wikipedia question answering tasks, demonstrating the utility of synthetic datasets for search agents.
Conclusion: Synthetic datasets like ORBIT can effectively train search agents for complex reasoning tasks without expensive human annotation, and the frugal framework enables scalable dataset creation.
Abstract: Search agents, which integrate language models (LMs) with web search, are becoming crucial for answering complex user queries. Constructing training datasets for deep research tasks, involving multi-step retrieval and reasoning, remains challenging due to expensive human annotation, or cumbersome prerequisites. In this work, we introduce ORBIT, a training dataset with 20K reasoning-intensive queries with short verifiable answers, generated using a frugal framework without relying on paid API services. The modular framework relies on four stages: seed creation, question-answer pair generation, and two stages of verification: self and external. ORBIT spans 15 domains and each training pair requires 4-5 reasoning steps, with external search verification required from the complete web. We train Qwen3-4B as the base model on ORBIT using GRPO and evaluate it on Wikipedia question answering tasks. Extensive experiment results demonstrate that ORBIT-4B achieves strong performance among sub-4B LLMs as search agents, proving the utility of synthetic datasets. Our framework, code and datasets are open-sourced and available publicly.
[94] The Geometric Anatomy of Capability Acquisition in Transformers
Jayadev Billa
Main category: cs.CL
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2602.15997: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.15997&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[95] NeuroNarrator: A Generalist EEG-to-Text Foundation Model for Clinical Interpretation via Spectro-Spatial Grounding and Temporal State-Space Reasoning
Guoan Wang, Shihao Yang, Jun-en Ding, Hao Zhu, Feng Liu
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2603.16880: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.16880&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[96] The Reasoning Error About Reasoning: Why Different Types of Reasoning Require Different Representational Structures
Yiling Wu
Main category: cs.CL
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2603.21736: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21736&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[97] The Presupposition Problem in Representation Genesis
Yiling Wu
Main category: cs.CL
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2603.21745: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21745&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[98] The Evolution of Tool Use in LLM Agents: From Single-Tool Call to Multi-Tool Orchestration
Haoyuan Xu, Chang Li, Xinyan Ma, Xianhao Ou, Zihan Zhang, Tao He, Xiangyu Liu, Zixiang Wang, Jiafeng Liang, Zheng Chu, Runxuan Liu, Rongchuan Mu, Dandan Tu, Ming Liu, Bing Qin
Main category: cs.CL
TL;DR: Unable to analyze paper 2603.22862 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation without access to paper abstractMethod: Cannot determine method without access to paper abstract
Result: Cannot determine results without access to paper abstract
Conclusion: Cannot draw conclusions without access to paper abstract
Abstract: Failed to fetch summary for 2603.22862: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22862&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[99] Beyond Preset Identities: How Agents Form Stances and Boundaries in Generative Societies
Hanzhong Zhang, Siyang Song, Jindong Wang
Main category: cs.CL
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2603.23406: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.23406&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[100] Do Phone-Use Agents Respect Your Privacy?
Zhengyang Tang, Ke Ji, Xidong Wang, Zihan Ye, Xinyuan Wang, Yiduo Guo, Ziniu Li, Chenxin Li, Jingyuan Hu, Shunian Chen, Tongxu Luo, Jiaxi Bi, Zeyu Qin, Shaobo Wang, Xin Lai, Pengyuan Lyu, Junyi Li, Can Xu, Chengquan Zhang, Han Hu, Ming Yan, Benyou Wang
Main category: cs.CL
TL;DR: Failed to fetch paper summary - HTTP 429 error indicates rate limiting from arXiv API
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: Cannot analyze method as paper content is unavailable due to technical limitations
Result: No results available - arXiv API returned HTTP 429 (Too Many Requests) error
Conclusion: Cannot provide analysis due to technical limitations in accessing the paper content
Abstract: Failed to fetch summary for 2604.00986: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.00986&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.CV
[101] DOne: Decoupling Structure and Rendering for High-Fidelity Design-to-Code Generation
Xinhao Huang, Jinke Yu, Wenhao Xu, Zeyi Wen, Ying Zhou, Junzhuo Liu, Junhao Ji, Zulong Chen
Main category: cs.CV
TL;DR: DOne is a framework for design-to-code generation that decouples structure understanding from element rendering to address VLMs’ limitations in reconciling high-level hierarchy with fine-grained visual details.
Details
Motivation: Vision Language Models (VLMs) for design-to-code generation suffer from a "holistic bottleneck" - they fail to reconcile high-level structural hierarchy with fine-grained visual details, resulting in layout distortions or generic placeholders.Method: DOne introduces: (1) a learned layout segmentation module to decompose complex designs, avoiding heuristic cropping limitations; (2) a specialized hybrid element retriever for UI components with extreme aspect ratios and densities; (3) a schema-guided generation paradigm that bridges layout and code.
Result: DOne outperforms existing methods on the HiFi2Code benchmark (with higher layout complexity than existing datasets) in both high-level visual similarity (over 10% in GPT Score) and fine-grained element alignment. Human evaluations show 3 times productivity gain with higher visual fidelity.
Conclusion: The DOne framework successfully addresses VLMs’ limitations in design-to-code generation by decoupling structure understanding from element rendering, achieving better performance on complex layouts and improving productivity.
Abstract: While Vision Language Models (VLMs) have shown promise in Design-to-Code generation, they suffer from a “holistic bottleneck-failing to reconcile high-level structural hierarchy with fine-grained visual details, often resulting in layout distortions or generic placeholders. To bridge this gap, we propose DOne, an end-to-end framework that decouples structure understanding from element rendering. DOne introduces (1) a learned layout segmentation module to decompose complex designs, avoiding the limitations of heuristic cropping; (2) a specialized hybrid element retriever to handle the extreme aspect ratios and densities of UI components; and (3) a schema-guided generation paradigm that bridges layout and code. To rigorously assess performance, we introduce HiFi2Code, a benchmark featuring significantly higher layout complexity than existing datasets. Extensive evaluations on the HiFi2Code demonstrate that DOne outperforms exiting methods in both high-level visual similarity (e.g., over 10% in GPT Score) and fine-grained element alignment. Human evaluations confirm a 3 times productivity gain with higher visual fidelity.
[102] VideoZeroBench: Probing the Limits of Video MLLMs with Spatio-Temporal Evidence Verification
Jiahao Meng, Tan Yue, Qi Xu, Haochen Wang, Zhongwei Ren, Weisong Liu, Yuhao Wang, Renrui Zhang, Yunhai Tong, Haodong Duan
Main category: cs.CV
TL;DR: VideoZeroBench: A hierarchical benchmark for long-video QA that verifies spatio-temporal evidence, revealing major gaps in current models’ grounded reasoning capabilities.
Details
Motivation: Current video multimodal LLM evaluations suffer from inflated scores that mask deficiencies in fine-grained visual understanding and lack verification of spatio-temporal evidence supporting predictions.Method: Created VideoZeroBench with 500 manually annotated questions across 13 domains, each paired with temporal intervals and spatial bounding boxes as evidence. Introduced five-level evaluation protocol that progressively tightens evidence requirements to disentangle answering generation, temporal grounding, and spatial grounding.
Result: Gemini-3-Pro correctly answers fewer than 17% of questions under standard QA setting. Performance drops sharply with grounding constraints - no model exceeds 1% accuracy when both correct answering and accurate spatio-temporal localization are required, with most failing to achieve any correct grounded predictions.
Conclusion: There’s a significant gap between surface-level answer correctness and genuine evidence-based reasoning, revealing that grounded video understanding remains a major bottleneck for long-video QA.
Abstract: Recent video multimodal large language models achieve impressive results across various benchmarks. However, current evaluations suffer from two critical limitations: (1) inflated scores can mask deficiencies in fine-grained visual understanding and reasoning, and (2) answer correctness is often measured without verifying whether models identify the precise spatio-temporal evidence supporting their predictions. To address this, we present VideoZeroBench, a hierarchical benchmark designed for challenging long-video question answering that rigorously verifies spatio-temporal evidence. It comprises 500 manually annotated questions across 13 domains, paired with temporal intervals and spatial bounding boxes as evidence. To disentangle answering generation, temporal grounding, and spatial grounding, we introduce a five-level evaluation protocol that progressively tightens evidence requirements. Experiments show that even Gemini-3-Pro correctly answers fewer than 17% of questions under the standard end-to-end QA setting (Level-3). When grounding constraints are imposed, performance drops sharply: No model exceeds 1% accuracy when both correct answering and accurate spatio-temporal localization are required (Level-5), with most failing to achieve any correct grounded predictions. These results expose a significant gap between surface-level answer correctness and genuine evidence-based reasoning, revealing that grounded video understanding remains a bottleneck for long-video QA. We further analyze performance across minimal evidence spans, atomic abilities, and inference paradigms, providing insights for future research in grounded video reasoning. The benchmark and code will be made publicly available.
[103] CLPIPS: A Personalized Metric for AI-Generated Image Similarity
Khoi Trinh, Jay Rothenberger, Scott Seidenberger, Dimitrios Diochnos, Anindya Maiti
Main category: cs.CV
TL;DR: CLPIPS is a customized version of LPIPS that adapts image similarity metrics to human judgments through lightweight fine-tuning, improving alignment in text-to-image workflows.
Details
Motivation: Existing image similarity metrics (LPIPS, CLIP) often fail to align with human judgments in context-specific or user-driven tasks, particularly in text-to-image generation workflows where iterative prompt refinement is crucial.Method: Introduces CLPIPS, a customized extension of LPIPS that fine-tunes only the layer combination weights using margin ranking loss on human-ranked image pairs from iterative text-to-image generation tasks.
Result: CLPIPS achieves stronger correlation and agreement with human judgments than baseline LPIPS, as measured by Spearman rank correlation and Intraclass Correlation Coefficient.
Conclusion: Lightweight, human-augmented fine-tuning can meaningfully improve perceptual alignment between metric predictions and human judgments, making similarity metrics more effective as adaptive components in human-in-the-loop text-to-image workflows.
Abstract: Iterative prompt refinement is central to reproducing target images with text to image generative models. Previous studies have incorporated image similarity metrics (ISMs) as additional feedback to human users. Existing ISMs such as LPIPS and CLIP provide objective measures of image likeness but often fail to align with human judgments, particularly in context specific or user driven tasks. In this paper, we introduce Customized Learned Perceptual Image Patch Similarity (CLPIPS), a customized extension of LPIPS that adapts a metric’s notion of similarity directly to human judgments. We aim to explore whether lightweight, human augmented fine tuning can meaningfully improve perceptual alignment, positioning similarity metrics as adaptive components for human in the loop workflows with text to image tools. We evaluate CLPIPS on a human subject dataset in which participants iteratively regenerate target images and rank generated outputs by perceived similarity. Using margin ranking loss on human ranked image pairs, we fine tune only the LPIPS layer combination weights and assess alignment via Spearman rank correlation and Intraclass Correlation Coefficient. Our results show that CLPIPS achieves stronger correlation and agreement with human judgments than baseline LPIPS. Rather than optimizing absolute metric performance, our work emphasizes improving alignment consistency between metric predictions and human ranks, demonstrating that even limited human specific fine tuning can meaningfully enhance perceptual alignment in human in the loop text to image workflows.
[104] TOL: Textual Localization with OpenStreetMap
Youqi Liao, Shuhao Kang, Jingyu Xu, Olaf Wysocki, Yan Xia, Jianping Li, Zhen Dong, Bisheng Yang, Xieyuanli Chen
Main category: cs.CV
TL;DR: TOLoc: A text-to-OSM localization framework that estimates 2-DoF positions from textual scene descriptions using OpenStreetMap, with a coarse-to-fine approach combining semantic and directional information.
Details
Motivation: Natural language provides intuitive spatial intent expression, but text-to-OSM localization remains unexplored despite OSM's advantages as a compact, freely available map representation with rich semantic information.Method: Coarse-to-fine framework: 1) Coarse stage extracts direction-aware features from text and OSM tiles to retrieve candidate locations; 2) Fine stage fuses textual descriptors with local map features via alignment module to regress 2-DoF pose.
Result: TOLoc outperforms best existing method by 6.53%, 9.93%, and 8.31% at 5m, 10m, and 25m thresholds respectively, with strong generalization to unseen environments.
Conclusion: The proposed TOLoc framework effectively addresses text-to-OSM localization, demonstrating strong performance and generalization, with potential for large-scale applications using freely available OSM data.
Abstract: Natural language provides an intuitive way to express spatial intent in geospatial applications. While existing localization methods often rely on dense point cloud maps or high-resolution imagery, OpenStreetMap (OSM) offers a compact and freely available map representation that encodes rich semantic and structural information, making it well suited for large-scale localization. However, text-to-OSM (T2O) localization remains largely unexplored. In this paper, we formulate the T2O global localization task, which aims to estimate accurate 2 degree-of-freedom (DoF) positions in urban environments from textual scene descriptions without relying on geometric observations or GNSS-based initial location. To support the proposed task, we introduce TOL, a large-scale benchmark spanning multiple continents and diverse urban environments. TOL contains approximately 121K textual queries paired with OSM map tiles and covers about 316 km of road trajectories across Boston, Karlsruhe, and Singapore. We further propose TOLoc, a coarse-to-fine localization framework that explicitly models the semantics of surrounding objects and their directional information. In the coarse stage, direction-aware features are extracted from both textual descriptions and OSM tiles to construct global descriptors, which are used to retrieve candidate locations for the query. In the fine stage, the query text and top-1 retrieved tile are jointly processed, where a dedicated alignment module fuses textual descriptor and local map features to regress the 2-DoF pose. Experimental results demonstrate that TOLoc achieves strong localization performance, outperforming the best existing method by 6.53%, 9.93%, and 8.31% at 5m, 10m, and 25m thresholds, respectively, and shows strong generalization to unseen environments. Dataset, code and models will be publicly available at: https://github.com/WHU-USI3DV/TOL.
[105] Cross-Scale MAE: A Tale of Multi-Scale Exploitation in Remote Sensing
Maofeng Tang, Andrei Cozma, Konstantinos Georgiou, Hairong Qi
Main category: cs.CV
TL;DR: Cross-Scale MAE: A self-supervised learning approach for remote sensing images that uses scale augmentation and cross-scale consistency constraints to learn meaningful representations for downstream tasks.
Details
Motivation: Remote sensing images present unique challenges including extensive geographic coverage, hardware limitations, and misaligned multi-scale images. The paper aims to address multi-scale representation learning within self-supervised learning frameworks specifically for remote sensing image understanding.Method: Built upon Masked Auto-Encoder (MAE), Cross-Scale MAE employs scale augmentation techniques during pre-training and enforces cross-scale consistency constraints through both contrastive and generative losses. Implementation leverages xFormers library for accelerated training on single GPU.
Result: Experimental evaluations demonstrate that Cross-Scale MAE exhibits superior performance compared to standard MAE and other state-of-the-art remote sensing MAE methods.
Conclusion: Cross-Scale MAE provides an effective self-supervised learning approach for remote sensing images that handles multi-scale challenges and produces high-quality representations suitable for various downstream tasks.
Abstract: Remote sensing images present unique challenges to image analysis due to the extensive geographic coverage, hardware limitations, and misaligned multi-scale images. This paper revisits the classical multi-scale representation learning problem but under the general framework of self-supervised learning for remote sensing image understanding. We present Cross-Scale MAE, a self-supervised model built upon the Masked Auto-Encoder (MAE).During pre-training, Cross-Scale MAE employs scale augmentation techniques and enforces cross-scale consistency constraints through both contrastive and generative losses to ensure consistent and meaningful representations well-suited for a wide range of downstream tasks. Further, our implementation leverages the xFormers library to accelerate network pre-training on a single GPU while maintaining the quality of learned representations. Experimental evaluations demonstrate that Cross-Scale MAE exhibits superior performance compared to standard MAE and other state-of-the-art remote sensing MAE methods.
[106] Camouflage-aware Image-Text Retrieval via Expert Collaboration
Yao Jiang, Zhongkuan Mao, Xuan Wu, Keren Fu, Qijun Zhao
Main category: cs.CV
TL;DR: A new camouflage-aware image-text retrieval task with dedicated dataset and dual-branch network for better cross-modal alignment in camouflaged scenes.
Details
Motivation: Camouflaged scene understanding lacks robust image-text cross-modal alignment, limiting deeper understanding of camouflaged scenarios and applications.Method: Proposes camouflage-aware image-text retrieval (CA-ITR) task, constructs CamoIT dataset with 10.5K samples, and develops CECNet with dual-branch visual encoder (holistic + camouflage-specific) and confidence-conditioned graph attention mechanism.
Result: CECNet achieves ~29% overall CA-ITR accuracy boost, surpassing seven representative retrieval models on the new CamoIT benchmark.
Conclusion: The work addresses cross-modal alignment challenges in camouflaged scenes, providing new dataset, task formulation, and effective network architecture for improved retrieval performance.
Abstract: Camouflaged scene understanding (CSU) has attracted significant attention due to its broad practical implications. However, in this field, robust image-text cross-modal alignment remains under-explored, hindering deeper understanding of camouflaged scenarios and their related applications. To this end, we focus on the typical image-text retrieval task, and formulate a new task dubbed ``camouflage-aware image-text retrieval’’ (CA-ITR). We first construct a dedicated camouflage image-text retrieval dataset (CamoIT), comprising $\sim$10.5K samples with multi-granularity textual annotations. Benchmark results conducted on CamoIT reveal the underlying challenges of CA-ITR for existing cutting-edge retrieval techniques, which are mainly caused by objects’ camouflage properties as well as those complex image contents. As a solution, we propose a camouflage-expert collaborative network (CECNet), which features a dual-branch visual encoder: one branch captures holistic image representations, while the other incorporates a dedicated model to inject representations of camouflaged objects. A novel confidence-conditioned graph attention (C\textsuperscript{2}GA) mechanism is incorporated to exploit the complementarity across branches. Comparative experiments show that CECNet achieves $\sim$29% overall CA-ITR accuracy boost, surpassing seven representative retrieval models. The dataset and code will be available at https://github.com/jiangyao-scu/CA-ITR.
[107] Moiré Video Authentication: A Physical Signature Against AI Video Generation
Yuan Qing, Kunyu Zheng, Lingxiao Li, Boqing Gong, Chang Xiao
Main category: cs.CV
TL;DR: A physics-based authentication method that uses Moiré effect patterns to distinguish real videos from AI-generated ones by analyzing the correlation between fringe phase and grating displacement.
Details
Motivation: As AI-generated videos become increasingly realistic and difficult to distinguish from real footage, there's a need for reliable authentication methods. The authors propose leveraging deterministic optical phenomena that real cameras naturally produce but that generative models cannot faithfully reproduce.Method: The approach exploits the Moiré effect - interference fringes formed when a camera views a compact two-layer grating structure. They derive the Moiré motion invariant showing that fringe phase and grating image displacement are linearly coupled by optical geometry. A verifier extracts both signals from video and tests their correlation.
Result: The method was validated on both real-captured and AI-generated videos from multiple state-of-the-art generators. Real and AI-generated videos produced significantly different correlation signatures, demonstrating robust differentiation capability.
Conclusion: Deterministic optical phenomena like the Moiré effect can serve as physically grounded, verifiable signatures against AI-generated video, providing a robust authentication mechanism.
Abstract: Recent advances in video generation have made AI-synthesized content increasingly difficult to distinguish from real footage. We propose a physics-based authentication signature that real cameras produce naturally, but that generative models cannot faithfully reproduce. Our approach exploits the Moiré effect: the interference fringes formed when a camera views a compact two-layer grating structure. We derive the Moiré motion invariant, showing that fringe phase and grating image displacement are linearly coupled by optical geometry, independent of viewing distance and grating structure. A verifier extracts both signals from video and tests their correlation. We validate the invariant on both real-captured and AI-generated videos from multiple state-of-the-art generators, and find that real and AI-generated videos produce significantly different correlation signatures, suggesting a robust means of differentiating them. Our work demonstrates that deterministic optical phenomena can serve as physically grounded, verifiable signatures against AI-generated video.
[108] SA-CycleGAN-2.5D: Self-Attention CycleGAN with Tri-Planar Context for Multi-Site MRI Harmonization
Ishrith Gowda, Chunwei Liu
Main category: cs.CV
TL;DR: SA-CycleGAN-2.5D: A domain adaptation framework for multi-site neuroimaging harmonization using 2.5D tri-planar manifold injection with global self-attention to address scanner-induced covariate shifts while preserving anatomical integrity.
Details
Motivation: Multi-site neuroimaging analysis suffers from scanner-induced covariate shifts where acquisition variance often exceeds biological pathology variance, confounding radiomic reproducibility. Existing methods either operate in feature space (precluding spatial tasks) or are limited by local receptive fields, failing to model global intensity correlations characteristic of field-strength bias.Method: Proposes SA-CycleGAN-2.5D with three innovations: (1) 2.5D tri-planar manifold injection preserving through-plane gradients at O(HW) complexity; (2) U-ResNet generator with dense voxel-to-voxel self-attention to model global scanner field biases beyond CNN receptive field limits; (3) Spectrally-normalized discriminator constraining Lipschitz constant for stable adversarial optimization.
Result: On 654 glioma patients across two institutional domains (BraTS and UPenn-GBM), reduces Maximum Mean Discrepancy by 99.1% (1.729 → 0.015) and degrades domain classifier accuracy to near-chance (59.7%). Ablation shows global attention is statistically essential (Cohen’s d = 1.32, p < 0.001) for heterogeneous-to-homogeneous translation.
Conclusion: The framework bridges 2D efficiency and 3D consistency, yielding voxel-level harmonized images that preserve tumor pathophysiology, enabling reproducible multi-center radiomic analysis by effectively addressing scanner-induced domain shifts.
Abstract: Multi-site neuroimaging analysis is fundamentally confounded by scanner-induced covariate shifts, where the marginal distribution of voxel intensities $P(\mathbf{x})$ varies non-linearly across acquisition protocols while the conditional anatomy $P(\mathbf{y}|\mathbf{x})$ remains constant. This is particularly detrimental to radiomic reproducibility, where acquisition variance often exceeds biological pathology variance. Existing statistical harmonization methods (e.g., ComBat) operate in feature space, precluding spatial downstream tasks, while standard deep learning approaches are theoretically bounded by local effective receptive fields (ERF), failing to model the global intensity correlations characteristic of field-strength bias. We propose SA-CycleGAN-2.5D, a domain adaptation framework motivated by the $HΔH$-divergence bound of Ben-David et al., integrating three architectural innovations: (1) A 2.5D tri-planar manifold injection preserving through-plane gradients $\nabla_z$ at $O(HW)$ complexity; (2) A U-ResNet generator with dense voxel-to-voxel self-attention, surpassing the $O(\sqrt{L})$ receptive field limit of CNNs to model global scanner field biases; and (3) A spectrally-normalized discriminator constraining the Lipschitz constant ($K_D \le 1$) for stable adversarial optimization. Evaluated on 654 glioma patients across two institutional domains (BraTS and UPenn-GBM), our method reduces Maximum Mean Discrepancy (MMD) by 99.1% ($1.729 \to 0.015$) and degrades domain classifier accuracy to near-chance (59.7%). Ablation confirms that global attention is statistically essential (Cohen’s $d = 1.32$, $p < 0.001$) for the harder heterogeneous-to-homogeneous translation direction. By bridging 2D efficiency and 3D consistency, our framework yields voxel-level harmonized images that preserve tumor pathophysiology, enabling reproducible multi-center radiomic analysis.
[109] Look Twice: Training-Free Evidence Highlighting in Multimodal Large Language Models
Marco Morini, Sara Sarto, Marcella Cornia, Lorenzo Baraldi
Main category: cs.CV
TL;DR: Look Twice (LoT) is a training-free inference-time framework that improves MLLM performance on knowledge-based VQA by using attention patterns to highlight relevant visual regions and retrieved textual evidence, then generating answers conditioned on this highlighted evidence.
Details
Motivation: MLLMs struggle to identify the most relevant visual and textual evidence when answering knowledge-intensive queries, especially when dealing with noisy or partially relevant retrieved text and fine-grained visual information localization.Method: LoT uses model attention patterns to estimate relevant visual regions and retrieved textual elements, then highlights these cues through lightweight prompt-level markers that encourage the model to re-attend to relevant evidence during generation, all without additional training.
Result: Experiments across multiple knowledge-based VQA benchmarks show consistent improvements over zero-shot MLLMs. Visual evidence highlighting alone improves performance in vision-centric settings without textual context.
Conclusion: LoT demonstrates that inference-time evidence highlighting can significantly improve MLLM performance on knowledge-intensive multimodal tasks without requiring additional training or architectural changes.
Abstract: Answering questions about images often requires combining visual understanding with external knowledge. Multimodal Large Language Models (MLLMs) provide a natural framework for this setting, but they often struggle to identify the most relevant visual and textual evidence when answering knowledge-intensive queries. In such scenarios, models must integrate visual cues with retrieved textual evidence that is often noisy or only partially relevant, while also localizing fine-grained visual information in the image. In this work, we introduce Look Twice (LoT), a training-free inference-time framework that improves how pretrained MLLMs utilize multimodal evidence. Specifically, we exploit the model attention patterns to estimate which visual regions and retrieved textual elements are relevant to a query, and then generate the answer conditioned on this highlighted evidence. The selected cues are highlighted through lightweight prompt-level markers that encourage the model to re-attend to the relevant evidence during generation. Experiments across multiple knowledge-based VQA benchmarks show consistent improvements over zero-shot MLLMs. Additional evaluations on vision-centric and hallucination-oriented benchmarks further demonstrate that visual evidence highlighting alone improves model performance in settings without textual context, all without additional training or architectural modifications. Source code will be publicly released.
[110] Can Video Diffusion Models Predict Past Frames? Bidirectional Cycle Consistency for Reversible Interpolation
Lingyu Liu, Yaxiong Wang, Li Zhu, Zhedong Zheng
Main category: cs.CV
TL;DR: A bidirectional framework for video frame interpolation that enforces temporal cycle-consistency through forward-backward symmetry, using learnable directional tokens and curriculum learning to improve motion consistency without extra inference cost.
Details
Motivation: Current video frame interpolation methods suffer from motion drift, directional ambiguity, and boundary misalignment due to unidirectional generation without self-verification mechanisms, especially problematic for long-range sequences.Method: Proposes a bidirectional framework with learnable directional tokens that condition a shared backbone on temporal orientation, enabling joint optimization of forward synthesis and backward reconstruction. Uses curriculum learning from short to long sequences and applies cyclic constraints only during training.
Result: Achieves state-of-the-art performance in imaging quality, motion smoothness, and dynamic control on both 37-frame and 73-frame tasks, outperforming strong baselines with no additional computational overhead during inference.
Conclusion: The cycle-consistent bidirectional framework effectively addresses motion consistency issues in video frame interpolation, providing a powerful regularization mechanism that improves long-range temporal coherence while maintaining inference efficiency.
Abstract: Video frame interpolation aims to synthesize realistic intermediate frames between given endpoints while adhering to specific motion semantics. While recent generative models have improved visual fidelity, they predominantly operate in a unidirectional manner, lacking mechanisms to self-verify temporal consistency. This often leads to motion drift, directional ambiguity, and boundary misalignment, especially in long-range sequences. Inspired by the principle of temporal cycle-consistency in self-supervised learning, we propose a novel bidirectional framework that enforces symmetry between forward and backward generation trajectories. Our approach introduces learnable directional tokens to explicitly condition a shared backbone on temporal orientation, enabling the model to jointly optimize forward synthesis and backward reconstruction within a single unified architecture. This cycle-consistent supervision acts as a powerful regularizer, ensuring that generated motion paths are logically reversible. Furthermore, we employ a curriculum learning strategy that progressively trains the model from short to long sequences, stabilizing dynamics across varying durations. Crucially, our cyclic constraints are applied only during training; inference requires a single forward pass, maintaining the high efficiency of the base model. Extensive experiments show that our method achieves state-of-the-art performance in imaging quality, motion smoothness, and dynamic control on both 37-frame and 73-frame tasks, outperforming strong baselines while incurring no additional computational overhead.
[111] Sparse Spectral LoRA: Routed Experts for Medical VLMs
Omid Nejati Manzari, Hojat Asgariandehkordi, Taha Koleilat, Yiming Xiao, Hassan Rivaz
Main category: cs.CV
TL;DR: MedQwen: A parameter-efficient medical vision-language model using spectrally routed Mixture-of-Experts with SVD initialization and residual compensation for robust medical imaging tasks while minimizing catastrophic forgetting.
Details
Motivation: Large VLMs perform well on general benchmarks but lack robustness in medical imaging due to cross-dataset interference and sensitivity to data regimes. Clinical workflows involve sequential data/task arrival, making naive continual training prone to catastrophic forgetting.Method: Proposes MedQwen with spectrally routed Mixture-of-Experts (MoE) and theoretical scaling rule aligning low-rank updates with full-rank fine-tuned MoE. Initializes experts from non-overlapping SVD segments of pretrained weights, adds residual compensation and scaling scheme for stable expert specialization and consistent routing under distribution shift.
Result: Achieves strong performance across 23 medical datasets covering VQA, report generation, radiology classification, and hallucination mitigation. Approaches full fine-tuning on zero-shot classification with 339× fewer parameters, reduces sequential forgetting to ~5% where baselines degrade by 20-50%.
Conclusion: MedQwen provides parameter-efficient, robust medical VLM that addresses cross-dataset interference and catastrophic forgetting in clinical workflows through innovative MoE architecture with spectral routing and SVD-based expert initialization.
Abstract: Large vision-language models (VLMs) excel on general benchmarks but often lack robustness in medical imaging, where heterogeneous supervision induces cross-dataset interference and sensitivity to data regime (i.e., how the supervisory signals are mixed). In realistic clinical workflows, data and tasks arrive sequentially, so naive continual training further leads to catastrophic forgetting. To address these challenges, we propose MedQwen, a parameter-efficient medical VLM that couples a spectrally routed Mixture-of-Experts (MoE) with a theoretically grounded scaling rule that aligns low-rank updates with a full-rank, fully fine-tuned MoE, without changing the base architecture. Concretely, we initialize each expert from non-overlapping singular value decomposition (SVD) segments of the pretrained weight and introduce a residual compensation and scaling scheme to enable stable expert specialization and consistent routing under distribution shift. Across 23 medical datasets covering visual question answering, report generation, radiology classification, and hallucination mitigation, MedQwen achieves strong, reliable performance: it approaches full fine-tuning on zero-shot classification with 339$\times$ fewer trainable parameters, and reduces sequential forgetting to $\sim$5% where strong baselines degrade by $>$20-50%.
[112] AffordTissue: Dense Affordance Prediction for Tool-Action Specific Tissue Interaction
Aiza Maksutova, Lalithkumar Seenivasan, Hao Ding, Jiru Xu, Chenhao Yu, Chenyan Jing, Yiqing Shen, Mathias Unberath
Main category: cs.CV
TL;DR: AffordTissue: A multimodal framework for predicting tool-action specific tissue affordance regions as dense heatmaps during cholecystectomy surgery, combining temporal vision, language conditioning, and diffusion-based decoding.
Details
Motivation: Current surgical automation methods lack predictability on where instruments will interact with tissue and lack explicit conditioning for tool-action-specific safe interaction regions, making clinical deployment challenging.Method: Combines temporal vision encoder for tool motion and tissue dynamics across multiple viewpoints, language conditioning for generalization across diverse instrument-action pairs, and DiT-style decoder for dense affordance prediction.
Result: Substantial improvement over vision-language model baselines (20.6 px ASSD vs. 60.2 px for Molmo-VLM), showing task-specific architecture outperforms large-scale foundation models for dense surgical affordance prediction.
Conclusion: AffordTissue provides explicit spatial reasoning for safe surgical automation, enabling policy guidance toward appropriate tissue regions and early safe stop when instruments deviate outside predicted safe zones.
Abstract: Surgical action automation has progressed rapidly toward achieving surgeon-like dexterous control, driven primarily by advances in learning from demonstration and vision-language-action models. While these have demonstrated success in table-top experiments, translating them to clinical deployment remains challenging: current methods offer limited predictability on where instruments will interact on tissue surfaces and lack explicit conditioning inputs to enforce tool-action-specific safe interaction regions. Addressing this gap, we introduce AffordTissue, a multimodal framework for predicting tool-action specific tissue affordance regions as dense heatmaps during cholecystectomy. Our approach combines a temporal vision encoder capturing tool motion and tissue dynamics across multiple viewpoints, language conditioning enabling generalization across diverse instrument-action pairs, and a DiT-style decoder for dense affordance prediction. We establish the first tissue affordance benchmark by curating and annotating 15,638 video clips across 103 cholecystectomy procedures, covering six unique tool-action pairs involving four instruments (hook, grasper, scissors, clipper) and their associated tasks: dissection, grasping, clipping, and cutting. Experiments demonstrate substantial improvement over vision-language model baselines (20.6 px ASSD vs. 60.2 px for Molmo-VLM), showing that our task-specific architecture outperforms large-scale foundation models for dense surgical affordance prediction. By predicting tool-action specific tissue affordance regions, AffordTissue provides explicit spatial reasoning for safe surgical automation, potentially unlocking explicit policy guidance toward appropriate tissue regions and early safe stop when instruments deviate outside predicted safe zones.
[113] ViTs for Action Classification in Videos: An Approach to Risky Tackle Detection in American Football Practice Videos
Syed Ahsan Masud Zaidi, William Hsu, Scott Dietrich
Main category: cs.CV
TL;DR: Vision transformer-based method for detecting risky tackles in American football videos using an expanded dataset with imbalance-aware training, achieving improved recall for safety-critical events.
Details
Motivation: Early identification of hazardous actions in contact sports enables timely intervention and improves player safety. The need for automated detection of risky tackles in American football practice videos to prevent injuries.Method: Vision transformer-based model with imbalance-aware training on an expanded dataset of 733 single-athlete-dummy tackle clips, each temporally localized around first point contact and labeled with standardized Assessment for Tackling Technique (SATT-3) components.
Result: Achieved risky recall of 0.67 and Risky F1 of 0.59 under cross-validation, improving risky recall by more than 8% points compared to previous baseline (risky recall 0.58; Risky F1 0.56) on a much larger dataset.
Conclusion: Vision transformer-based video analysis with careful handling of class imbalance can reliably detect rare but safety-critical tackling patterns, offering a practical pathway toward coach-centered injury prevention tools.
Abstract: Early identification of hazardous actions in contact sports enables timely intervention and improves player safety. We present a method for detecting risky tackles in American football practice videos and introduce a substantially expanded dataset for this task. Our work contains 733 single-athlete-dummy tackle clips, each temporally localized around first point contact and labeled with a strike zone component of the standardized Assessment for Tackling Technique (SATT-3), extending prior work that reported 178 annotated videos. Using a Vision transformer-based model with imbalance-aware training, we obtain risky recall of 0.67 and Risky F1 of 0.59 under crossvalidation. Relative to the previous baseline in a smaller subset (risky recall of 0.58; Risky F1 0.56 ), our approach improves risky recall by more than 8% points on a much larger dataset. These results indicate that the vision transformer-based video analysis, coupled with careful handling of class imbalance, can reliably detect rare but safety-critical tackling patterns, offering a practical pathway toward coach-centered injury prevention tools.
[114] Human Pose Estimation in Trampoline Gymnastics: Improving Performance Using a New Synthetic Dataset
Léa Drolet-Roy, Victor Nogues, Sylvain Gaudet, Eve Charbonneau, Mickaël Begon, Lama Séoud
Main category: cs.CV
TL;DR: Fine-tuning ViTPose on synthetic trampoline poses from motion capture data improves 2D/3D pose estimation for extreme human poses and uncommon viewpoints in trampoline gymnastics.
Details
Motivation: State-of-the-art pose estimation models underperform on trampoline gymnastics due to extreme human poses and uncommon viewpoints. There's a need to address this performance gap for challenging real-world applications.Method: Created synthetic trampoline poses (STP) dataset from motion capture recordings. Developed pipeline to fit noisy MoCap data to parametric human model and generate multiview realistic images. Fine-tuned ViTPose model on this synthetic data.
Result: Achieved state-of-the-art 2D results on challenging trampoline data, bridging performance gap between common and extreme poses. Reduced 3D MPJPE by 12.5 mm (19.6% improvement) compared to pretrained ViTPose model.
Conclusion: Synthetic data generation from motion capture recordings effectively addresses pose estimation challenges in extreme scenarios like trampoline gymnastics, enabling improved 2D and 3D pose estimation performance.
Abstract: Trampoline gymnastics involves extreme human poses and uncommon viewpoints, on which state-of-the art pose estimation models tend to under-perform. We demonstrate that this problem can be addressed by fine-tuning a pose estimation model on a dataset of synthetic trampoline poses (STP). STP is generated from motion capture recordings of trampoline routines. We develop a pipeline to fit noisy motion capture data to a parametric human model, then generate multiview realistic images. We use this data to fine-tune a ViTPose model, and test it on real multi-view trampoline images. The resulting model exhibits accuracy improvements in 2D which translates to improved 3D triangulation. In 2D, we obtain state-of-the-art results on such challenging data, bridging the performance gap between common and extreme poses. In 3D, we reduce the MPJPE by 12.5 mm with our best model, which represents an improvement of 19.6% compared to the pretrained ViTPose model.
[115] Regularizing Attention Scores with Bootstrapping
Neo Christopher Chung, Maxim Laletin
Main category: cs.CV
TL;DR: Proposes bootstrapping-based attention regularization for Vision Transformers to quantify uncertainty and obtain sparse, interpretable attention maps by removing spurious attention from noise.
Details
Motivation: Attention scores in Vision Transformers are almost always non-zero, resulting in noisy and diffused attention maps that limit interpretability. There's a need to quantify uncertainty measures of attention scores and obtain regularized attention scores for better explanation of ViT decision-making.Method: Introduces bootstrapping for attention scores by resampling input features to generate a baseline distribution of attention scores. This bootstrap distribution is used to estimate significances and posterior probabilities of attention scores, removing spurious attention arising from noise.
Result: The Attention Regularization approach demonstrates straightforward removal of spurious attention from noise, drastically improving shrinkage and sparsity of attention maps. Quantitative evaluations on both simulation and real-world datasets show effectiveness.
Conclusion: Bootstrapping serves as a practical regularization tool when using attention scores as explanations for Vision Transformers, improving interpretability by providing sparse, meaningful attention maps.
Abstract: Vision transformers (ViT) rely on attention mechanism to weigh input features, and therefore attention scores have naturally been considered as explanations for its decision-making process. However, attention scores are almost always non-zero, resulting in noisy and diffused attention maps and limiting interpretability. Can we quantify uncertainty measures of attention scores and obtain regularized attention scores? To this end, we consider attention scores of ViT in a statistical framework where independent noise would lead to insignificant yet non-zero scores. Leveraging statistical learning techniques, we introduce the bootstrapping for attention scores which generates a baseline distribution of attention scores by resampling input features. Such a bootstrap distribution is then used to estimate significances and posterior probabilities of attention scores. In natural and medical images, the proposed \emph{Attention Regularization} approach demonstrates a straightforward removal of spurious attention arising from noise, drastically improving shrinkage and sparsity. Quantitative evaluations are conducted using both simulation and real-world datasets. Our study highlights bootstrapping as a practical regularization tool when using attention scores as explanations for ViT. Code available: https://github.com/ncchung/AttentionRegularization
[116] InvZW: Invariant Feature Learning via Noise-Adversarial Training for Robust Image Zero-Watermarking
Abdullah All Tanvir, Frank Y. Shih, Xin Zhong
Main category: cs.CV
TL;DR: A deep learning framework for robust image zero-watermarking that learns distortion-invariant features through noise-adversarial learning and projects them onto trainable reference codes to embed binary messages without altering the original image.
Details
Motivation: Traditional watermarking methods alter original images, which can degrade quality. Zero-watermarking leaves images unaltered but needs robust features that are invariant to distortions while remaining semantically expressive for reliable watermark recovery.Method: Two-module framework: 1) Feature extractor trained via noise-adversarial learning with adversarial supervision against distortion discriminator and reconstruction constraint. 2) Learning-based multibit zero-watermarking where invariant features are projected onto trainable reference codes optimized to match target binary messages.
Result: Achieves state-of-the-art robustness in feature stability and watermark recovery across diverse image datasets and wide range of distortions. Superior generalization and robustness compared to existing self-supervised and deep watermarking techniques.
Conclusion: The proposed framework effectively combines distortion-invariant feature learning with learning-based zero-watermarking to create a robust system that preserves image quality while maintaining high watermark recovery accuracy under various distortions.
Abstract: This paper introduces a novel deep learning framework for robust image zero-watermarking based on distortion-invariant feature learning. As a zero-watermarking scheme, our method leaves the original image unaltered and learns a reference signature through optimization in the feature space. The proposed framework consists of two key modules. In the first module, a feature extractor is trained via noise-adversarial learning to generate representations that are both invariant to distortions and semantically expressive. This is achieved by combining adversarial supervision against a distortion discriminator and a reconstruction constraint to retain image content. In the second module, we design a learning-based multibit zero-watermarking scheme where the trained invariant features are projected onto a set of trainable reference codes optimized to match a target binary message. Extensive experiments on diverse image datasets and a wide range of distortions show that our method achieves state-of-the-art robustness in both feature stability and watermark recovery. Comparative evaluations against existing self-supervised and deep watermarking techniques further highlight the superiority of our framework in generalization and robustness.
[117] Perceptual misalignment of texture representations in convolutional neural networks
Ludovica de Paolis, Fabio Anselmi, Alessio Ansuini, Eugenio Piasini
Main category: cs.CV
TL;DR: CNNs’ texture representations based on Gram matrices don’t correlate with their Brain-Score alignment, suggesting texture perception involves distinct mechanisms beyond standard object recognition CNNs.
Details
Motivation: To investigate whether CNN texture representations (based on Gram matrices of feature correlations) align with human texture perception, and whether CNNs that better model the visual system also have more human-like texture representations.Method: Compare perceptual content captured by feature correlations (Gram matrices) across diverse CNNs, and correlate with Brain-Score measures of alignment with mammalian visual system.
Result: No connection found between conventional measures of CNN quality as visual system models (Brain-Score) and alignment with human texture perception.
Conclusion: Texture perception involves mechanisms distinct from those commonly modeled by CNNs trained on object recognition, possibly requiring integration of contextual information.
Abstract: Mathematical modeling of visual textures traces back to Julesz’s intuition that texture perception in humans is based on local correlations between image features. An influential approach for texture analysis and generation generalizes this notion to linear correlations between the nonlinear features computed by convolutional neural networks (CNNs), compiled into Gram matrices. Given that CNNs are often used as models for the visual system, it is natural to ask whether such “texture representations” spontaneously align with the textures’ perceptual content, and in particular whether those CNNs that are regarded as better models for the visual system also possess more human-like texture representations. Here we compare the perceptual content captured by feature correlations computed for a diverse pool of CNNs, and we compare it to the models’ perceptual alignment with the mammalian visual system as measured by Brain-Score. Surprisingly, we find that there is no connection between conventional measures of CNN quality as a model of the visual system and its alignment with human texture perception. We conclude that texture perception involves mechanisms that are distinct from those that are commonly modeled using approaches based on CNNs trained on object recognition, possibly depending on the integration of contextual information.
[118] A Novel FACS-Aligned Anatomical Text Description Paradigm for Fine-Grained Facial Behavior Synthesis
Jiahe Wang, Cong Liang, Xuandong Huang, Yuxin Wang, Xin Yun, Yi Wu, Yanan Chang, Shangfei Wang
Main category: cs.CV
TL;DR: A novel paradigm for anatomically grounded facial behavior synthesis using FACS-based AU descriptions with natural language control, addressing limitations of existing emotion category and one-hot AU approaches.
Details
Motivation: Existing facial behavior synthesis methods using coarse emotion labels or one-hot AU vectors fail to render fine-grained facial behaviors and produce anatomically implausible artifacts from conflicting AUs. There's a need for anatomically grounded synthesis that respects FACS muscle movement rules.Method: Proposes anatomically grounded facial behavior synthesis from FACS-based AU descriptions using: 1) dynamic AU text processor converting raw AU annotations to anatomically consistent natural language descriptions, 2) BP4D-AUText dataset (302K text-image pairs), 3) AAAD metric for semantic consistency evaluation, and 4) VQ-AUFace framework with anatomical priors and progressive cross-modal alignment.
Result: The paradigm significantly outperforms state-of-the-art methods, especially in challenging conflicting AU scenarios, achieving superior anatomical fidelity, semantic consistency, and visual quality through extensive quantitative experiments and user studies.
Conclusion: The proposed anatomically grounded paradigm with natural language control effectively addresses limitations of existing facial behavior synthesis methods, enabling fine-grained, anatomically plausible facial expression generation with better handling of AU conflicts.
Abstract: Facial behavior constitutes the primary medium of human nonverbal communication. Existing synthesis methods predominantly follow two paradigms: coarse emotion category labels or one-hot Action Unit (AU) vectors from the Facial Action Coding System (FACS). Neither paradigm reliably renders fine-grained facial behaviors nor resolves anatomically implausible artifacts caused by conflicting AUs. Therefore, we propose a novel task paradigm: anatomically grounded facial behavior synthesis from FACS-based AU descriptions. This paradigm explicitly encodes FACS-defined muscle movement rules, inter-AU interactions, and conflict resolution mechanisms into natural language control signals. To enable systematic research, we develop a dynamic AU text processor, a FACS rule-based module that converts raw AU annotations into anatomically consistent natural language descriptions. Using this processor, we construct BP4D-AUText, the first large-scale text-image paired dataset for fine-grained facial behavior synthesis, comprising over 302K high-quality samples. Given that existing general semantic consistency metrics cannot capture the alignment between anatomical facial descriptions and synthesized muscle movements, we propose the Alignment Accuracy of AU Probability Distributions (AAAD), a task-specific metric that quantifies semantic consistency. Finally, we design VQ-AUFace, a robust baseline framework incorporating anatomical priors and progressive cross-modal alignment, to validate the paradigm. Extensive quantitative experiments and user studies demonstrate the paradigm significantly outperforms state-of-the-art methods, particularly in challenging conflicting AU scenarios, achieving superior anatomical fidelity, semantic consistency, and visual quality.
[119] IGLOSS: Image Generation for Lidar Open-vocabulary Semantic Segmentation
Nermin Samet, Gilles Puy, Renaud Marlet
Main category: cs.CV
TL;DR: A novel zero-shot open-vocabulary semantic segmentation method for 3D lidar data that uses image generation from text to create prototype images, circumventing the image-text modality gap in VLMs like CLIP.
Details
Motivation: To address the image-text modality gap in Vision Language Models (VLMs) for 3D semantic segmentation, and enable zero-shot open-vocabulary segmentation of automotive lidar data without relying on direct image-text matching.Method: Uses image generation from text to create prototype images, then leverages a 3D network distilled from a 2D Vision Foundation Model to label point clouds by matching 3D point features with 2D image features of these prototypes.
Result: Achieves state-of-the-art performance for open-vocabulary semantic segmentation on nuScenes and SemanticKITTI datasets.
Conclusion: The proposed method effectively bridges the modality gap for 3D semantic segmentation by using generated images as intermediaries, demonstrating superior zero-shot open-vocabulary segmentation capabilities for automotive lidar data.
Abstract: This paper presents a new method for the zero-shot open-vocabulary semantic segmentation (OVSS) of 3D automotive lidar data. To circumvent the recognized image-text modality gap that is intrinsic to approaches based on Vision Language Models (VLMs) such as CLIP, our method relies instead on image generation from text, to create prototype images. Given a 3D network distilled from a 2D Vision Foundation Model (VFM), we then label a point cloud by matching 3D point features with 2D image features of these prototypes. Our method is state-of-the-art for OVSS on nuScenes and SemanticKITTI. Code, pre-trained models, and generated images are available at https://github.com/valeoai/IGLOSS.
[120] Seamless High-Resolution Terrain Reconstruction: A Prior-Based Vision Transformer Approach
Osher Rafaeli, Tal Svoray, Ariel Nahlieli
Main category: cs.CV
TL;DR: Single-view RGB imagery enables high-resolution DEM reconstruction at 30cm (100x improvement over SRTM) using monocular depth estimation models fine-tuned with low-resolution SRTM priors, achieving near-LiDAR accuracy for hydrological and hazard applications.
Details
Motivation: Globally consistent, fine-scale Digital Elevation Models (DEMs) are unavailable despite being essential for hydrological modeling, hazard assessment, and environmental monitoring. Existing global DEMs like SRTM have only 30m resolution, insufficient for detailed terrain analysis.Method: Extends monocular depth foundation (MDE) models to remote sensing height domain. Fine-tunes model by integrating low-resolution SRTM data as global prior with high-resolution RGB imagery from NAIP. Combines single-view imagery with prior-based depth estimation for elevation reconstruction.
Result: Achieves 100x resolution enhancement (30m to 30cm), exceeding existing super-resolution approaches by order of magnitude. Mean absolute error <5m relative to LiDAR, improving upon SRTM by up to 18%. Generalizes robustly across diverse landscapes and enables improved hydrological analysis.
Conclusion: Single-view-based DEM reconstruction enables practical, scalable high-resolution elevation mapping for GIS applications, supporting improved hazard assessment and environmental monitoring with near-LiDAR accuracy at dramatically higher resolution than global datasets.
Abstract: High-resolution elevation data is essential for hydrological modeling, hazard assessment, and environmental monitoring; however, globally consistent, fine-scale Digital Elevation Models (DEMs) remain unavailable. Very high-resolution single-view imagery enables the extraction of topographic information at the pixel level, allowing the reconstruction of fine terrain details over large spatial extents. In this paper, we present single-view-based DEM reconstruction shown to support practical analysis in GIS environments across multiple sub-national jurisdictions. Specifically, we produce high-resolution DEMs for large-scale basins, representing a substantial improvement over the 30 m resolution of globally available Shuttle Radar Topography Mission (SRTM) data. The DEMs are generated using a prior-based monocular depth foundation (MDE) model, extended in this work to the remote sensing height domain for high-resolution, globally consistent elevation reconstruction. We fine-tune the model by integrating low-resolution SRTM data as a global prior with high-resolution RGB imagery from the National Agriculture Imagery Program (NAIP), producing DEMs with near LiDAR-level accuracy. Our method achieves a 100x resolution enhancement (from 30 m to 30 cm), exceeding existing super-resolution approaches by an order of magnitude. Across two diverse landscapes, the model generalizes robustly, resolving fine-scale terrain features with a mean absolute error of less than 5 m relative to LiDAR and improving upon SRTM by up to 18 %. Hydrological analyses at both catchment and hillslope scales confirm the method’s utility for hazard assessment and environmental monitoring, demonstrating improved streamflow representation and catchment delineation. Finally, we demonstrate the scalability of the framework by applying it across large geographic regions.
[121] GRAZE: Grounded Refinement and Motion-Aware Zero-Shot Event Localization
Syed Ahsan Masud Zaidi, Lior Shamir, William Hsu, Scott Dietrich, Talha Zaidi
Main category: cs.CV
TL;DR: GRAZE is a training-free pipeline for localizing First Point of Contact (FPOC) in American football practice videos, using object detection, motion-aware temporal reasoning, and pixel-level contact verification without labeled tackle-contact examples.
Details
Motivation: American football practice generates long, untrimmed videos where the key interaction (tackle contact) occupies only a brief window. Reliable biomechanical analysis requires accurate spatiotemporal localization of both interacting entities and contact onset in challenging real-world footage with camera motion, clutter, multiple players, and rapid pose changes.Method: GRAZE uses Grounding DINO for candidate player-dummy interaction discovery, refines them with motion-aware temporal reasoning, and employs SAM2 as an explicit pixel-level contact verifier rather than relying on detection confidence alone. This separation between candidate discovery and contact confirmation makes it robust to cluttered scenes and unstable grounding near impact.
Result: On 738 tackle-practice videos, GRAZE produces valid outputs for 97.4% of clips and localizes FPOC within ±10 frames on 77.5% of all clips and within ±20 frames on 82.7% of all clips.
Conclusion: Frame-accurate contact onset localization in real-world practice footage is feasible without task-specific training, demonstrating the effectiveness of combining object detection, temporal reasoning, and pixel-level verification for spatiotemporal localization in challenging video analysis tasks.
Abstract: American football practice generates video at scale, yet the interaction of interest occupies only a brief window of each long, untrimmed clip. Reliable biomechanical analysis, therefore, depends on spatiotemporal localization that identifies both the interacting entities and the onset of contact. We study First Point of Contact (FPOC), defined as the first frame in which a player physically touches a tackle dummy, in unconstrained practice footage with camera motion, clutter, multiple similarly equipped athletes, and rapid pose changes around impact. We present GRAZE, a training-free pipeline for FPOC localization that requires no labeled tackle-contact examples. GRAZE uses Grounding DINO to discover candidate player-dummy interactions, refines them with motion-aware temporal reasoning, and uses SAM2 as an explicit pixel-level verifier of contact rather than relying on detection confidence alone. This separation between candidate discovery and contact confirmation makes the approach robust to cluttered scenes and unstable grounding near impact. On 738 tackle-practice videos, GRAZE produces valid outputs for 97.4% of clips and localizes FPOC within $\pm$ 10 frames on 77.5% of all clips and within $\pm$ 20 frames on 82.7% of all clips. These results show that frame-accurate contact onset localization in real-world practice footage is feasible without task-specific training.
[122] HERBench: A Benchmark for Multi-Evidence Integration in Video Question Answering
Dan Ben-Ami, Gabriele Serussi, Kobi Cohen, Chaim Baskin
Main category: cs.CV
TL;DR: HERBench is a new VideoQA benchmark requiring integration of at least three non-overlapping cues from distinct video segments, designed to avoid single-cue shortcuts and test robust multi-evidence reasoning.
Details
Motivation: Current Video-LLMs are improving but VideoQA benchmarks often allow single-cue shortcuts, under-testing the ability to integrate evidence across time. There's a need for benchmarks that make multi-evidence integration unavoidable to properly evaluate video understanding capabilities.Method: Introduces HERBench with 26,806 five-way multiple-choice questions across 12 compositional tasks, each requiring at least three non-overlapping cues from distinct video segments. Proposes Minimum Required Frame-Set (MRFS) metric to measure evidential demand. Evaluates 13 state-of-the-art Video-LLMs on this benchmark.
Result: Video-LLMs achieve only 31-42% accuracy on HERBench, modestly above the 20% random-guess baseline. Analysis reveals two critical bottlenecks: (1) retrieval deficit where frame selectors miss key evidence, and (2) fusion deficit where models fail to integrate information even when all evidence is provided.
Conclusion: HERBench provides a principled benchmark for studying robust multi-evidence video understanding, revealing significant limitations in current Video-LLMs’ ability to integrate information across time and highlighting the need for improved retrieval and fusion mechanisms.
Abstract: Video Large Language Models (Video-LLMs) are improving rapidly, yet current Video Question Answering (VideoQA) benchmarks often admit single-cue shortcuts, under-testing reasoning that must integrate evidence across time. We introduce HERBench, a benchmark designed to make multi-evidence integration unavoidable: each question requires at least three non-overlapping cues drawn from distinct video segments. HERBench contains 26,806 five-way multiple-choice questions across 12 compositional tasks. To make evidential demand measurable, we introduce the Minimum Required Frame-Set (MRFS), the smallest number of frames a model must fuse to answer correctly, and show that HERBench imposes higher evidential demand than prior benchmarks. Evaluating 13 state-of-the-art Video-LLMs yields only 31-42% accuracy, only modestly above the 20% random-guess baseline. We disentangle this failure into two critical bottlenecks: (1) a retrieval deficit, where frame selectors overlook key evidence, and (2) a fusion deficit, where models fail to integrate information even when all necessary evidence is provided. HERBench thus provides a principled benchmark for studying robust multi-evidence video understanding.
[123] LESV: Language Embedded Sparse Voxel Fusion for Open-Vocabulary 3D Scene Understanding
Fusang Wang, Nathan Piasco, Moussab Bennehar, Luis Roldão, Dzmitry Tsishkou, Fabien Moutarde
Main category: cs.CV
TL;DR: A novel framework using Sparse Voxel Rasterization instead of 3D Gaussian Splatting for open-vocabulary 3D scene understanding, addressing spatial and semantic ambiguity issues through structured geometry and foundation model alignment.
Details
Motivation: Current 3D Gaussian Splatting (3DGS) approaches for open-vocabulary 3D scene understanding suffer from spatial ambiguity due to unstructured, overlapping Gaussians requiring probabilistic feature registration, and multi-level semantic ambiguity from pooling features over object-level masks that dilutes fine-grained details.Method: Proposes Sparse Voxel Rasterization (SVRaster) as a structured, disjoint geometry representation, regularized with monocular depth and normal priors. Uses deterministic, confidence-aware feature registration and exploits dense alignment properties of foundation model AM-RADIO to resolve multi-level ambiguity without hierarchical training overhead.
Result: Achieves state-of-the-art performance on Open Vocabulary 3D Object Retrieval and Point Cloud Understanding benchmarks, particularly excelling on fine-grained queries where previous registration methods typically fail.
Conclusion: The structured geometry approach with SVRaster effectively addresses limitations of 3DGS, providing better feature registration and semantic understanding for open-vocabulary 3D scene analysis.
Abstract: Recent advancements in open-vocabulary 3D scene understanding heavily rely on 3D Gaussian Splatting (3DGS) to register vision-language features into 3D space. However, we identify two critical limitations in these approaches: the spatial ambiguity arising from unstructured, overlapping Gaussians which necessitates probabilistic feature registration, and the multi-level semantic ambiguity caused by pooling features over object-level masks, which dilutes fine-grained details. To address these challenges, we present a novel framework that leverages Sparse Voxel Rasterization (SVRaster) as a structured, disjoint geometry representation. By regularizing SVRaster with monocular depth and normal priors, we establish a stable geometric foundation. This enables a deterministic, confidence-aware feature registration process and suppresses the semantic bleeding artifact common in 3DGS. Furthermore, we resolve multi-level ambiguity by exploiting the emerging dense alignment properties of foundation model AM-RADIO, avoiding the computational overhead of hierarchical training methods. Our approach achieves state-of-the-art performance on Open Vocabulary 3D Object Retrieval and Point Cloud Understanding benchmarks, particularly excelling on fine-grained queries where registration methods typically fail.
[124] EgoFlow: Gradient-Guided Flow Matching for Egocentric 6DoF Object Motion Generation
Abhishek Saroha, Huajian Zeng, Xingxing Zuo, Daniel Cremers, Xi Wang
Main category: cs.CV
TL;DR: EgoFlow: A flow-matching framework for generating physically consistent 6DoF object trajectories from egocentric video using multimodal observations and physical constraints.
Details
Motivation: Understanding and predicting object motion from egocentric video is fundamental for embodied perception and interaction, but generating physically consistent 6DoF trajectories remains challenging due to occlusions, fast motion, and lack of explicit physical reasoning in existing generative models.Method: EgoFlow employs a hybrid Mamba-Transformer-Perceiver architecture to jointly model temporal dynamics, scene geometry, and semantic intent, with a gradient-guided inference process that enforces differentiable physical constraints like collision avoidance and motion smoothness.
Result: EgoFlow outperforms diffusion-based and transformer baselines on real-world datasets (HD-EPIC, EgoExo4D, HOT3D) in accuracy, generalization, and physical realism, reducing collision rates by up to 79% and showing strong generalization to unseen scenes.
Conclusion: The results highlight the promise of flow-based generative modeling for scalable and physically grounded egocentric motion understanding.
Abstract: Understanding and predicting object motion from egocentric video is fundamental to embodied perception and interaction. However, generating physically consistent 6DoF trajectories remains challenging due to occlusions, fast motion, and the lack of explicit physical reasoning in existing generative models. We present EgoFlow, a flow-matching framework that synthesizes realistic and physically plausible trajectories conditioned on multimodal egocentric observations. EgoFlow employs a hybrid Mamba-Transformer-Perceiver architecture to jointly model temporal dynamics, scene geometry, and semantic intent, while a gradient-guided inference process enforces differentiable physical constraints such as collision avoidance and motion smoothness. This combination yields coherent and controllable motion generation without post-hoc filtering or additional supervision. Experiments on real-world datasets HD-EPIC, EgoExo4D, and HOT3D show that EgoFlow outperforms diffusion-based and transformer baselines in accuracy, generalization, and physical realism, reducing collision rates by up to 79%, and strong generalization to unseen scenes. Our results highlight the promise of flow-based generative modeling for scalable and physically grounded egocentric motion understanding.
[125] Better Rigs, Not Bigger Networks: A Body Model Ablation for Gaussian Avatars
Derek Austin
Main category: cs.CV
TL;DR: 3D Gaussian splatting for human avatars achieves state-of-the-art results using Momentum Human Rig (MHR) instead of SMPL, showing that body model expressiveness is the primary bottleneck in avatar reconstruction.
Details
Motivation: Recent 3D Gaussian splatting methods using SMPL have become increasingly complex. The authors aim to demonstrate that much of this complexity is unnecessary by using a more expressive body model.Method: Replace SMPL with Momentum Human Rig (MHR) estimated via SAM-3D-Body. Use a minimal pipeline without learned deformations or pose-dependent corrections. Perform controlled ablations to disentangle pose estimation quality from body model representational capacity.
Result: Achieves highest reported PSNR and competitive/superior LPIPS and SSIM on PeopleSnapshot and ZJU-MoCap datasets. Ablations confirm body model expressiveness is the primary bottleneck in avatar reconstruction.
Conclusion: Body model representational capacity and pose estimation quality are key factors for high-quality avatar reconstruction, and simpler pipelines can achieve state-of-the-art results with better body models.
Abstract: Recent 3D Gaussian splatting methods built atop SMPL achieve remarkable visual fidelity while continually increasing the complexity of the overall training architecture. We demonstrate that much of this complexity is unnecessary: by replacing SMPL with the Momentum Human Rig (MHR), estimated via SAM-3D-Body, a minimal pipeline with no learned deformations or pose-dependent corrections achieves the highest reported PSNR and competitive or superior LPIPS and SSIM on PeopleSnapshot and ZJU-MoCap. To disentangle pose estimation quality from body model representational capacity, we perform two controlled ablations: translating SAM-3D-Body meshes to SMPL-X, and translating the original dataset’s SMPL poses into MHR both retrained under identical conditions. These ablations confirm that body model expressiveness has been a primary bottleneck in avatar reconstruction, with both mesh representational capacity and pose estimation quality contributing meaningfully to the full pipeline’s gains.
[126] Nonlinear Methods for Analyzing Pose in Behavioral Research
Carter Sale, Margaret C. Macpherson, Gaurav Patil, Kelly Miles, Rachel W. Kallen, Sebastian Wallot, Michael J. Richardson
Main category: cs.CV
TL;DR: A general-purpose analysis pipeline for human pose data that combines preprocessing, dimensionality reduction, and recurrence-based time series analysis to extract meaningful movement patterns from markerless pose estimation data.
Details
Motivation: Markerless pose estimation enables large-scale behavioral analysis from video, but the high-dimensional, noisy, and temporally complex pose data presents challenges for extracting meaningful coordination patterns and behavioral changes.Method: Develops a pipeline with principled preprocessing, dimensionality reduction, and recurrence-based time series analysis to quantify temporal structure of movement dynamics. The pipeline supports both linear and nonlinear characterizations across diverse contexts.
Result: Demonstrated through three case studies spanning facial/full-body movement, 2D/3D data, and individual/multi-agent behavior, showing how the same analytic workflow can extract theoretically meaningful insights from complex pose time series.
Conclusion: The pipeline provides a flexible, general-purpose approach for analyzing human pose data across diverse experimental contexts, enabling extraction of meaningful movement patterns from complex pose time series.
Abstract: Advances in markerless pose estimation have made it possible to capture detailed human movement in naturalistic settings using standard video, enabling new forms of behavioral analysis at scale. However, the high dimensionality, noise, and temporal complexity of pose data raise significant challenges for extracting meaningful patterns of coordination and behavioral change. This paper presents a general-purpose analysis pipeline for human pose data, designed to support both linear and nonlinear characterizations of movement across diverse experimental contexts. The pipeline combines principled preprocessing, dimensionality reduction, and recurrence-based time series analysis to quantify the temporal structure of movement dynamics. To illustrate the pipeline’s flexibility, we present three case studies spanning facial and full-body movement, 2D and 3D data, and individual versus multi-agent behavior. Together, these examples demonstrate how the same analytic workflow can be adapted to extract theoretically meaningful insights from complex pose time series.
[127] Reinforcing Consistency in Video MLLMs with Structured Rewards
Yihao Quan, Zeru Shi, Jinman Zhao, Ruixiang Tang
Main category: cs.CV
TL;DR: The paper proposes a compositional consistency audit and structured reward system to improve visual and temporal grounding in multimodal LLMs for video understanding, addressing hallucination issues through factual and temporal unit-based rewards.
Details
Motivation: Current MLLMs for video understanding often produce plausible outputs with poor visual and temporal grounding, fabricating objects, assigning incorrect attributes, or collapsing repeated events. Standard sentence-level supervision is a weak proxy for faithful video understanding, and RL with sentence-level rewards is too coarse to localize specific grounding failures.Method: 1) Compositional consistency audit that decomposes captions into factual and temporal claims to assess lower-level evidence; 2) Structured reward system with three components: instance-aware scene-graph reward for objects/attributes/relations, temporal reward for event ordering/repetition, and video-grounded VQA reward for hierarchical self-verification.
Result: The structured reward approach yields consistent gains across temporal, general video understanding, and hallucination-oriented benchmarks on open-source backbones, demonstrating improved grounding and reduced hallucinations.
Conclusion: Structured reward shaping is a practical route to more faithful video understanding in MLLMs, addressing fundamental grounding issues that standard sentence-level supervision fails to capture.
Abstract: Multimodal large language models (MLLMs) have achieved remarkable progress in video understanding. However, seemingly plausible outputs often suffer from poor visual and temporal grounding: a model may fabricate object existence, assign incorrect attributes, or collapse repeated events while still producing a globally reasonable caption or answer. We study this failure mode through a compositional consistency audit that decomposes a caption into supporting factual and temporal claims, investigating whether a correct high-level prediction is actually backed by valid lower-level evidence. Our top-down audit reveals that even correct root relational claims often lack reliable attribute and existence support. This indicates that standard sentence-level supervision is a weak proxy for faithful video understanding. Furthermore, when turning to reinforcement learning (RL) for better alignment, standard sentence-level rewards often prove too coarse to accurately localize specific grounding failures. To address this, we replace generic sentence-level rewards with a structured reward built from factual and temporal units. Our training objective integrates three complementary components: (1) an instance-aware scene-graph reward for factual objects, attributes, and relations; (2) a temporal reward for event ordering and repetition; and (3) a video-grounded VQA reward for hierarchical self-verification. Across temporal, general video understanding, and hallucination-oriented benchmarks, this objective yields consistent gains on open-source backbones. These results suggest that structured reward shaping is a practical route to more faithful video understanding.
[128] Prime Once, then Reprogram Locally: An Efficient Alternative to Black-Box Service Model Adaptation
Yunbei Zhang, Chengyi Cai, Feng Liu, Jihun Hamm
Main category: cs.CV
TL;DR: AReS: An efficient reprogramming approach for closed-box service models (APIs) that uses single-pass API interaction to prime a local encoder, then performs glass-box reprogramming on the local proxy to eliminate ongoing API costs.
Details
Motivation: Traditional zeroth-order optimization (ZOO) for adapting closed-box APIs is expensive (many API calls), slow, unstable, and struggles with modern APIs like GPT-4o that are less sensitive to input perturbations.Method: Two-stage approach: 1) Single-pass interaction with service API to prime a local pre-trained encoder (training only a lightweight layer), 2) Glass-box reprogramming performed directly on the local model. All subsequent adaptation/inference uses only the local proxy.
Result: On GPT-4o: +27.8% gain over zero-shot baseline (where ZOO methods provide little improvement). Across 10 diverse datasets: outperforms SOTA methods (+2.5% for VLMs, +15.6% for standard VMs) while reducing API calls by over 99.99%.
Conclusion: AReS provides a robust, practical solution for adapting modern closed-box models by eliminating ongoing API costs while achieving superior performance compared to traditional ZOO-based methods.
Abstract: Adapting closed-box service models (i.e., APIs) for target tasks typically relies on reprogramming via Zeroth-Order Optimization (ZOO). However, this standard strategy is known for extensive, costly API calls and often suffers from slow, unstable optimization. Furthermore, we observe that this paradigm faces new challenges with modern APIs (e.g., GPT-4o). These models can be less sensitive to the input perturbations ZOO relies on, thereby hindering performance gains. To address these limitations, we propose an Alternative efficient Reprogramming approach for Service models (AReS). Instead of direct, continuous closed-box optimization, AReS initiates a single-pass interaction with the service API to prime an amenable local pre-trained encoder. This priming stage trains only a lightweight layer on top of the local encoder, making it highly receptive to the subsequent glass-box (white-box) reprogramming stage performed directly on the local model. Consequently, all subsequent adaptation and inference rely solely on this local proxy, eliminating all further API costs. Experiments demonstrate AReS’s effectiveness where prior ZOO-based methods struggle: on GPT-4o, AReS achieves a +27.8% gain over the zero-shot baseline, a task where ZOO-based methods provide little to no improvement. Broadly, across ten diverse datasets, AReS outperforms state-of-the-art methods (+2.5% for VLMs, +15.6% for standard VMs) while reducing API calls by over 99.99%. AReS thus provides a robust and practical solution for adapting modern closed-box models.
[129] UniRecGen: Unifying Multi-View 3D Reconstruction and Generation
Zhisheng Huang, Jiahao Chen, Cheng Lin, Chenyu Hu, Hanzhuo Huang, Zhengming Yu, Mengfei Li, Yuheng Liu, Zekai Gu, Zibo Zhao, Yuan Liu, Xin Li, Wenping Wang
Main category: cs.CV
TL;DR: UniRecGen is a unified framework that integrates feed-forward reconstruction and diffusion-based generation for sparse-view 3D modeling, achieving better fidelity and plausibility through cooperative learning in shared canonical space.
Details
Motivation: Sparse-view 3D modeling faces a fundamental tension: feed-forward reconstruction is efficient and input-aligned but lacks global priors for structural completeness, while diffusion-based generation provides rich geometric details but struggles with multi-view consistency.Method: UniRecGen integrates both paradigms into a cooperative system by aligning them in a shared canonical space. It uses disentangled cooperative learning for stable training and seamless inference collaboration. The reconstruction module provides canonical geometric anchors, while the diffusion generator uses latent-augmented conditioning to refine and complete geometry.
Result: Experimental results show UniRecGen achieves superior fidelity and robustness, outperforming existing methods in creating complete and consistent 3D models from sparse observations.
Conclusion: The unified framework successfully bridges reconstruction and generation paradigms, demonstrating that cooperative integration in canonical space enables better sparse-view 3D modeling with both fidelity and plausibility.
Abstract: Sparse-view 3D modeling represents a fundamental tension between reconstruction fidelity and generative plausibility. While feed-forward reconstruction excels in efficiency and input alignment, it often lacks the global priors needed for structural completeness. Conversely, diffusion-based generation provides rich geometric details but struggles with multi-view consistency. We present UniRecGen, a unified framework that integrates these two paradigms into a single cooperative system. To overcome inherent conflicts in coordinate spaces, 3D representations, and training objectives, we align both models within a shared canonical space. We employ disentangled cooperative learning, which maintains stable training while enabling seamless collaboration during inference. Specifically, the reconstruction module is adapted to provide canonical geometric anchors, while the diffusion generator leverages latent-augmented conditioning to refine and complete the geometric structure. Experimental results demonstrate that UniRecGen achieves superior fidelity and robustness, outperforming existing methods in creating complete and consistent 3D models from sparse observations.
[130] Universal computational thermal imaging overcoming the ghosting effect
Hongyi Xu, Du Wang, Chenjun Zhao, Jiashuo Chen, Jiale Lin, Liqin Cao, Yanfei Zhong, Yiyuan She, Fanglin Bao
Main category: cs.CV
TL;DR: TAG is a universal computational thermal imaging framework that addresses material non-uniformity to overcome ghosting effects in night vision, enabling unprecedented facial expression recovery and 3D topological alignment.
Details
Motivation: Thermal imaging suffers from ghosting effects that lose detailed texture, especially in cluttered photon streams. While HADAR offers some improvement, it fails with material non-uniformity which is ubiquitous in real-world scenes. There's a need for universal anti-ghosting thermal imaging.Method: TAG (thermal anti-ghosting) framework uses hyperspectral photon streams for nonparametric texture recovery. It addresses material non-uniformity to overcome ghosting, enabling high-fidelity night vision across various scenes.
Result: TAG universally outperforms HADAR across various scenes, enables unprecedented facial expression recovery in ghostly human faces, reveals HADAR’s effectiveness boundary, demonstrates thermal 3D topological alignment and mood detection for the first time.
Conclusion: TAG establishes a universal foundation for high-fidelity computational night vision with applications in autonomous navigation, reconnaissance, healthcare, and wildlife monitoring.
Abstract: Thermal imaging is crucial for night vision but fundamentally hampered by the ghosting effect, a loss of detailed texture in cluttered photon streams. While conventional ghosting mitigation has relied on data post-processing, the recent breakthrough in heat-assisted detection and ranging (HADAR) opens a promising frontier for hyperspectral computational thermal imaging that produces night vision with day-like visibility. However, universal anti-ghosting imaging remains elusive, as state-of-the-art HADAR applies only to limited scenes with uniform materials, whereas material non-uniformity is ubiquitous in the real world. Here, we propose a universal computational thermal imaging framework, TAG (thermal anti-ghosting), to address material non-uniformity and overcome ghosting for high-fidelity night vision. TAG takes hyperspectral photon streams for nonparametric texture recovery, enabling our experimental demonstration of unprecedented expression recovery in thus-far-elusive ghostly human faces – the archetypal, long-recognized ghosting phenomenon. Strikingly, TAG not only universally outperforms HADAR across various scenes, but also reveals the influence of material non-uniformity, shedding light on HADAR’s effectiveness boundary. We extensively test facial texture and expression recovery across day and night, and demonstrate, for the first time, thermal 3D topological alignment and mood detection. This work establishes a universal foundation for high-fidelity computational night vision, with potential applications in autonomous navigation, reconnaissance, healthcare, and wildlife monitoring.
[131] Prototype-Based Low Altitude UAV Semantic Segmentation
Da Zhang, Gao Junyu, Zhao Zhiyuan
Main category: cs.CV
TL;DR: PBSeg is an efficient prototype-based semantic segmentation framework for UAV imagery that reduces computational overhead while maintaining accuracy through prototype-based cross-attention and multi-scale feature extraction.
Details
Motivation: Semantic segmentation of UAV imagery faces challenges with extreme scale variations, complex boundaries, and limited computational resources on edge devices. Existing transformer methods are computationally expensive, while lightweight approaches struggle with fine-grained details in aerial scenes.Method: Proposes PBSeg with prototype-based cross-attention (PBCA) to exploit feature redundancy and reduce complexity, combined with efficient multi-scale feature extraction using deformable convolutions and context-aware modulation to capture both local details and global semantics.
Result: Achieves 71.86% mIoU on UAVid and 80.92% mIoU on UDD6 datasets, establishing competitive performance while maintaining computational efficiency for UAV applications.
Conclusion: PBSeg provides an effective solution for UAV semantic segmentation that balances accuracy and efficiency, making it suitable for edge deployment with limited computational resources.
Abstract: Semantic segmentation of low-altitude UAV imagery presents unique challenges due to extreme scale variations, complex object boundaries, and limited computational resources on edge devices. Existing transformer-based segmentation methods achieve remarkable performance but incur high computational overhead, while lightweight approaches struggle to capture fine-grained details in high-resolution aerial scenes. To address these limitations, we propose PBSeg, an efficient prototype-based segmentation framework tailored for UAV applications. PBSeg introduces a novel prototype-based cross-attention (PBCA) that exploits feature redundancy to reduce computational complexity while maintaining segmentation quality. The framework incorporates an efficient multi-scale feature extraction module that combines deformable convolutions (DConv) with context-aware modulation (CAM) to capture both local details and global semantics. Experiments on two challenging UAV datasets demonstrate the effectiveness of the proposed approach. PBSeg achieves 71.86% mIoU on UAVid and 80.92% mIoU on UDD6, establishing competitive performance while maintaining computational efficiency. Code is available at https://github.com/zhangda1018/PBSeg.
[132] Cross-Domain Vessel Segmentation via Latent Similarity Mining and Iterative Co-Optimization
Zhanqiang Guo, Jianjiang Feng, Jie Zhou
Main category: cs.CV
TL;DR: A novel domain transfer framework for retinal vessel segmentation that uses latent vascular similarity and iterative co-optimization of generation and segmentation networks to handle domain shifts between training and testing data.
Details
Motivation: Retinal vessel segmentation is crucial for automated diagnosis but suffers from performance degradation due to domain shifts between training and testing data. Current CNN approaches struggle when there are significant modality discrepancies between source and target domains.Method: Proposes a domain transfer framework that: 1) Pre-trains generation networks for source and target domains, 2) Uses source-domain conditional diffusion model for deterministic inversion to create domain-agnostic prototypes, 3) Implements iterative refinement where segmentation and generative models undergo mutual optimization through cyclic parameter updating.
Result: The framework achieves state-of-the-art performance in cross-domain retinal vessel segmentation, particularly in challenging clinical scenarios with significant modality discrepancies.
Conclusion: The proposed approach effectively addresses domain shift problems in retinal vessel segmentation by leveraging latent vascular similarity and iterative co-optimization, demonstrating strong performance in cross-domain scenarios.
Abstract: Retinal vessel segmentation serves as a critical prerequisite for automated diagnosis of retinal pathologies. While recent advances in Convolutional Neural Networks (CNNs) have demonstrated promising performance in this task, significant performance degradation occurs when domain shifts exist between training and testing data. To address these limitations, we propose a novel domain transfer framework that leverages latent vascular similarity across domains and iterative co-optimization of generation and segmentation networks. Specifically, we first pre-train generation networks for source and target domains. Subsequently, the pretrained source-domain conditional diffusion model performs deterministic inversion to establish intermediate latent representations of vascular images, creating domain-agnostic prototypes for target synthesis. Finally, we develop an iterative refinement strategy where segmentation network and generative model undergo mutual optimization through cyclic parameter updating. This co-evolution process enables simultaneous enhancement of cross-domain image synthesis quality and segmentation accuracy. Experiments demonstrate that our framework achieves state-of-the-art performance in cross-domain retinal vessel segmentation, particularly in challenging clinical scenarios with significant modality discrepancies.
[133] ReFlow: Self-correction Motion Learning for Dynamic Scene Reconstruction
Yanzhe Liang, Ruijie Zhu, Hanzhi Chang, Zhuoyuan Li, Jiahao Lu, Tianzhu Zhang
Main category: cs.CV
TL;DR: ReFlow is a unified framework for monocular dynamic scene reconstruction that learns 3D motion through self-correction from raw video without external motion guidance.
Details
Motivation: Existing methods for dynamic scene reconstruction suffer from incomplete initialization of dynamic regions, leading to unstable reconstruction that often requires external dense motion guidance like optical flow, which introduces complexity and error propagation.Method: ReFlow integrates: 1) Complete Canonical Space Construction for enhanced initialization of static/dynamic regions, 2) Separation-Based Dynamic Scene Modeling that decouples static/dynamic components, and 3) a self-correction flow matching mechanism with Full Flow Matching (aligning 3D scene flow with 2D observations) and Camera Flow Matching (enforcing multi-view consistency for static objects).
Result: Extensive experiments across diverse scenarios demonstrate superior reconstruction quality and robustness compared to existing methods.
Conclusion: ReFlow establishes a novel self-correction paradigm for monocular 4D reconstruction that eliminates the need for external motion guidance while achieving robust and accurate dynamic scene reconstruction.
Abstract: We present ReFlow, a unified framework for monocular dynamic scene reconstruction that learns 3D motion in a novel self-correction manner from raw video. Existing methods often suffer from incomplete scene initialization for dynamic regions, leading to unstable reconstruction and motion estimation, which often resorts to external dense motion guidance such as pre-computed optical flow to further stabilize and constrain the reconstruction of dynamic components. However, this introduces additional complexity and potential error propagation. To address these issues, ReFlow integrates a Complete Canonical Space Construction module for enhanced initialization of both static and dynamic regions, and a Separation-Based Dynamic Scene Modeling module that decouples static and dynamic components for targeted motion supervision. The core of ReFlow is a novel self-correction flow matching mechanism, consisting of Full Flow Matching to align 3D scene flow with time-varying 2D observations, and Camera Flow Matching to enforce multi-view consistency for static objects. Together, these modules enable robust and accurate dynamic scene reconstruction. Extensive experiments across diverse scenarios demonstrate that ReFlow achieves superior reconstruction quality and robustness, establishing a novel self-correction paradigm for monocular 4D reconstruction.
[134] Harmonized Tabular-Image Fusion via Gradient-Aligned Alternating Learning
Longfei Huang, Yang Yang
Main category: cs.CV
TL;DR: GAAL is a gradient-aligned alternating learning paradigm for multimodal tabular-image fusion that addresses gradient conflicts between modalities through alternating unimodal learning and uncertainty-based cross-modal gradient surgery.
Details
Motivation: Existing multimodal tabular-image fusion methods suffer from gradient conflicts between modalities, which mislead the optimization of unimodal learners and hinder effective fusion performance.Method: Proposes Gradient-Aligned Alternating Learning (GAAL) with: 1) alternating unimodal learning with shared classifier to decouple multimodal gradients, and 2) uncertainty-based cross-modal gradient surgery to selectively align cross-modal gradients and steer shared parameters to benefit all modalities.
Result: Empirical experiments on widely used datasets show superiority over various state-of-the-art tabular-image fusion baselines and test-time tabular missing baselines.
Conclusion: GAAL effectively addresses gradient conflicts in multimodal fusion, provides unimodal assistance, and boosts overall fusion performance through gradient alignment.
Abstract: Multimodal tabular-image fusion is an emerging task that has received increasing attention in various domains. However, existing methods may be hindered by gradient conflicts between modalities, misleading the optimization of the unimodal learner. In this paper, we propose a novel Gradient-Aligned Alternating Learning (GAAL) paradigm to address this issue by aligning modality gradients. Specifically, GAAL adopts an alternating unimodal learning and shared classifier to decouple the multimodal gradient and facilitate interaction. Furthermore, we design uncertainty-based cross-modal gradient surgery to selectively align cross-modal gradients, thereby steering the shared parameters to benefit all modalities. As a result, GAAL can provide effective unimodal assistance and help boost the overall fusion performance. Empirical experiments on widely used datasets reveal the superiority of our method through comparison with various state-of-the-art (SoTA) tabular-image fusion baselines and test-time tabular missing baselines. The source code is available at https://github.com/njustkmg/ICME26-GAAL.
[135] Satellite-Free Training for Drone-View Geo-Localization
Tao Liu, Yingzhi Zhang, Kan Ren, Xiaoqi Zhao
Main category: cs.CV
TL;DR: Satellite-free training framework for drone-view geo-localization using multi-view UAV sequences, 3D reconstruction, and pseudo-orthophoto generation without satellite imagery during training.
Details
Motivation: Existing drone-view geo-localization methods rely on satellite imagery during training, limiting deployment in GPS-denied environments where satellite data is unavailable or restricted. The paper addresses this by proposing a satellite-free training approach.Method: Three-stage framework: 1) 3D scene reconstruction from multi-view drone images using 3D Gaussian splatting, 2) PCA-guided orthographic projection to generate pseudo-orthophotos, 3) Geometry-guided inpainting and DINOv3 feature extraction with Fisher vector aggregation trained solely on drone data.
Result: Outperforms satellite-free generalization baselines on University-1652 and SUES-200 datasets, narrowing the gap to methods trained with satellite imagery.
Conclusion: Proposes a practical satellite-free training framework for drone geo-localization that eliminates dependency on satellite imagery during training while maintaining competitive performance.
Abstract: Drone-view geo-localization (DVGL) aims to determine the location of drones in GPS-denied environments by retrieving the corresponding geotagged satellite tile from a reference gallery given UAV observations of a location. In many existing formulations, these observations are represented by a single oblique UAV image. In contrast, our satellite-free setting is designed for multi-view UAV sequences, which are used to construct a geometry-normalized UAV-side location representation before cross-view retrieval. Existing approaches rely on satellite imagery during training, either through paired supervision or unsupervised alignment, which limits practical deployment when satellite data are unavailable or restricted. In this paper, we propose a satellite-free training (SFT) framework that converts drone imagery into cross-view compatible representations through three main stages: drone-side 3D scene reconstruction, geometry-based pseudo-orthophoto generation, and satellite-free feature aggregation for retrieval. Specifically, we first reconstruct dense 3D scenes from multi-view drone images using 3D Gaussian splatting and project the reconstructed geometry into pseudo-orthophotos via PCA-guided orthographic projection. This rendering stage operates directly on reconstructed scene geometry without requiring camera parameters at rendering time. Next, we refine these orthophotos with lightweight geometry-guided inpainting to obtain texture-complete drone-side views. Finally, we extract DINOv3 patch features from the generated orthophotos, learn a Fisher vector aggregation model solely from drone data, and reuse it at test time to encode satellite tiles for cross-view retrieval. Experimental results on University-1652 and SUES-200 show that our SFT framework substantially outperforms satellite-free generalization baselines and narrows the gap to methods trained with satellite imagery.
[136] SHOE: Semantic HOI Open-Vocabulary Evaluation Metric
Maja Noack, Qinqian Lei, Taipeng Tian, Bihan Dong, Robby T. Tan, Yixin Chen, John Young, Saijun Zhang, Bo Wang
Main category: cs.CV
TL;DR: SHOE introduces a semantic evaluation framework for open-vocabulary human-object interaction detection that uses LLMs to assess similarity between predicted and ground-truth interactions beyond exact string matching.
Details
Motivation: Standard HOI detection metrics like mAP treat interactions as discrete categorical labels and fail to credit semantically valid but lexically different predictions, limiting their applicability for open-vocabulary scenarios where predictions go beyond predefined label sets.Method: SHOE decomposes HOI predictions into verb and object components, estimates their semantic similarity using the average of multiple LLMs, and combines them into a similarity score for evaluating alignment beyond exact string match.
Result: SHOE scores align more closely with human judgments than existing metrics (including LLM-based and embedding-based baselines), achieving 85.73% agreement with average human ratings.
Conclusion: The work underscores the need for semantically grounded HOI evaluation that better mirrors human understanding of interactions and will release the evaluation metric to facilitate future research.
Abstract: Open-vocabulary human-object interaction (HOI) detection is a step towards building scalable systems that generalize to unseen interactions in real-world scenarios and support grounded multimodal systems that reason about human-object relationships. However, standard evaluation metrics, such as mean Average Precision (mAP), treat HOI classes as discrete categorical labels and fail to credit semantically valid but lexically different predictions (e.g., “lean on couch” vs. “sit on couch”), limiting their applicability for evaluating open-vocabulary predictions that go beyond any predefined set of HOI labels. We introduce SHOE (Semantic HOI Open-Vocabulary Evaluation), a new evaluation framework that incorporates semantic similarity between predicted and ground-truth HOI labels. SHOE decomposes each HOI prediction into its verb and object components, estimates their semantic similarity using the average of multiple large language models (LLMs), and combines them into a similarity score to evaluate alignment beyond exact string match. This enables a flexible and scalable evaluation of both existing HOI detection methods and open-ended generative models using standard benchmarks such as HICO-DET. Experimental results show that SHOE scores align more closely with human judgments than existing metrics, including LLM-based and embedding-based baselines, achieving an agreement of 85.73% with the average human ratings. Our work underscores the need for semantically grounded HOI evaluation that better mirrors human understanding of interactions. We will release our evaluation metric to the public to facilitate future research.
[137] Mitigating the ID-OOD Tradeoff in Open-Set Test-Time Adaptation
Wenjie Zhao, Jia Li, Xin Dong, Yapeng Tian, Yu Xiang, Yunhui Guo
Main category: cs.CV
TL;DR: ROSETTA: A robust open-set test-time adaptation method that improves OOD detection while maintaining ID classification performance by addressing the entropy minimization-maximization conflict through angular and feature-norm losses.
Details
Motivation: Open-set test-time adaptation faces challenges when both in-distribution (ID) and out-of-distribution (OOD) samples are affected by distribution shifts. Existing methods using entropy minimization for ID classification and entropy maximization for OOD detection create conflicts, leading to trade-offs between csID classification and csOOD detection performance.Method: Proposes ROSETTA with two key components: 1) Angular loss to regulate feature norm magnitudes, and 2) Feature-norm loss to suppress csOOD logits. These address the limitations of entropy maximization in OSTTA settings.
Result: Achieves strong OOD detection while maintaining high ID classification performance on CIFAR-10-C, CIFAR-100-C, Tiny-ImageNet-C, and ImageNet-C. Validated on Cityscapes for real-world semantic segmentation and HAC dataset for different open-set TTA setups.
Conclusion: ROSETTA effectively addresses the conflict between entropy minimization and maximization in open-set test-time adaptation, providing robust performance for both ID classification and OOD detection under distribution shifts.
Abstract: Open-set test-time adaptation (OSTTA) addresses the challenge of adapting models to new environments where out-of-distribution (OOD) samples coexist with in-distribution (ID) samples affected by distribution shifts. In such settings, covariate shift-for example, changes in weather conditions such as snow-can alter ID samples, reducing model reliability. Consequently, models must not only correctly classify covariate-shifted ID (csID) samples but also effectively reject covariate-shifted OOD (csOOD) samples. Entropy minimization is a common strategy in test-time adaptation to maintain ID performance under distribution shifts, while entropy maximization is widely applied to enhance OOD detection. Several studies have sought to combine these objectives to tackle the challenges of OSTTA. However, the intrinsic conflict between entropy minimization and maximization inevitably leads to a trade-off between csID classification and csOOD detection. In this paper, we first analyze the limitations of entropy maximization in OSTTA and then introduce an angular loss to regulate feature norm magnitudes, along with a feature-norm loss to suppress csOOD logits, thereby improving OOD detection. These objectives form ROSETTA, a $\underline{r}$obust $\underline{o}$pen-$\underline{se}$t $\underline{t}$est-$\underline{t}$ime $\underline{a}$daptation. Our method achieves strong OOD detection while maintaining high ID classification performance on CIFAR-10-C, CIFAR-100-C, Tiny-ImageNet-C and ImageNet-C. Furthermore, experiments on the Cityscapes validate the method’s effectiveness in real-world semantic segmentation, and results on the HAC dataset demonstrate its applicability across different open-set TTA setups.
[138] Riemannian and Symplectic Geometry for Hierarchical Text-Driven Place Recognition
Tianyi Shang, Zhenyu Li
Main category: cs.CV
TL;DR: SympLoc is a novel coarse-to-fine text-to-point-cloud localization framework that uses multi-level alignment (instance, relation, and global levels) to overcome information loss in existing pooled global descriptor methods.
Details
Motivation: Existing text-to-point-cloud localization methods use pooled global descriptors for similarity retrieval, which suffer from severe information loss and fail to capture discriminative scene structures needed for accurate spatial understanding through natural language.Method: SympLoc employs a coarse-to-fine framework with three complementary alignment levels: 1) Instance-level alignment using Riemannian self-attention in hyperbolic space, 2) Relation-level alignment using Information-Symplectic Relation Encoder (ISRE) with Fisher-Rao metric and Hamiltonian dynamics, and 3) Global-level alignment using Spectral Manifold Transform (SMT) for structural invariants.
Result: Extensive experiments on KITTI360Pose dataset show SympLoc achieves 19% improvement in Top-1 recall@10m compared to state-of-the-art approaches.
Conclusion: The hierarchical multi-level alignment strategy effectively captures fine-grained to coarse-grained scene semantics, enabling robust cross-modal retrieval for text-to-point-cloud localization.
Abstract: Text-to-point-cloud localization enables robots to understand spatial positions through natural language descriptions, which is crucial for human-robot collaboration in applications such as autonomous driving and last-mile delivery. However, existing methods employ pooled global descriptors for similarity retrieval, which suffer from severe information loss and fail to capture discriminative scene structures. To address these issues, we propose SympLoc, a novel coarse-to-fine localization framework with multi-level alignment in the coarse stage. Different from previous methods that rely solely on global descriptors, our coarse stage consists of three complementary alignment levels: 1) Instance-level alignment establishes direct correspondence between individual object instances in point clouds and textual hints through Riemannian self-attention in hyperbolic space; 2) Relation-level alignment explicitly models pairwise spatial relationships between objects using the Information-Symplectic Relation Encoder (ISRE), which reformulates relation features through Fisher-Rao metric and Hamiltonian dynamics for uncertainty-aware geometrically consistent propagation; 3) Global-level alignment synthesizes discriminative global descriptors via the Spectral Manifold Transform (SMT) that extracts structural invariants through graph spectral analysis. This hierarchical alignment strategy progressively captures fine-grained to coarse-grained scene semantics, enabling robust cross-modal retrieval. Extensive experiments on the KITTI360Pose dataset demonstrate that SympLoc achieves a 19% improvement in Top-1 recall@10m compared to existing state-of-the-art approaches.
[139] Towards Minimal Focal Stack in Shape from Focus
Khurram Ashfaq, Muhammad Tariq Mahmood
Main category: cs.CV
TL;DR: Proposes focal stack augmentation for Shape from Focus (SFF) to enable depth estimation from just 2 images instead of dense focal stacks, using physics-based augmentation with all-in-focus images and Energy-of-Difference maps, plus a deep network with multi-scale ConvGRUs.
Details
Motivation: Traditional SFF methods require densely sampled, large focal stacks which limits practical applicability. The authors aim to enable SFF with minimal stack size (just 2 images) while maintaining accuracy.Method: 1) Physics-based focal stack augmentation: estimates all-in-focus (AiF) image from two input images, computes Energy-of-Difference (EOD) maps between AiF and inputs. 2) Deep network architecture: computes deep focus volume from augmented stacks, uses convolutional Gated Recurrent Units (ConvGRUs) at multiple scales for iterative depth refinement.
Result: Extensive experiments on synthetic and real-world datasets show the augmentation benefits existing SFF models, enabling comparable accuracy with minimal stack size. Approach maintains state-of-the-art performance with just 2 images.
Conclusion: Proposed focal stack augmentation enables practical SFF with minimal image requirements (2 images), overcoming limitations of traditional dense focal stacks while maintaining accuracy through physics-based cues and deep learning refinement.
Abstract: Shape from Focus (SFF) is a depth reconstruction technique that estimates scene structure from focus variations observed across a focal stack, that is, a sequence of images captured at different focus settings. A key limitation of SFF methods is their reliance on densely sampled, large focal stacks, which limits their practical applicability. In this study, we propose a focal stack augmentation that enables SFF methods to estimate depth using a reduced stack of just two images, without sacrificing precision. We introduce a simple yet effective physics-based focal stack augmentation that enriches the stack with two auxiliary cues: an all-in-focus (AiF) image estimated from two input images, and Energy-of-Difference (EOD) maps, computed as the energy of differences between the AiF and input images. Furthermore, we propose a deep network that computes a deep focus volume from the augmented focal stacks and iteratively refines depth using convolutional Gated Recurrent Units (ConvGRUs) at multiple scales. Extensive experiments on both synthetic and real-world datasets demonstrate that the proposed augmentation benefits existing state-of-the-art SFF models, enabling them to achieve comparable accuracy. The results also show that our approach maintains state-of-the-art performance with a minimal stack size.
[140] F3DGS: Federated 3D Gaussian Splatting for Decentralized Multi-Agent World Modeling
Morui Zhu, Mohammad Dehghani Tezerjani, Mátyás Szántó, Márton Vaitkus, Song Fu, Qing Yang
Main category: cs.CV
TL;DR: F3DGS is a federated learning framework for 3D Gaussian Splatting that enables decentralized multi-agent 3D reconstruction without centralized data aggregation.
Details
Motivation: Existing 3DGS pipelines require centralized access to all observations, which is impractical for distributed robotic systems where agents operate independently and data aggregation may be restricted. Direct multi-agent extensions face communication overhead and geometric inconsistency issues.Method: First constructs a shared geometric scaffold by registering locally merged LiDAR point clouds to initialize a global 3DGS model. During federated optimization, Gaussian positions are fixed to preserve alignment, while clients update only appearance attributes (covariance, opacity, spherical harmonic coefficients). Server uses visibility-aware aggregation, weighting each client’s contribution by observation frequency to handle partial observability.
Result: F3DGS achieves reconstruction quality comparable to centralized training while enabling distributed optimization across agents. The framework is evaluated on a new multi-sequence indoor dataset with synchronized LiDAR, RGB, and IMU measurements.
Conclusion: F3DGS provides an effective federated approach for multi-agent 3D reconstruction that maintains geometric consistency while distributing computation, with applications in distributed robotic systems. The dataset and code will be publicly released.
Abstract: We present F3DGS, a federated 3D Gaussian Splatting framework for decentralized multi-agent 3D reconstruction. Existing 3DGS pipelines assume centralized access to all observations, which limits their applicability in distributed robotic settings where agents operate independently, and centralized data aggregation may be restricted. Directly extending centralized training to multi-agent systems introduces communication overhead and geometric inconsistency. F3DGS first constructs a shared geometric scaffold by registering locally merged LiDAR point clouds from multiple clients to initialize a global 3DGS model. During federated optimization, Gaussian positions are fixed to preserve geometric alignment, while each client updates only appearance-related attributes, including covariance, opacity, and spherical harmonic coefficients. The server aggregates these updates using visibility-aware aggregation, weighting each client’s contribution by how frequently it observed each Gaussian, resolving the partial-observability challenge inherent to multi-agent exploration. To evaluate decentralized reconstruction, we collect a multi-sequence indoor dataset with synchronized LiDAR, RGB, and IMU measurements. Experiments show that F3DGS achieves reconstruction quality comparable to centralized training while enabling distributed optimization across agents. The dataset, development kit, and source code will be publicly released.
[141] NEMESIS: Noise-suppressed Efficient MAE with Enhanced Superpatch Integration Strategy
Kyeonghun Kim, Hyeonseok Jung, Youngung Han, Hyunsu Go, Eunseob Choi, Seongbin Park, Junsu Lim, Jiwon Yang, Sumin Lee, Insung Hwang, Ken Ying-Kai Liao, Nam-Joon Kim
Main category: cs.CV
TL;DR: NEMESIS is a memory-efficient masked autoencoder framework for 3D CT imaging that uses local superpatches, noise-enhanced reconstruction, dual-masking strategies, and cross-scale context aggregation to achieve state-of-the-art performance on medical imaging tasks with reduced computational cost.
Details
Motivation: Volumetric CT imaging is clinically important but annotation is expensive and time-consuming. Self-supervised learning for 3D CT faces challenges due to high memory costs of full-volume transformers and anisotropic spatial structure not well captured by conventional masking strategies.Method: Proposes NEMESIS framework operating on local 128x128x128 superpatches with three key components: (1) noise-enhanced reconstruction as pretext task, (2) Masked Anatomical Transformer Blocks with dual-masking through parallel plane-wise and axis-wise token removal, and (3) NEMESIS Tokens for cross-scale context aggregation.
Result: Achieves mean AUROC of 0.9633 on BTCV multi-organ classification benchmark, surpassing fully fine-tuned SuPreM (0.9493) and VoCo (0.9387). With only 10% annotations, retains AUROC of 0.9075. Reduces computational cost to 31.0 GFLOPs vs 985.8 GFLOPs for full-volume baseline.
Conclusion: NEMESIS provides a scalable and robust foundation for 3D medical imaging with strong label efficiency and significantly reduced computational requirements while maintaining high performance on medical imaging tasks.
Abstract: Volumetric CT imaging is essential for clinical diagnosis, yet annotating 3D volumes is expensive and time-consuming, motivating self-supervised learning (SSL) from unlabeled data. However, applying SSL to 3D CT remains challenging due to the high memory cost of full-volume transformers and the anisotropic spatial structure of CT data, which is not well captured by conventional masking strategies. We propose NEMESIS, a masked autoencoder (MAE) framework that operates on local 128x128x128 superpatches, enabling memory-efficient training while preserving anatomical detail. NEMESIS introduces three key components: (i) noise-enhanced reconstruction as a pretext task, (ii) Masked Anatomical Transformer Blocks (MATB) that perform dual-masking through parallel plane-wise and axis-wise token removal, and (iii) NEMESIS Tokens (NT) for cross-scale context aggregation. On the BTCV multi-organ classification benchmark, NEMESIS with a frozen backbone and a linear classifier achieves a mean AUROC of 0.9633, surpassing fully fine-tuned SuPreM (0.9493) and VoCo (0.9387). Under a low-label regime with only 10% of available annotations, it retains an AUROC of 0.9075, demonstrating strong label efficiency. Furthermore, the superpatch-based design reduces computational cost to 31.0 GFLOPs per forward pass, compared to 985.8 GFLOPs for the full-volume baseline, providing a scalable and robust foundation for 3D medical imaging.
[142] Tex3D: Objects as Attack Surfaces via Adversarial 3D Textures for Vision-Language-Action Models
Jiawei Chen, Simin Huang, Jiawei Du, Shuaihang Chen, Yu Tian, Mingjie Wei, Chao Yu, Zhaoxia Yin
Main category: cs.CV
TL;DR: Tex3D is a framework for optimizing 3D adversarial textures to attack Vision-Language-Action models in robotic manipulation, achieving high failure rates through physically plausible attacks.
Details
Motivation: Existing adversarial attacks on VLA models use language perturbations or 2D visual attacks that lack physical realism. 3D adversarial textures are more physically plausible threats but are challenging to optimize due to non-differentiable simulation environments.Method: Proposes Tex3D with two key components: 1) Foreground-Background Decoupling (FBD) enables differentiable texture optimization through dual-renderer alignment, and 2) Trajectory-Aware Adversarial Optimization (TAAO) prioritizes critical frames and uses vertex-based parameterization for stability across viewpoints.
Result: Tex3D significantly degrades VLA performance across multiple manipulation tasks, achieving task failure rates up to 96.7% in both simulation and real-robot settings.
Conclusion: The work exposes critical vulnerabilities of VLA systems to physically grounded 3D adversarial attacks and highlights the need for robustness-aware training in multimodal systems.
Abstract: Vision-language-action (VLA) models have shown strong performance in robotic manipulation, yet their robustness to physically realizable adversarial attacks remains underexplored. Existing studies reveal vulnerabilities through language perturbations and 2D visual attacks, but these attack surfaces are either less representative of real deployment or limited in physical realism. In contrast, adversarial 3D textures pose a more physically plausible and damaging threat, as they are naturally attached to manipulated objects and are easier to deploy in physical environments. Bringing adversarial 3D textures to VLA systems is nevertheless nontrivial. A central obstacle is that standard 3D simulators do not provide a differentiable optimization path from the VLA objective function back to object appearance, making it difficult to optimize through an end-to-end manner. To address this, we introduce Foreground-Background Decoupling (FBD), which enables differentiable texture optimization through dual-renderer alignment while preserving the original simulation environment. To further ensure that the attack remains effective across long-horizon and diverse viewpoints in the physical world, we propose Trajectory-Aware Adversarial Optimization (TAAO), which prioritizes behaviorally critical frames and stabilizes optimization with a vertex-based parameterization. Built on these designs, we present Tex3D, the first framework for end-to-end optimization of 3D adversarial textures directly within the VLA simulation environment. Experiments in both simulation and real-robot settings show that Tex3D significantly degrades VLA performance across multiple manipulation tasks, achieving task failure rates of up to 96.7%. Our empirical results expose critical vulnerabilities of VLA systems to physically grounded 3D adversarial attacks and highlight the need for robustness-aware training.
[143] Automatic Image-Level Morphological Trait Annotation for Organismal Images
Vardaan Pahuja, Samuel Stevens, Alyson East, Sydne Record, Yu Su
Main category: cs.CV
TL;DR: Sparse autoencoders on foundation-model features yield monosemantic neurons that activate on morphological parts, enabling automated trait annotation pipeline for insect images.
Details
Motivation: Morphological trait extraction is slow and expert-driven, limiting large-scale ecological studies. Need scalable methods to link biological images to trait annotations.Method: Use sparse autoencoders on foundation-model features to get monosemantic, spatially grounded neurons. Pipeline localizes salient regions and uses vision-language prompting to generate trait descriptions.
Result: Created Bioscan-Traits dataset with 80K trait annotations across 19K insect images. Human evaluation confirms biological plausibility. Comprehensive ablation study assesses design sensitivity.
Conclusion: Provides scalable trait annotation pipeline to inject biologically meaningful supervision into foundation models, enabling large-scale morphological analyses and bridging ecology with ML.
Abstract: Morphological traits are physical characteristics of biological organisms that provide vital clues on how organisms interact with their environment. Yet extracting these traits remains a slow, expert-driven process, limiting their use in large-scale ecological studies. A major bottleneck is the absence of high-quality datasets linking biological images to trait-level annotations. In this work, we demonstrate that sparse autoencoders trained on foundation-model features yield monosemantic, spatially grounded neurons that consistently activate on meaningful morphological parts. Leveraging this property, we introduce a trait annotation pipeline that localizes salient regions and uses vision-language prompting to generate interpretable trait descriptions. Using this approach, we construct Bioscan-Traits, a dataset of 80K trait annotations spanning 19K insect images from BIOSCAN-5M. Human evaluation confirms the biological plausibility of the generated morphological descriptions. We assess design sensitivity through a comprehensive ablation study, systematically varying key design choices and measuring their impact on the quality of the resulting trait descriptions. By annotating traits with a modular pipeline rather than prohibitively expensive manual efforts, we offer a scalable way to inject biologically meaningful supervision into foundation models, enable large-scale morphological analyses, and bridge the gap between ecological relevance and machine-learning practicality.
[144] LivingWorld: Interactive 4D World Generation with Environmental Dynamics
Hyeongju Mun, In-Hwan Jin, Sohyeong Kim, Kyeongbo Kong
Main category: cs.CV
TL;DR: LivingWorld is an interactive framework for generating 4D worlds with environmental dynamics (clouds, water, smoke) from a single image, featuring progressive construction of globally coherent motion fields.
Details
Motivation: Current 3D scene generation focuses on static geometry, leaving scene-scale environmental dynamics largely unexplored. Modeling such dynamics is challenging due to the need for motion coherence across expanding scenes while supporting low-latency user feedback.Method: Progressively constructs globally coherent motion fields as scenes expand using a geometry-aware alignment module to resolve directional/scale ambiguities. Uses compact hash-based motion field representation for efficient querying and stable propagation of dynamics, supporting bidirectional motion propagation during rendering.
Result: Generates each new scene expansion in 9 seconds on RTX 5090 GPU, with 3 seconds for motion alignment and updates, enabling interactive 4D world generation with globally coherent environmental dynamics without expensive video-based refinement.
Conclusion: LivingWorld enables interactive generation of 4D worlds with environmental dynamics from single images, addressing the gap in scene-scale dynamic modeling while maintaining global consistency and low-latency performance.
Abstract: We introduce LivingWorld, an interactive framework for generating 4D worlds with environmental dynamics from a single image. While recent advances in 3D scene generation enable large-scale environment creation, most approaches focus primarily on reconstructing static geometry, leaving scene-scale environmental dynamics such as clouds, water, or smoke largely unexplored. Modeling such dynamics is challenging because motion must remain coherent across an expanding scene while supporting low-latency user feedback. LivingWorld addresses this challenge by progressively constructing a globally coherent motion field as the scene expands. To maintain global consistency during expansion, we introduce a geometry-aware alignment module that resolves directional and scale ambiguities across views. We further represent motion using a compact hash-based motion field, enabling efficient querying and stable propagation of dynamics throughout the scene. This representation also supports bidirectional motion propagation during rendering, producing long and temporally coherent 4D sequences without relying on expensive video-based refinement. On a single RTX 5090 GPU, generating each new scene expansion step requires 9 seconds, followed by 3 seconds for motion alignment and motion field updates, enabling interactive 4D world generation with globally coherent environmental dynamics. Video demonstrations are available at cvsp-lab.github.io/LivingWorld.
[145] MonoSAOD: Monocular 3D Object Detection with Sparsely Annotated Label
Junyoung Jung, Seokwon Kim, Jun Uk Kim
Main category: cs.CV
TL;DR: A framework for sparsely annotated monocular 3D object detection using road-aware patch augmentation and prototype-based filtering to handle limited 3D annotations.
Details
Motivation: Monocular 3D object detection requires dense 3D annotations which are expensive to obtain. Real-world scenarios often have sparse annotations, making current methods impractical due to high annotation costs.Method: Two key modules: 1) Road-Aware Patch Augmentation (RAPA) - augments segmented object patches onto road regions while preserving 3D geometric consistency; 2) Prototype-Based Filtering (PBF) - generates high-quality pseudo-labels by filtering predictions through prototype similarity and depth uncertainty, maintaining global 2D RoI feature prototypes.
Result: Extensive experiments demonstrate the effectiveness of the proposed method for robust detection under sparse supervision.
Conclusion: The framework successfully addresses the challenge of sparse annotations in monocular 3D object detection through geometry-preserving augmentation and prototype-guided pseudo-labeling.
Abstract: Monocular 3D object detection has achieved impressive performance on densely annotated datasets. However, it struggles when only a fraction of objects are labeled due to the high cost of 3D annotation. This sparsely annotated setting is common in real-world scenarios where annotating every object is impractical. To address this, we propose a novel framework for sparsely annotated monocular 3D object detection with two key modules. First, we propose Road-Aware Patch Augmentation (RAPA), which leverages sparse annotations by augmenting segmented object patches onto road regions while preserving 3D geometric consistency. Second, we propose Prototype-Based Filtering (PBF), which generates high-quality pseudo-labels by filtering predictions through prototype similarity and depth uncertainty. It maintains global 2D RoI feature prototypes and selects pseudo-labels that are both feature-consistent with learned prototypes and have reliable depth estimates. Our training strategy combines geometry-preserving augmentation with prototype-guided pseudo-labeling to achieve robust detection under sparse supervision. Extensive experiments demonstrate the effectiveness of the proposed method. The source code is available at https://github.com/VisualAIKHU/MonoSAOD .
[146] DynaVid: Learning to Generate Highly Dynamic Videos using Synthetic Motion Data
Wonjoon Jin, Jiyun Won, Janghyeok Han, Qi Dai, Chong Luo, Seung-Hwan Baek, Sunghyun Cho
Main category: cs.CV
TL;DR: DynaVid: A video synthesis framework that uses synthetic motion data (optical flow) to improve dynamic motion generation and controllability while preserving visual realism from real videos.
Details
Motivation: Video diffusion models struggle with highly dynamic motions and fine-grained motion controllability due to scarcity of such examples in training datasets. Synthetic motion data can provide diverse motion patterns and precise control signals that are hard to obtain from real data.Method: Two-stage framework: 1) Motion generator synthesizes motion using rendered optical flow from computer graphics pipelines, which encodes only motion (decoupled from appearance). 2) Motion-guided video generator produces video frames conditioned on that motion. This allows learning dynamic motion from synthetic data while preserving visual realism from real videos.
Result: Validated on vigorous human motion generation and extreme camera motion control scenarios where existing datasets are limited. Extensive experiments show improved realism and controllability in dynamic motion generation and camera motion control.
Conclusion: DynaVid effectively addresses limitations in video diffusion models by leveraging synthetic motion data while maintaining visual realism, enabling better dynamic motion generation and controllability.
Abstract: Despite recent progress, video diffusion models still struggle to synthesize realistic videos involving highly dynamic motions or requiring fine-grained motion controllability. A central limitation lies in the scarcity of such examples in commonly used training datasets. To address this, we introduce DynaVid, a video synthesis framework that leverages synthetic motion data in training, which is represented as optical flow and rendered using computer graphics pipelines. This approach offers two key advantages. First, synthetic motion offers diverse motion patterns and precise control signals that are difficult to obtain from real data. Second, unlike rendered videos with artificial appearances, rendered optical flow encodes only motion and is decoupled from appearance, thereby preventing models from reproducing the unnatural look of synthetic videos. Building on this idea, DynaVid adopts a two-stage generation framework: a motion generator first synthesizes motion, and then a motion-guided video generator produces video frames conditioned on that motion. This decoupled formulation enables the model to learn dynamic motion patterns from synthetic data while preserving visual realism from real-world videos. We validate our framework on two challenging scenarios, vigorous human motion generation and extreme camera motion control, where existing datasets are particularly limited. Extensive experiments demonstrate that DynaVid improves the realism and controllability in dynamic motion generation and camera motion control.
[147] Robust Embodied Perception in Dynamic Environments via Disentangled Weight Fusion
Juncen Guo, Xiaoguang Zhu, Jingyi Wu, Jingyu Zhang, Jingnan Cai, Zhenghao Niu, Liang Song
Main category: cs.CV
TL;DR: A domain-id and exemplar-free incremental learning framework for embodied multimedia systems that enables robust continuous environment adaptation without storing historical data or requiring domain identifiers.
Details
Motivation: Embodied perception systems face challenges of dynamic environment distribution drift in open physical spaces. Existing domain incremental methods rely on pre-obtained domain IDs during testing, limiting practicability in unknown interaction scenarios. Models often overfit to context-specific perceptual noise, leading to insufficient generalization and catastrophic forgetting.Method: Proposes a disentangled representation mechanism to remove non-essential environmental style interference and focus on extracting semantic intrinsic features shared across scenes. Uses weight fusion strategy to dynamically integrate old and new environment knowledge in parameter space without storing historical data.
Result: Extensive experiments on multiple standard benchmark datasets show the method significantly reduces catastrophic forgetting in completely exemplar-free and domain-id free settings, with accuracy better than existing state-of-the-art methods.
Conclusion: The proposed framework enables robust continuous environment adaptation for embodied multimedia systems by eliminating perceptual uncertainty and improving generalization while preventing catastrophic forgetting without requiring domain identifiers or exemplar storage.
Abstract: Embodied perception systems face severe challenges of dynamic environment distribution drift when they continuously interact in open physical spaces. However, the existing domain incremental awareness methods often rely on the domain id obtained in advance during the testing phase, which limits their practicability in unknown interaction scenarios. At the same time, the model often overfits to the context-specific perceptual noise, which leads to insufficient generalization ability and catastrophic forgetting. To address these limitations, we propose a domain-id and exemplar-free incremental learning framework for embodied multimedia systems, which aims to achieve robust continuous environment adaptation. This method designs a disentangled representation mechanism to remove non-essential environmental style interference, and guide the model to focus on extracting semantic intrinsic features shared across scenes, thereby eliminating perceptual uncertainty and improving generalization. We further use the weight fusion strategy to dynamically integrate the old and new environment knowledge in the parameter space, so as to ensure that the model adapts to the new distribution without storing historical data and maximally retains the discrimination ability of the old environment. Extensive experiments on multiple standard benchmark datasets show that the proposed method significantly reduces catastrophic forgetting in a completely exemplar-free and domain-id free setting, and its accuracy is better than the existing state-of-the-art methods.
[148] HOT: Harmonic-Constrained Optimal Transport for Remote Photoplethysmography Domain Adaptation
Ba-Thinh Nguyen, Thi-Duyen Ngo, Thanh-Trung Huynh, Thanh-Ha Le, Huy-Hieu Pham
Main category: cs.CV
TL;DR: The paper proposes FDA and HOT methods to improve rPPG model robustness against domain shifts by transferring low-frequency appearance characteristics and using harmonic constraints for alignment.
Details
Motivation: Remote photoplethysmography (rPPG) suffers from performance degradation under domain shift due to overfitting to appearance-related factors like illumination, camera characteristics, and color response that vary across domains.Method: Introduces Frequency Domain Adaptation (FDA) to transfer low-frequency spectral components encoding domain-dependent appearance characteristics, and Harmonic-Constrained Optimal Transport (HOT) to leverage harmonic properties of cardiac signals for physiologically consistent alignment between original and FDA-transferred representations.
Result: Extensive cross-dataset experiments demonstrate that the proposed FDA and HOT framework effectively enhances the robustness and generalization of rPPG models across diverse datasets.
Conclusion: FDA and HOT provide a principled strategy for modeling appearance variation in rPPG, enabling models to learn invariance to appearance variations while retaining cardiac-induced signals, improving practical deployment under domain shift.
Abstract: Remote photoplethysmography (rPPG) enables non-contact physiological measurement from facial videos; however, its practical deployment is often hindered by substantial performance degradation under domain shift. While recent deep learning-based rPPG methods have achieved strong performance on individual datasets, they frequently overfit to appearance-related factors, such as illumination, camera characteristics, and color response, that vary significantly across domains. To address this limitation, we introduce frequency domain adaptation (FDA) as a principled strategy for modeling appearance variation in rPPG. By transferring low-frequency spectral components that encode domain-dependent appearance characteristics, FDA encourages rPPG models to learn invariance to appearance variations while retaining cardiac-induced signals. To further support physiologically consistent alignment under such appearance variation, we propose Harmonic-Constrained Optimal Transport (HOT), which leverages the harmonic property of cardiac signals to guide alignment between original and FDA-transferred representations. Extensive cross-dataset experiments demonstrate that the proposed FDA and HOT framework effectively enhances the robustness and generalization of rPPG models across diverse datasets.
[149] GPA: Learning GUI Process Automation from Demonstrations
Zirui Zhao, Jun Hao Liew, Yan Yang, Wenzhuo Yang, Ziyang Luo, Doyen Sahoo, Silvio Savarese, Junnan Li
Main category: cs.CV
TL;DR: GUI Process Automation (GPA) is a vision-based robotic process automation system that enables fast, stable GUI task execution with single demonstration, offering robustness, determinism, and privacy advantages over traditional RPA and VLM-based agents.
Details
Motivation: Traditional RPA systems are fragile and break easily with UI changes, while current vision language model-based GUI agents suffer from non-deterministic risks and reliability issues. There's a need for a more robust, deterministic, and privacy-preserving approach to GUI automation.Method: GPA uses Sequential Monte Carlo-based localization to handle rescaling and detection uncertainty, readiness calibration for deterministic execution, and fully local execution for privacy. It learns from a single demonstration and can be used as a tool by other agents with coding capabilities.
Result: In pilot experiments comparing GPA with Gemini 3 Pro (with CUA tools), GPA achieved higher success rates with 10 times faster execution speed for long-horizon GUI tasks.
Conclusion: GPA provides a robust, deterministic, and privacy-preserving solution for GUI automation that outperforms current VLM-based approaches in both success rate and speed, making it suitable for enterprise workflows.
Abstract: GUI Process Automation (GPA) is a lightweight but general vision-based Robotic Process Automation (RPA), which enables fast and stable process replay with only a single demo. Addressing the fragility of traditional RPA and the non-deterministic risks of current vision language model-based GUI agents, GPA introduces three core benefits: (1) Robustness via Sequential Monte Carlo-based localization to handle rescaling and detection uncertainty; (2) Deterministic and Reliability safeguarded by readiness calibration; and (3) Privacy through fast, fully local execution. This approach delivers the adaptability, robustness, and security required for enterprise workflows. It can also be used as an MCP/CLI tool by other agents with coding capabilities so that the agent only reasons and orchestrates while GPA handles the GUI execution. We conducted a pilot experiment to compare GPA with Gemini 3 Pro (with CUA tools) and found that GPA achieves higher success rate with 10 times faster execution speed in finishing long-horizon GUI tasks.
[150] Director: Instance-aware Gaussian Splatting for Dynamic Scene Modeling and Understanding
Yuheng Jiang, Yiwen Cai, Zihao Wang, Yize Wu, Sicheng Li, Zhuo Su, Shaohui Jiao, Lan Xu
Main category: cs.CV
TL;DR: Director: A unified spatio-temporal Gaussian representation that jointly models human performance, high-fidelity rendering, and instance-level semantics for volumetric video, enabling language-aligned 4D representations and robust dynamic scene understanding.
Details
Motivation: Current Gaussian-based volumetric video approaches focus on appearance but lack instance-level structure awareness, limiting stable tracking and semantic reasoning in highly dynamic scenarios. There's a need for representations that combine high-fidelity rendering with semantic understanding.Method: 1) Embed instance-consistent semantics using temporally aligned instance masks and sentence embeddings from MLLMs to supervise learnable semantic features via MLP decoders. 2) Bridge 2D optical flow with 4D Gaussians to finetune motions for temporal stability. 3) Introduce geometry-aware SDF constraints and regularization terms for surface continuity and temporal coherence.
Result: Achieves temporally coherent 4D reconstructions while enabling instance segmentation and open-vocabulary querying. Demonstrates improved scene decomposition and robust dynamic scene understanding.
Conclusion: Embedding instance-consistent semantics naturally complements 4D modeling, enabling more accurate scene decomposition while supporting robust dynamic scene understanding. The approach successfully integrates rendering fidelity with semantic reasoning.
Abstract: Volumetric video seeks to model dynamic scenes as temporally coherent 4D representations. While recent Gaussian-based approaches achieve impressive rendering fidelity, they primarily emphasize appearance but are largely agnostic to instance-level structure, limiting stable tracking and semantic reasoning in highly dynamic scenarios. In this paper, we present Director, a unified spatio-temporal Gaussian representation that jointly models human performance, high-fidelity rendering, and instance-level semantics. Our key insight is that embedding instance-consistent semantics naturally complements 4D modeling, enabling more accurate scene decomposition while supporting robust dynamic scene understanding. To this end, we leverage temporally aligned instance masks and sentence embeddings derived from Multimodal Large Language Models to supervise the learnable semantic features of each Gaussian via two MLP decoders, enabling language-aligned 4D representations and enforcing identity consistency over time. To enhance temporal stability, we bridge 2D optical flow with 4D Gaussians and finetune their motions, yielding reliable initialization and reducing drift. For the training, we further introduce a geometry-aware SDF constraints, along with regularization terms that enforces surface continuity, enhancing temporal coherence in dynamic foreground modeling. Experiments demonstrate that Director achieves temporally coherent 4D reconstructions while simultaneously enabling instance segmentation and open-vocabulary querying.
[151] BTS-rPPG: Orthogonal Butterfly Temporal Shifting for Remote Photoplethysmography
Ba-Thinh Nguyen, Thi-Duyen Ngo, Thanh-Trung Huynh, Thanh-Ha Le, Huy-Hieu Pham
Main category: cs.CV
TL;DR: BTS-rPPG introduces a novel temporal modeling framework for remote photoplethysmography using orthogonal butterfly temporal shifting to improve long-range temporal receptive fields for physiological signal estimation from facial videos.
Details
Motivation: Current deep learning methods for rPPG rely on temporal shifting or convolutional operators that primarily aggregate information from neighboring frames, resulting in limited temporal receptive fields and insufficient modeling of long-range physiological dynamics.Method: Proposes BTS-rPPG framework with Orthogonal Butterfly Temporal Shifting (BTS) inspired by FFT butterfly patterns, using XOR-based butterfly pairing to establish structured frame interactions, plus orthogonal feature transfer mechanism to filter redundant features before cross-frame transmission.
Result: Extensive experiments on multiple benchmark datasets show BTS-rPPG improves long-range temporal modeling of physiological dynamics and consistently outperforms existing temporal modeling strategies for rPPG estimation.
Conclusion: The BTS-rPPG framework effectively addresses limitations in temporal receptive fields for rPPG by enabling efficient propagation of information across distant frames through structured butterfly interactions and orthogonal feature filtering.
Abstract: Remote photoplethysmography (rPPG) enables contactless physiological sensing from facial videos by analyzing subtle appearance variations induced by blood circulation. However, modeling the temporal dynamics of these signals remains challenging, as many deep learning methods rely on temporal shifting or convolutional operators that aggregate information primarily from neighboring frames, resulting in predominantly local temporal modeling and limited temporal receptive fields. To address this limitation, we propose BTS-rPPG, a temporal modeling framework based on Orthogonal Butterfly Temporal Shifting (BTS). Inspired by the butterfly communication pattern in the Fast Fourier Transform (FFT), BTS establishes structured frame interactions via an XOR-based butterfly pairing schedule, progressively expanding the temporal receptive field and enabling efficient propagation of information across distant frames. Furthermore, we introduce an orthogonal feature transfer mechanism (OFT) that filters the source feature with respect to the target context before temporal shifting, retaining only the orthogonal component for cross-frame transmission. This reduces redundant feature propagation and encourages complementary temporal interaction. Extensive experiments on multiple benchmark datasets demonstrate that BTS-rPPG improves long-range temporal modeling of physiological dynamics and consistently outperforms existing temporal modeling strategies for rPPG estimation.
[152] From Understanding to Erasing: Towards Complete and Stable Video Object Removal
Dingming Liu, Wenjing Wang, Chen Li, Jing Lyu
Main category: cs.CV
TL;DR: Video object removal method that uses external knowledge distillation from vision foundation models and internal framewise context cross-attention to remove objects and their induced side effects while maintaining spatio-temporal coherence.
Details
Motivation: Current video object removal methods struggle with removing object-induced side effects like shadows, reflections, and illumination changes while maintaining overall coherence, due to insufficient physical and semantic understanding of object-scene interactions.Method: Two complementary approaches: 1) External guidance via distillation scheme transferring object-effect relationships from vision foundation models to video diffusion models, and 2) Internal guidance through framewise context cross-attention mechanism that grounds denoising blocks in informative unmasked context around target regions.
Result: State-of-the-art performance demonstrated through extensive experiments, with establishment of first real-world benchmark for video object removal to facilitate future research.
Conclusion: The proposed approach enables comprehensive understanding of target objects, their induced effects, and global background context, resulting in clear and coherent object removal in videos.
Abstract: Video object removal aims to eliminate target objects from videos while plausibly completing missing regions and preserving spatio-temporal consistency. Although diffusion models have recently advanced this task, it remains challenging to remove object-induced side effects (e.g., shadows, reflections, and illumination changes) without compromising overall coherence. This limitation stems from the insufficient physical and semantic understanding of the target object and its interactions with the scene. In this paper, we propose to introduce understanding into erasing from two complementary perspectives. Externally, we introduce a distillation scheme that transfers the relationships between objects and their induced effects from vision foundation models to video diffusion models. Internally, we propose a framewise context cross-attention mechanism that grounds each denoising block in informative, unmasked context surrounding the target region. External and internal guidance jointly enable our model to understand the target object, its induced effects, and the global background context, resulting in clear and coherent object removal. Extensive experiments demonstrate our state-of-the-art performance, and we establish the first real-world benchmark for video object removal to facilitate future research and community progress. Our code, data, and models are available at: https://github.com/WeChatCV/UnderEraser.
[153] Bias mitigation in graph diffusion models
Meng Yu, Kun Zhan
Main category: cs.CV
TL;DR: A method to mitigate two biases in graph diffusion models: reverse-starting bias and exposure bias, using Langevin sampling alignment and score correction without network modifications.
Details
Motivation: Existing graph diffusion models suffer from significant bias problems where forward diffusion's maximum perturbation distribution deviates from standard Gaussian while reverse sampling starts from standard Gaussian, causing reverse-starting bias. Combined with inherent exposure bias, this degrades generation quality.Method: 1) Use newly designed Langevin sampling algorithm to align with forward maximum perturbation distribution, establishing new reverse-starting point to mitigate reverse-starting bias. 2) Introduce score correction mechanism based on newly defined score difference to address exposure bias. Approach requires no network modifications.
Result: Validated across multiple models, datasets, and tasks, achieving state-of-the-art results. Code available at provided GitHub repository.
Conclusion: Proposed comprehensive approach effectively mitigates both reverse-starting bias and exposure bias in graph diffusion models, improving generation quality without requiring network architecture changes.
Abstract: Most existing graph diffusion models have significant bias problems. We observe that the forward diffusion’s maximum perturbation distribution in most models deviates from the standard Gaussian distribution, while reverse sampling consistently starts from a standard Gaussian distribution, which results in a reverse-starting bias. Together with the inherent exposure bias of diffusion models, this results in degraded generation quality. This paper proposes a comprehensive approach to mitigate both biases. To mitigate reverse-starting bias, we employ a newly designed Langevin sampling algorithm to align with the forward maximum perturbation distribution, establishing a new reverse-starting point. To address the exposure bias, we introduce a score correction mechanism based on a newly defined score difference. Our approach, which requires no network modifications, is validated across multiple models, datasets, and tasks, achieving state-of-the-art results.Code is at https://github.com/kunzhan/spp
[154] End-to-End Shared Attention Estimation via Group Detection with Feedback Refinement
Chihiro Nakatani, Norimichi Ukita, Jean-Marc Odobez
Main category: cs.CV
TL;DR: End-to-end method for simultaneous group detection and shared attention estimation using a two-step process with heatmap generation and refinement.
Details
Motivation: Previous methods estimate shared attention without detecting actual groups or assume single SA points, limiting practical applicability and performance.Method: Two-step process: (1) generate SA heatmaps using individual gaze attention heatmaps and group membership scalars from group inference, (2) refine initial group memberships accounting for initial SA heatmaps and predict final SA heatmap.
Result: Method outperforms other approaches in both group detection and shared attention estimation. Additional analyses validate effectiveness of proposed components.
Conclusion: Proposed end-to-end approach successfully addresses limitations of previous methods by simultaneously achieving group detection and shared attention estimation.
Abstract: This paper proposes an end-to-end shared attention estimation method via group detection. Most previous methods estimate shared attention (SA) without detecting the actual group of people focusing on it, or assume that there is a single SA point in a given image. These issues limit the applicability of SA detection in practice and impact performance. To address them, we propose to simultaneously achieve group detection and shared attention estimation using a two step process: (i) the generation of SA heatmaps relying on individual gaze attention heatmaps and group membership scalars estimated in a group inference; (ii) a refinement of the initial group memberships allowing to account for the initial SA heatmaps, and the final prediction of the SA heatmap. Experiments demonstrate that our method outperforms other methods in group detection and shared attention estimation. Additional analyses validate the effectiveness of the proposed components. Code: https://github.com/chihina/sagd-CVPRW2026.
[155] SteerFlow: Steering Rectified Flows for Faithful Inversion-Based Image Editing
Thinh Dao, Zhen Wang, Kien T. Pham, Long Chen
Main category: cs.CV
TL;DR: SteerFlow: A model-agnostic text-guided image editing framework for flow-based generative models that improves source fidelity through amortized fixed-point inversion, trajectory interpolation, and adaptive masking.
Details
Motivation: Existing flow-based text-guided image editing methods struggle with preserving source fidelity - higher-order solvers require extra inferences, truncated inversion limits editability, and feature injection lacks transferability across models.Method: Three key components: 1) Amortized Fixed-Point Solver for high-fidelity inversion by enforcing velocity consistency, 2) Trajectory Interpolation to blend editing and reconstruction velocities, 3) Adaptive Masking using concept-guided segmentation to spatially constrain edits.
Result: Extensive experiments on FLUX.1-dev and Stable Diffusion 3.5 Medium show SteerFlow achieves better editing quality than existing methods and naturally extends to multi-turn editing without accumulating drift.
Conclusion: SteerFlow provides a model-agnostic framework with strong theoretical guarantees on source fidelity for text-guided image editing in flow-based generative models.
Abstract: Recent advances in flow-based generative models have enabled training-free, text-guided image editing by inverting an image into its latent noise and regenerating it under a new target conditional guidance. However, existing methods struggle to preserve source fidelity: higher-order solvers incur additional model inferences, truncated inversion constrains editability, and feature injection methods lack architectural transferability. To address these limitations, we propose SteerFlow, a model-agnostic editing framework with strong theoretical guarantees on source fidelity. In the forward process, we introduce an Amortized Fixed-Point Solver that implicitly straightens the forward trajectory by enforcing velocity consistency across consecutive timesteps, yielding a high-fidelity inverted latent. In the backward process, we introduce Trajectory Interpolation, which adaptively blends target-editing and source-reconstruction velocities to keep the editing trajectory anchored to the source. To further improve background preservation, we introduce an Adaptive Masking mechanism that spatially constrains the editing signal with concept-guided segmentation and source-target velocity differences. Extensive experiments on FLUX.1-dev and Stable Diffusion 3.5 Medium demonstrate that SteerFlow consistently achieves better editing quality than existing methods. Finally, we show that SteerFlow extends naturally to a complex multi-turn editing paradigm without accumulating drift.
[156] Setup-Independent Full Projector Compensation
Haibo Li, Qingyue Deng, Jijiang Li, Haibin Ling, Bingyao Huang
Main category: cs.CV
TL;DR: SIComp is the first setup-independent framework for full projector compensation that generalizes to unseen projector-camera setups without fine-tuning, using a co-adaptive design that decouples geometry and photometry correction.
Details
Motivation: Existing projector compensation methods are highly setup-dependent, requiring retraining when surface, lighting, or projector-camera pose changes, limited by lack of diverse training data and poor generalization to novel geometric configurations.Method: Constructed large-scale real-world dataset with 277 distinct setups; uses co-adaptive design with tailored optical flow module for online geometric correction and novel photometric network for photometric compensation; integrates intensity-varying surface priors for robustness under varying illumination.
Result: SIComp consistently produces high-quality compensation across diverse unseen setups, substantially outperforming existing methods in generalization ability, establishing the first generalizable solution to projector compensation.
Conclusion: SIComp successfully addresses the setup-dependency problem in projector compensation through a novel framework that generalizes to unseen configurations without fine-tuning, enabled by a large-scale dataset and co-adaptive architecture.
Abstract: Projector compensation seeks to correct geometric and photometric distortions that occur when images are projected onto nonplanar or textured surfaces. However, most existing methods are highly setup-dependent, requiring fine-tuning or retraining whenever the surface, lighting, or projector-camera pose changes. Progress has been limited by two key challenges: (1) the absence of large, diverse training datasets and (2) existing geometric correction models are typically constrained by specific spatial setups; without further retraining or fine-tuning, they often fail to generalize directly to novel geometric configurations. We introduce SIComp, the first Setup-Independent framework for full projector Compensation, capable of generalizing to unseen setups without fine-tuning or retraining. To enable this, we construct a large-scale real-world dataset spanning 277 distinct projector-camera setups. SIComp adopts a co-adaptive design that decouples geometry and photometry: A carefully tailored optical flow module performs online geometric correction, while a novel photometric network handles photometric compensation. To further enhance robustness under varying illumination, we integrate intensity-varying surface priors into the network design. Extensive experiments demonstrate that SIComp consistently produces high-quality compensation across diverse unseen setups, substantially outperforming existing methods in terms of generalization ability and establishing the first generalizable solution to projector compensation. The code and dataset are available on our project page: https://hai-bo-li.github.io/SIComp/
[157] Dense Point-to-Mask Optimization with Reinforced Point Selection for Crowd Instance Segmentation
Hongru Chen, Jiyang Huang, Jia Wan, Antoni B. Chan
Main category: cs.CV
TL;DR: Dense Point-to-Mask Optimization (DPMO) integrates SAM with NNEC constraints to generate dense instance segmentation from point annotations, and Reinforced Point Selection (RPS) with GRPO improves crowd instance segmentation performance.
Details
Motivation: Crowd instance segmentation is important for surveillance and transportation applications, but current methods struggle with dense crowds. Point labels are common but region labels are rare and inaccurate. Direct application of large foundation models like SAM doesn't work well in dense crowds.Method: Two-stage approach: 1) Dense Point-to-Mask Optimization (DPMO) integrates SAM with Nearest Neighbor Exclusive Circle (NNEC) constraint to generate dense instance segmentation from point annotations, 2) Reinforced Point Selection (RPS) framework trained with Group Relative Policy Optimization (GRPO) selects best predicted points from initial point predictions.
Result: Achieved state-of-the-art crowd instance segmentation performance on ShanghaiTech, UCF-QNRF, JHU-CROWD++, and NWPU-Crowd datasets. New mask-supervised loss functions boost counting performance across different models.
Conclusion: The proposed methods effectively address dense crowd instance segmentation challenges, demonstrating mask annotations significantly enhance counting accuracy and providing a practical solution for converting point annotations to masks.
Abstract: Crowd instance segmentation is a crucial task with a wide range of applications, including surveillance and transportation. Currently, point labels are common in crowd datasets, while region labels (e.g., boxes) are rare and inaccurate. The masks obtained through segmentation help to improve the accuracy of region labels and resolve the correspondence between individual location coordinates and crowd density maps. However, directly applying currently popular large foundation models such as SAM does not yield ideal results in dense crowds. To this end, we first propose Dense Point-to-Mask Optimization (DPMO), which integrates SAM with the Nearest Neighbor Exclusive Circle (NNEC) constraint to generate dense instance segmentation from point annotations. With DPMO and manual correction, we obtain mask annotations from the existing point annotations for traditional crowd datasets. Then, to predict instance segmentation in dense crowds, we propose a Reinforced Point Selection (RPS) framework trained with Group Relative Policy Optimization (GRPO), which selects the best predicted point from a sampling of the initial point prediction. Through extensive experiments, we achieve state-of-the-art crowd instance segmentation performance on ShanghaiTech, UCF-QNRF, JHU-CROWD++, and NWPU-Crowd datasets. Furthermore, we design new loss functions supervised by masks that boost counting performance across different models, demonstrating the significant role of mask annotations in enhancing counting accuracy.
[158] Unifying UAV Cross-View Geo-Localization via 3D Geometric Perception
Haoyuan Li, Wen Yang, Fang Xu, Hong Tan, Haijian Zhang, Shengyang Li, Gui-Song Xia
Main category: cs.CV
TL;DR: A geometry-aware UAV geo-localization framework that unifies place recognition and pose estimation by reconstructing 3D scenes from UAV images and rendering virtual Bird’s-Eye Views to align with satellite imagery.
Details
Motivation: Cross-view geo-localization for UAVs in GNSS-denied environments is challenging due to severe geometric discrepancies between oblique UAV imagery and orthogonal satellite maps. Existing methods treat perspective distortion as appearance noise rather than explicit geometric transformation.Method: Proposes a Visual Geometry Grounded Transformer (VGGT) to reconstruct local 3D scenes from multi-view UAV image sequences, then renders virtual Bird’s-Eye View representations that orthorectify UAV perspective to align with satellite imagery. Introduces Satellite-wise Attention Block to handle multiple location hypotheses efficiently.
Result: Extensive experiments on refined University-1652 benchmark and SUES-200 show the method significantly outperforms state-of-the-art baselines, achieving robust meter-level localization accuracy and improved generalization in complex urban environments.
Conclusion: The geometry-aware framework successfully unifies coarse place recognition and fine-grained pose estimation, explicitly modeling 3D scene geometry to overcome perspective distortion challenges in cross-view geo-localization.
Abstract: Cross-view geo-localization for Unmanned Aerial Vehicles (UAVs) operating in GNSS-denied environments remains challenging due to the severe geometric discrepancy between oblique UAV imagery and orthogonal satellite maps. Most existing methods address this problem through a decoupled pipeline of place retrieval and pose estimation, implicitly treating perspective distortion as appearance noise rather than an explicit geometric transformation. In this work, we propose a geometry-aware UAV geo-localization framework that explicitly models the 3D scene geometry to unify coarse place recognition and fine-grained pose estimation within a single inference pipeline. Our approach reconstructs a local 3D scene from multi-view UAV image sequences using a Visual Geometry Grounded Transformer (VGGT), and renders a virtual Bird’s-Eye View (BEV) representation that orthorectifies the UAV perspective to align with satellite imagery. This BEV serves as a geometric intermediary that enables robust cross-view retrieval and provides spatial priors for accurate 3 Degrees of Freedom (3-DoF) pose regression. To efficiently handle multiple location hypotheses, we introduce a Satellite-wise Attention Block that isolates the interaction between each satellite candidate and the reconstructed UAV scene, preventing inter-candidate interference while maintaining linear computational complexity. In addition, we release a recalibrated version of the University-1652 dataset with precise coordinate annotations and spatial overlap analysis, enabling rigorous evaluation of end-to-end localization accuracy. Extensive experiments on the refined University-1652 benchmark and SUES-200 demonstrate that our method significantly outperforms state-of-the-art baselines, achieving robust meter-level localization accuracy and improved generalization in complex urban environments.
[159] Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding
Jiayun Jin, Haolong Chai, Xueying Huang, Xiaoqing Guo, Zengwei Zheng, Zhan Zhou, Junmei Wang, Xinyu Wang, Jie Liu, Binbin Zhou
Main category: cs.CV
TL;DR: Ultrasound-CLIP: A vision-language pre-training framework specifically designed for ultrasound imaging using a large-scale dataset (US-365K) with structured diagnostic taxonomies and semantic-aware contrastive learning.
Details
Motivation: Existing vision-language models like CLIP are not well-suited for ultrasound data due to heterogeneous anatomical structures and diverse diagnostic attributes in medical imaging. There's a need for specialized pre-training for ultrasound modality.Method: Created US-365K dataset (365k ultrasound image-text pairs across 52 anatomical categories), established Ultrasonographic Diagnostic Taxonomy with hierarchical anatomical and diagnostic attribute frameworks, developed Ultrasound-CLIP with semantic-aware contrastive learning using semantic soft labels and loss, and built heterogeneous graph modality for structured reasoning.
Result: Achieved state-of-the-art performance on classification and retrieval benchmarks with patient-level data splitting, demonstrated strong generalization to zero-shot, linear probing, and fine-tuning tasks.
Conclusion: The proposed Ultrasound-CLIP framework effectively bridges the gap between general vision-language models and ultrasound-specific requirements, enabling better understanding and analysis of ultrasound data through structured knowledge integration.
Abstract: Ultrasound imaging is widely used in clinical diagnostics due to its real-time capability and radiation-free nature. However, existing vision-language pre-training models, such as CLIP, are primarily designed for other modalities, and are difficult to directly apply to ultrasound data, which exhibit heterogeneous anatomical structures and diverse diagnostic attributes. To bridge this gap, we construct US-365K, a large-scale ultrasound image-text dataset containing 365k paired samples across 52 anatomical categories. We establish Ultrasonographic Diagnostic Taxonomy (UDT) containing two hierarchical knowledge frameworks. Ultrasonographic Hierarchical Anatomical Taxonomy standardizes anatomical organization, and Ultrasonographic Diagnostic Attribute Framework formalizes nine diagnostic dimensions, including body system, organ, diagnosis, shape, margins, echogenicity, internal characteristics, posterior acoustic phenomena, and vascularity. Building upon these foundations, we propose Ultrasound-CLIP, a semantic-aware contrastive learning framework that introduces semantic soft labels and semantic loss to refine sample discrimination. Moreover, we construct a heterogeneous graph modality derived from UDAF’s textual representations, enabling structured reasoning over lesion-attribute relations. Extensive experiments with patient-level data splitting demonstrate that our approach achieves state-of-the-art performance on classification and retrieval benchmarks, while also delivering strong generalization to zero-shot, linear probing, and fine-tuning tasks.
[160] Control-DINO: Feature Space Conditioning for Controllable Image-to-Video Diffusion
Edoardo A. Dominici, Thomas Deixelberger, Konstantinos Vardis, Markus Steinberger
Main category: cs.CV
TL;DR: A method for using self-supervised visual features as conditioning signals for video diffusion models, enabling video domain transfer and 3D-to-video generation through feature decoupling and lightweight architecture.
Details
Motivation: While video models are used for generation and world simulation, and self-supervised features serve as general-purpose vision interfaces, their combination hasn't been explored as general conditioning for pretrained video diffusion models. Self-supervised features contain entangled information that limits generative capabilities.Method: Introduces a lightweight architecture and training strategy that decouples appearance (style, lighting) from other features to preserve, enabling robust control for appearance changes like stylization and relighting. Uses higher feature dimensionality to compensate for low spatial resolution.
Result: Enables video domain transfer and video-from-3D generation with improved controllability in generative rendering from explicit spatial representations.
Conclusion: Self-supervised features can effectively condition pretrained video diffusion models for generative tasks when properly decoupled, offering new capabilities in video domain transfer and 3D-to-video generation.
Abstract: Video models have recently been applied with success to problems in content generation, novel view synthesis, and, more broadly, world simulation. Many applications in generation and transfer rely on conditioning these models, typically through perceptual, geometric, or simple semantic signals, fundamentally using them as generative renderers. At the same time, high-dimensional features obtained from large-scale self-supervised learning on images or point clouds are increasingly used as a general-purpose interface for vision models. The connection between the two has been explored for subject specific editing, aligning and training video diffusion models, but not in the role of a more general conditioning signal for pretrained video diffusion models. Features obtained through self-supervised learning like DINO, contain a lot of entangled information about style, lighting and semantics of the scene. This makes them great at reconstruction tasks but limits their generative capabilities. In this paper, we show how we can use the features for tasks such as video domain transfer and video-from-3D generation. We introduce a lightweight architecture and training strategy that decouples appearance from other features that we wish to preserve, enabling robust control for appearance changes such as stylization and relighting. Furthermore, we show that low spatial resolution can be compensated by higher feature dimensionality, improving controllability in generative rendering from explicit spatial representations.
[161] Cosine-Normalized Attention for Hyperspectral Image Classification
Muhammad Ahmad, Manuel Mazzara
Main category: cs.CV
TL;DR: Proposes cosine-normalized attention for hyperspectral image classification that emphasizes angular relationships in spectral signatures rather than magnitude, improving Transformer performance with limited supervision.
Details
Motivation: Standard Transformer attention uses dot-product similarity which mixes feature magnitude and orientation, potentially suboptimal for hyperspectral data where angular relationships between spectral signatures are more informative than magnitude variations.Method: Introduces cosine-normalized attention that projects query and key embeddings onto a unit hypersphere and uses squared cosine similarity for attention scoring, emphasizing angular relationships while reducing sensitivity to magnitude variations. Integrated into a spatial-spectral Transformer architecture.
Result: Outperforms several recent Transformer- and Mamba-based models on three benchmark datasets despite using a lightweight backbone, especially under extremely limited supervision. Cosine-based scoring provides reliable inductive bias for hyperspectral representation learning.
Conclusion: Geometric perspective on attention scoring with cosine normalization better aligns with hyperspectral data characteristics, offering improved performance and more suitable inductive bias for hyperspectral image classification tasks.
Abstract: Transformer-based methods have improved hyperspectral image classification (HSIC) by modeling long-range spatial-spectral dependencies; however, their attention mechanisms typically rely on dot-product similarity, which mixes feature magnitude and orientation and may be suboptimal for hyperspectral data. This work revisits attention scoring from a geometric perspective and introduces a cosine-normalized attention formulation that aligns similarity computation with the angular structure of hyperspectral signatures. By projecting query and key embeddings onto a unit hypersphere and applying a squared cosine similarity, the proposed method emphasizes angular relationships while reducing sensitivity to magnitude variations. The formulation is integrated into a spatial-spectral Transformer and evaluated under extremely limited supervision. Experiments on three benchmark datasets demonstrate that the proposed approach consistently achieves higher performance, outperforming several recent Transformer- and Mamba-based models despite using a lightweight backbone. In addition, a controlled analysis of multiple attention score functions shows that cosine-based scoring provides a reliable inductive bias for hyperspectral representation learning.
[162] Hidden Meanings in Plain Sight: RebusBench for Evaluating Cognitive Visual Reasoning
Seyed Amir Kasaei, Arash Marioriyad, Mahbod Khaleti, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban
Main category: cs.CV
TL;DR: RebusBench benchmark reveals LVLMs struggle with rebus puzzles requiring neurosymbolic reasoning beyond direct visual recognition, showing performance below 10% despite having necessary visual/linguistic components.
Details
Motivation: Current LVLMs excel at explicit visual recognition but fail at problems where visual input serves as a clue rather than the answer, particularly in rebus puzzles requiring multi-step reasoning that integrates perception with prior knowledge.Method: Introduces RebusBench with 1,164 puzzles to evaluate neurosymbolic reasoning capabilities. Evaluates state-of-the-art models (Qwen, InternVL, LLaVA) on exact match and semantic accuracy metrics, testing model scaling and in-context learning approaches.
Result: Models perform poorly with less than 10% exact match and 20% semantic accuracy. No significant improvement from model scaling or in-context learning, suggesting models lack the cognitive reasoning to connect visual and linguistic components.
Conclusion: LVLMs possess necessary visual and linguistic components but lack the cognitive reasoning glue for neurosymbolic tasks like rebus puzzles, indicating a fundamental gap in current multimodal reasoning architectures.
Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable proficiency in explicit visual recognition, effectively describing what is directly visible in an image. However, a critical cognitive gap emerges when the visual input serves only as a clue rather than the answer. We identify that current models struggle with the complex, multi-step reasoning required to solve problems where information is not explicitly depicted. Successfully solving a rebus puzzle requires a distinct cognitive workflow: the model must extract visual and textual attributes, retrieve linguistic prior knowledge (such as idioms), and perform abstract mapping to synthesize these elements into a meaning that exists outside the pixel space. To evaluate this neurosymbolic capability, we introduce RebusBench, a benchmark of 1,164 puzzles designed to test this specific integration of perception and knowledge. Our evaluation of state-of-the-art models (including Qwen, InternVL, and LLaVA) shows a severe deficiency: performance saturates below 10% Exact Match and 20% semantic accuracy, with no significant improvement observed from model scaling or In-Context Learning (ICL). These findings suggest that while models possess the necessary visual and linguistic components, they lack the cognitive reasoning glue to connect them. Project page available at https://amirkasaei.com/rebusbench/.
[163] DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning
Yang Zhou, Xiaofeng Wang, Hao Shao, Letian Wang, Guosheng Zhao, Jiangnan Shao, Jiagang Zhu, Tingdong Yu, Zheng Zhu, Guan Huang, Steven L. Waslander
Main category: cs.CV
TL;DR: DriveDreamer-Policy is a unified driving world-action model that integrates depth generation, future video prediction, and motion planning in a single modular architecture with geometric grounding for autonomous driving.
Details
Motivation: Existing world-action models focus on 2D appearance or latent representations with limited geometric grounding, which is essential for embodied systems operating in the physical world. The authors aim to create a more geometry-aware model for autonomous driving.Method: Uses a large language model to process language instructions, multi-view images, and actions, followed by three lightweight generators for depth, future video, and actions. Learns geometry-aware world representations to guide both future prediction and planning within a unified framework.
Result: Achieves 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2, outperforming existing world-model-based approaches while producing higher-quality future video and depth predictions. Explicit depth learning provides complementary benefits to video imagination and improves planning robustness.
Conclusion: The unified driving world-action model with geometric grounding produces more coherent imagined futures and more informed driving actions while maintaining modularity and controllable latency, demonstrating the value of explicit geometric understanding in embodied AI systems.
Abstract: Recently, world-action models (WAM) have emerged to bridge vision-language-action (VLA) models and world models, unifying their reasoning and instruction-following capabilities and spatio-temporal world modeling. However, existing WAM approaches often focus on modeling 2D appearance or latent representations, with limited geometric grounding-an essential element for embodied systems operating in the physical world. We present DriveDreamer-Policy, a unified driving world-action model that integrates depth generation, future video generation, and motion planning within a single modular architecture. The model employs a large language model to process language instructions, multi-view images, and actions, followed by three lightweight generators that produce depth, future video, and actions. By learning a geometry-aware world representation and using it to guide both future prediction and planning within a unified framework, the proposed model produces more coherent imagined futures and more informed driving actions, while maintaining modularity and controllable latency. Experiments on the Navsim v1 and v2 benchmarks demonstrate that DriveDreamer-Policy achieves strong performance on both closed-loop planning and world generation tasks. In particular, our model reaches 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2, outperforming existing world-model-based approaches while producing higher-quality future video and depth predictions. Ablation studies further show that explicit depth learning provides complementary benefits to video imagination and improves planning robustness.
[164] FSKD: Monocular Forest Structure Inference via LiDAR-to-RGBI Knowledge Distillation
Taimur Khan, Hannes Feilhauer, Muhammad Jazib Zafar
Main category: cs.CV
TL;DR: FSKD: A knowledge distillation framework that transfers LiDAR-derived forest structure metrics to RGB-infrared imagery, enabling high-resolution forest monitoring without requiring LiDAR for inference.
Details
Motivation: Airborne LiDAR provides high-quality forest structure data but is costly and infrequent. There's a need for scalable, operational monitoring using more accessible RGB-infrared imagery while maintaining LiDAR-level accuracy.Method: Multi-modal knowledge distillation where a teacher model fuses RGBI imagery with LiDAR-derived metrics via cross-attention, and an RGBI-only SegFormer student learns to reproduce these outputs. The framework jointly predicts canopy height model (CHM), plant area index (PAI), and foliage height diversity (FHD).
Result: Achieves state-of-the-art zero-shot CHM performance (MedAE 4.17m, R²=0.51, IoU 0.87), outperforming baselines by 29-46% in MAE. Multi-modal fusion improves performance by 10-26% over RGBI-only training. Works effectively even with temporal mismatches between LiDAR and RGBI data.
Conclusion: FSKD enables scalable 20cm operational forest monitoring using only RGB-infrared imagery, removing strict co-acquisition constraints and supporting applications like Digital Twin Germany and national orthophoto programs.
Abstract: Very High Resolution (VHR) forest structure data at individual-tree scale is essential for carbon, biodiversity, and ecosystem monitoring. Still, airborne LiDAR remains costly and infrequent despite being the reference for forest structure metrics like Canopy Height Model (CHM), Plant Area Index (PAI), and Foliage Height Diversity (FHD). We propose FSKD: a LiDAR-to-RGB-Infrared (RGBI) knowledge distillation (KD) framework in which a multi-modal teacher fuses RGBI imagery with LiDAR-derived planar metrics and vertical profiles via cross-attention, and an RGBI-only SegFormer student learns to reproduce these outputs. Trained on 384 $km^2$ of forests in Saxony, Germany (20 cm ground sampling distance (GSD)) and evaluated on eight geographically distinct test tiles, the student achieves state-of-the-art (SOTA) zero-shot CHM performance (MedAE 4.17 m, $R^2$=0.51, IoU 0.87), outperforming HRCHM/DAC baselines by 29–46% in MAE (5.81 m vs. 8.14–10.84 m) with stronger correlation coefficients (0.713 vs. 0.166–0.652). Ablations show that multi-modal fusion improves performance by 10–26% over RGBI-only training, and that asymmetric distillation with appropriate model capacity is critical. The method jointly predicts CHM, PAI, and FHD, a multi-metric capability not provided by current monocular CHM estimators, although PAI/FHD transfer remains region-dependent and benefits from local calibration. The framework also remains effective under temporal mismatch (winter LiDAR, summer RGBI), removing strict co-acquisition constraints and enabling scalable 20 cm operational monitoring for workflows such as Digital Twin Germany and national Digital Orthophoto programs.
[165] GardenDesigner: Encoding Aesthetic Principles into Jiangnan Garden Construction via a Chain of Agents
Mengtian Li, Fan Yang, Ruixue Xiong, Yiyan Fan, Zhifeng Xie, Zeyu Wang
Main category: cs.CV
TL;DR: GardenDesigner: AI framework for procedural generation of Jiangnan gardens using aesthetic principles and agent chains for terrain, pathways, and asset layout optimization.
Details
Motivation: Manual modeling of Jiangnan gardens is time-consuming and requires expert knowledge, limiting their use as digital assets for film, games, and digital tourism.Method: Proposed GardenDesigner framework encodes aesthetic principles for Jiangnan garden construction using chain of agents: terrain distribution, road generation, asset selection, and layout optimization agents, integrated with GardenVerse knowledge base.
Result: Framework generates diverse and aesthetically pleasing Jiangnan gardens within one minute via text input, validated through experiments and human evaluations.
Conclusion: GardenDesigner enables non-expert users to efficiently create authentic Jiangnan gardens for digital applications, bridging traditional aesthetics with procedural generation.
Abstract: Jiangnan gardens, a prominent style of Chinese classical gardens, hold great potential as digital assets for film and game production and digital tourism. However, manual modeling of Jiangnan gardens heavily relies on expert experience for layout design and asset creation, making the process time-consuming. To address this gap, we propose GardenDesigner, a novel framework that encodes aesthetic principles for Jiangnan garden construction and integrates a chain of agents based on procedural modeling. The water-centric terrain and explorative pathway rules are applied by terrain distribution and road generation agents. Selection and spatial layout of garden assets follow the aesthetic and cultural constraints. Consequently, we propose asset selection and layout optimization agents to select and arrange objects for each area in the garden. Additionally, we introduce GardenVerse for Jiangnan garden construction, including expert-annotated garden knowledge to enhance the asset arrangement process. To enable interaction and editing, we develop an interactive interface and tools in Unity, in which non-expert users can construct Jiangnan gardens via text input within one minute. Experiments and human evaluations demonstrate that GardenDesigner can generate diverse and aesthetically pleasing Jiangnan gardens. Project page is available at https://monad-cube.github.io/GardenDesigner.
[166] PTC-Depth: Pose-Refined Monocular Depth Estimation with Temporal Consistency
Leezy Han, Seunggyu Kim, Dongseok Shim, Hyeonbeom Lee
Main category: cs.CV
TL;DR: Consistency-aware monocular depth estimation framework using wheel odometry and recursive Bayesian scaling for temporal stability in autonomous systems.
Details
Motivation: Existing monocular depth estimation methods lack temporal consistency across consecutive frames, causing jitter and failures during abrupt depth range changes, which is problematic for autonomous vehicle and robot perception systems.Method: Leverages wheel odometry to estimate camera pose and sparse depth via triangulation using optical flow between frames. Uses sparse depth to update recursive Bayesian metric scale estimation, then rescales relative depth from pre-trained foundation model.
Result: Evaluated on KITTI, TartanAir, MS2, and custom datasets, demonstrating robust and accurate depth estimation with improved temporal consistency.
Conclusion: Proposed framework effectively addresses temporal inconsistency in monocular depth estimation by integrating wheel odometry and Bayesian scaling, providing stable depth predictions for autonomous systems.
Abstract: Monocular depth estimation (MDE) has been widely adopted in the perception systems of autonomous vehicles and mobile robots. However, existing approaches often struggle to maintain temporal consistency in depth estimation across consecutive frames. This inconsistency not only causes jitter but can also lead to estimation failures when the depth range changes abruptly. To address these challenges, this paper proposes a consistency-aware monocular depth estimation framework that leverages wheel odometry from a mobile robot to achieve stable and coherent depth predictions over time. Specifically, we estimate camera pose and sparse depth from triangulation using optical flow between consecutive frames. The sparse depth estimates are used to update a recursive Bayesian estimate of the metric scale, which is then applied to rescale the relative depth predicted by a pre-trained depth estimation foundation model. The proposed method is evaluated on the KITTI, TartanAir, MS2, and our own dataset, demonstrating robust and accurate depth estimation performance.
[167] A deep learning pipeline for PAM50 subtype classification using histopathology images and multi-objective patch selection
Arezoo Borji, Gernot Kronreif, Bernhard Angermayr, Francisco Mario Calisto, Wolfgang Birkfellner, Inna Servetnyk, Yinyin Yuan, Sepideh Hatamikia
Main category: cs.CV
TL;DR: A deep learning framework that predicts breast cancer PAM50 subtypes from H&E-stained whole-slide images using optimization-guided patch selection to reduce computational costs while maintaining high accuracy.
Details
Motivation: To develop a more accessible alternative to expensive molecular assays for breast cancer subtyping by using histopathology images, reducing costs and computational burden while maintaining clinical-grade accuracy.Method: Combines NSGA-II optimization algorithm with Monte Carlo dropout uncertainty estimation to select informative, diverse, and uncertain patches from whole-slide images, using ResNet18 for feature extraction and custom CNN for classification.
Result: Achieved F1-score of 0.8812 and AUC of 0.9841 on internal TCGA-BRCA dataset, and F1-score of 0.7952 and AUC of 0.9512 on external CPTAC-BRCA validation dataset.
Conclusion: The optimization-guided patch selection method enables efficient and accurate PAM50 classification from histopathology images, potentially serving as a scalable imaging-based replacement for molecular assays in clinical decision-making.
Abstract: Breast cancer is a highly heterogeneous disease with diverse molecular profiles. The PAM50 gene signature is widely recognized as a standard for classifying breast cancer into intrinsic subtypes, enabling more personalized treatment strategies. In this study, we introduce a novel optimization-driven deep learning framework that aims to reduce reliance on costly molecular assays by directly predicting PAM50 subtypes from H&E-stained whole-slide images (WSIs). Our method jointly optimizes patch informativeness, spatial diversity, uncertainty, and patch count by combining the non-dominated sorting genetic algorithm II (NSGA-II) with Monte Carlo dropout-based uncertainty estimation. The proposed method can identify a small but highly informative patch subset for classification. We used a ResNet18 backbone for feature extraction and a custom CNN head for classification. For evaluation, we used the internal TCGA-BRCA dataset as the training cohort and the external CPTAC-BRCA dataset as the test cohort. On the internal dataset, an F1-score of 0.8812 and an AUC of 0.9841 using 627 WSIs from the TCGA-BRCA cohort were achieved. The performance of the proposed approach on the external validation dataset showed an F1-score of 0.7952 and an AUC of 0.9512. These findings indicate that the proposed optimization-guided, uncertainty-aware patch selection can achieve high performance and improve the computational efficiency of histopathology-based PAM50 classification compared to existing methods, suggesting a scalable imaging-based replacement that has the potential to support clinical decision-making.
[168] STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering
Emad Bahrami, Olga Zatsarynna, Parth Pathak, Sunando Sengupta, Juergen Gall, Mohsen Fayyaz
Main category: cs.CV
TL;DR: STRIVE is a reinforcement learning framework for video QA that creates multiple spatiotemporal video variants and performs joint normalization across textual generations and visual variants to improve reward signals and stabilize policy updates.
Details
Motivation: Existing group-based policy optimization methods for large multimodal models suffer from low reward variance when responses have similar correctness, leading to weak or unstable advantage estimates in video question answering tasks.Method: Constructs multiple spatiotemporal variants of input videos, performs joint normalization across textual generations and visual variants, and uses importance-aware sampling to prioritize question-relevant frames while maintaining temporal coverage.
Result: Demonstrates consistent improvements over strong RL baselines across six challenging video reasoning benchmarks (VideoMME, TempCompass, VideoMMMU, MMVU, VSI-Bench, PerceptionTest) with multiple large multimodal models.
Conclusion: Structured spatiotemporal exploration serves as a principled mechanism for stabilizing multimodal reinforcement learning and improving video reasoning performance in large multimodal models.
Abstract: We introduce STRIVE (SpatioTemporal Reinforcement with Importance-aware Variant Exploration), a structured reinforcement learning framework for video question answering. While group-based policy optimization methods have shown promise in large multimodal models, they often suffer from low reward variance when responses exhibit similar correctness, leading to weak or unstable advantage estimates. STRIVE addresses this limitation by constructing multiple spatiotemporal variants of each input video and performing joint normalization across both textual generations and visual variants. By expanding group comparisons beyond linguistic diversity to structured visual perturbations, STRIVE enriches reward signals and promotes more stable and informative policy updates. To ensure exploration remains semantically grounded, we introduce an importance-aware sampling mechanism that prioritizes frames most relevant to the input question while preserving temporal coverage. This design encourages robust reasoning across complementary visual perspectives rather than overfitting to a single spatiotemporal configuration. Experiments on six challenging video reasoning benchmarks including VideoMME, TempCompass, VideoMMMU, MMVU, VSI-Bench, and PerceptionTest demonstrate consistent improvements over strong reinforcement learning baselines across multiple large multimodal models. Our results highlight the role of structured spatiotemporal exploration as a principled mechanism for stabilizing multimodal reinforcement learning and improving video reasoning performance.
[169] SafeRoPE: Risk-specific Head-wise Embedding Rotation for Safe Generation in Rectified Flow Transformers
Xiang Yang, Feifei Li, Mi Zhang, Geng Hong, Xiaoyu You, Min Yang
Main category: cs.CV
TL;DR: SafeRoPE is a lightweight framework that suppresses unsafe content in transformer-based diffusion models by perturbing Rotary Positional Embeddings in safety-critical attention heads.
Details
Motivation: Current text-to-image models (like SD3, FLUX) are vulnerable to unsafe semantics triggered by multi-token interactions. Existing safety methods are computationally expensive and designed for U-Net architectures, not suitable for transformer-based diffusion models like MMDiT.Method: Analyzes MMDiT attention to find unsafe semantics concentrate in low-dimensional subspaces at head level. Constructs head-wise unsafe subspaces, computes Latent Risk Scores, and introduces head-wise RoPE perturbations to suppress unsafe content while preserving image quality.
Result: SafeRoPE achieves state-of-the-art performance in balancing harmful content mitigation and utility preservation for safe generation in MMDiT models.
Conclusion: The proposed lightweight framework effectively suppresses unsafe outputs in transformer-based diffusion models while maintaining generation fidelity through targeted RoPE perturbations.
Abstract: Recent Text-to-Image (T2I) models based on rectified-flow transformers (e.g., SD3, FLUX) achieve high generative fidelity but remain vulnerable to unsafe semantics, especially when triggered by multi-token interactions. Existing mitigation methods largely rely on fine-tuning or attention modulation for concept unlearning; however, their expensive computational overhead and design tailored to U-Net-based denoisers hinder direct adaptation to transformer-based diffusion models (e.g., MMDiT). In this paper, we conduct an in-depth analysis of the attention mechanism in MMDiT and find that unsafe semantics concentrate within interpretable, low-dimensional subspaces at head level, where a finite set of safety-critical heads is responsible for unsafe feature extraction. We further observe that perturbing the Rotary Positional Embedding (RoPE) applied to the query and key vectors can effectively modify some specific concepts in the generated images. Motivated by these insights, we propose SafeRoPE, a lightweight and fine-grained safe generation framework for MMDiT. Specifically, SafeRoPE first constructs head-wise unsafe subspaces by decomposing unsafe embeddings within safety-critical heads, and computes a Latent Risk Score (LRS) for each input vector via projection onto these subspaces. We then introduce head-wise RoPE perturbations that can suppress unsafe semantics without degrading benign content or image quality. SafeRoPE combines both head-wise LRS and RoPE perturbations to perform risk-specific head-wise rotation on query and key vector embeddings, enabling precise suppression of unsafe outputs while maintaining generation fidelity. Extensive experiments demonstrate that SafeRoPE achieves SOTA performance in balancing effective harmful content mitigation and utility preservation for safe generation of MMDiT. Codes are available at https://github.com/deng12yx/SafeRoPE.
[170] Language-Pretraining-Induced Bias: A Strong Foundation for General Vision Tasks
Yaxin Luo, Zhiqiang Shen
Main category: cs.CV
TL;DR: A method for adapting language pre-trained models to vision tasks through bridge training, showing that partial adaptation of LLM parameters can effectively work for visual tasks without full fine-tuning.
Details
Motivation: The paper addresses the challenge of cross-modality adaptation between language and vision, noting that prior work assumes language pre-trained models are unsuitable for visual tasks due to different parameter spaces and outlier ratios. The authors challenge this assumption and aim to bridge the modality gap.Method: Proposes a bridge training stage with random label bridge training that requires no manual labeling. Uses partial bridge training where only certain layers of LLMs are adapted, leveraging the foundational properties of some layers that remain beneficial for visual tasks without fine-tuning.
Result: Shows that adding a bridge training stage can effectively align LLM parameters with vision tasks. Demonstrates that partial bridge training is often advantageous, as certain LLM layers exhibit strong foundational properties that work well for visual tasks without fine-tuning.
Conclusion: Language pre-trained models can be adapted to vision tasks through bridge training, and partial adaptation is often sufficient. This opens new avenues for leveraging language parameters in vision models and provides a practical pathway for cross-modality adaptation.
Abstract: The ratio of outlier parameters in language pre-training models and vision pre-training models differs significantly, making cross-modality (language and vision) inherently more challenging than cross-domain adaptation. As a result, many prior studies have focused on cross-domain transfer rather than attempting to bridge language and vision modalities, assuming that language pre-trained models are unsuitable for downstream visual tasks due to disparate parameter spaces. Contrary to this assumption, we show that adding a bridge training stage as a modality adaptation learner can effectively align Large Language Model (LLM) parameters with vision tasks. Specifically, we propose a simple yet powerful solution random label bridge training that requires no manual labeling and helps LLM parameters adapt to vision foundation tasks. Moreover, our findings reveal that partial bridge training is often advantageous, as certain layers in LLMs exhibit strong foundational properties that remain beneficial even without fine-tuning for visual tasks. This surprising discovery opens up new avenues for leveraging language pre-trained parameters directly within vision models and highlights the potential of partial bridge training as a practical pathway to cross-modality adaptation.
[171] Ranking-Guided Semi-Supervised Domain Adaptation for Severity Classification
Shota Harada, Ryoma Bise, Kiyohito Tanaka, Seiichi Uchida
Main category: cs.CV
TL;DR: A novel semi-supervised domain adaptation method for medical image severity classification using ranking with class order and continuous distribution alignment
Details
Motivation: Existing domain adaptation methods struggle with severity classification in medical images due to unclear class boundaries and naturally ordered class labels, which complicates adaptation across domainsMethod: Proposes two key components: 1) Cross-Domain Ranking that ranks sample pairs across domains using class order information, and 2) Continuous Distribution Alignment that aligns rank score distributions between source and target domains
Result: Experiments on ulcerative colitis and diabetic retinopathy classification validate the effectiveness of the approach, demonstrating successful alignment of class-specific rank score distributions
Conclusion: The proposed method effectively addresses domain adaptation challenges in medical severity classification by leveraging class order information through ranking-based alignment
Abstract: Semi-supervised domain adaptation leverages a few labeled and many unlabeled target samples, making it promising for addressing domain shifts in medical image analysis. However, existing methods struggle with severity classification due to unclear class boundaries. Severity classification involves naturally ordered class labels, complicating adaptation. We propose a novel method that aligns source and target domains using rank scores learned via ranking with class order. Specifically, Cross-Domain Ranking ranks sample pairs across domains, while Continuous Distribution Alignment aligns rank score distributions. Experiments on ulcerative colitis and diabetic retinopathy classification validate the effectiveness of our approach, demonstrating successful alignment of class-specific rank score distributions.
[172] Semantic Segmentation of Textured Non-manifold 3D Meshes using Transformers
Mohammadreza Heidarianbaei, Max Mehltretter, Franz Rottensteiner
Main category: cs.CV
TL;DR: A texture-aware transformer for semantic segmentation of 3D textured meshes that processes both geometric and textural information through a hierarchical learning scheme with Two-Stage Transformer Blocks.
Details
Motivation: Textured 3D meshes contain both geometry and appearance information, but existing deep learning methods for mesh segmentation typically ignore the rich textural information, focusing only on geometric features.Method: Introduces a texture-aware transformer that learns directly from raw pixels associated with each mesh face. Uses a texture branch to summarize face-level pixels into learnable tokens, fuses them with geometric descriptors, and processes through Two-Stage Transformer Blocks for local and global information flow. Includes hierarchical learning for multi-scale feature aggregation.
Result: Achieves 81.9% mF1 and 94.3% OA on Semantic Urban Meshes (SUM) benchmark, and 49.7% mF1 and 72.8% OA on a new cultural-heritage dataset of textured roof tiles with triangle-level damage annotations, substantially outperforming existing approaches.
Conclusion: The proposed texture-aware transformer effectively leverages both geometric and textural information for semantic segmentation of 3D meshes, demonstrating superior performance on urban and cultural heritage datasets compared to existing methods.
Abstract: Textured 3D meshes jointly represent geometry, topology, and appearance, yet their irregular structure poses significant challenges for deep-learning-based semantic segmentation. While a few recent methods operate directly on meshes without imposing geometric constraints, they typically overlook the rich textural information also provided by such meshes. We introduce a texture-aware transformer that learns directly from raw pixels associated with each mesh face, coupled with a new hierarchical learning scheme for multi-scale feature aggregation. A texture branch summarizes all face-level pixels into a learnable token, which is fused with geometrical descriptors and processed by a stack of Two-Stage Transformer Blocks (TSTB), which allow for both a local and a global information flow. We evaluate our model on the Semantic Urban Meshes (SUM) benchmark and a newly curated cultural-heritage dataset comprising textured roof tiles with triangle-level annotations for damage types. Our method achieves 81.9% mF1 and 94.3% OA on SUM and 49.7% mF1 and 72.8% OA on the new dataset, substantially outperforming existing approaches.
[173] Investigating Permutation-Invariant Discrete Representation Learning for Spatially Aligned Images
Jamie S. J. Stirling, Noura Al-Moubayed, Hubert P. H. Shum
Main category: cs.CV
TL;DR: PI-VQ introduces permutation-invariant vector quantization that removes positional information from discrete image representations, enabling direct interpolation and novel image synthesis without learned priors.
Details
Motivation: Current vector quantization methods (VQ-VAE, VQ-GAN) create position-dependent representations where codes are spatially arranged and contextually entangled, requiring complex autoregressive or diffusion priors for sampling. The authors question whether positional information is necessary for discrete representations of spatially aligned data.Method: Proposes permutation-invariant vector-quantized autoencoder (PI-VQ) that constrains latent codes to carry no positional information. Introduces matching quantization based on optimal bipartite matching to increase bottleneck capacity by 3.5× relative to naive nearest-neighbor quantization. Enables interpolation-based sampling for novel image synthesis in a single forward pass.
Result: PI-VQ learns codes that capture global, semantic features and enables direct interpolation between images without learned priors. Evaluated on CelebA, CelebA-HQ and FFHQ, obtaining competitive precision, density and coverage metrics for synthesized images.
Conclusion: Permutation-invariant representations offer trade-offs in separability and interpretability of latent codes, pointing to future research directions. The approach demonstrates that positional information may not be necessary for effective discrete image representations.
Abstract: Vector quantization approaches (VQ-VAE, VQ-GAN) learn discrete neural representations of images, but these representations are inherently position-dependent: codes are spatially arranged and contextually entangled, requiring autoregressive or diffusion-based priors to model their dependencies at sample time. In this work, we ask whether positional information is necessary for discrete representations of spatially aligned data. We propose the permutation-invariant vector-quantized autoencoder (PI-VQ), in which latent codes are constrained to carry no positional information. We find that this constraint encourages codes to capture global, semantic features, and enables direct interpolation between images without a learned prior. To address the reduced information capacity of permutation-invariant representations, we introduce matching quantization, a vector quantization algorithm based on optimal bipartite matching that increases effective bottleneck capacity by $3.5\times$ relative to naive nearest-neighbour quantization. The compositional structure of the learned codes further enables interpolation-based sampling, allowing synthesis of novel images in a single forward pass. We evaluate PI-VQ on CelebA, CelebA-HQ and FFHQ, obtaining competitive precision, density and coverage metrics for images synthesised with our approach. We discuss the trade-offs inherent to position-free representations, including separability and interpretability of the latent codes, pointing to numerous directions for future work.
[174] FaCT-GS: Fast and Scalable CT Reconstruction with Gaussian Splatting
Pawel Tomasz Pieta, Rasmus Juul Pedersen, Sina Borgi, Jakob Sauer Jørgensen, Jens Wenzel Andreasen, Vedrana Andersen Dahl
Main category: cs.CV
TL;DR: FaCT-GS is a fast Gaussian Splatting framework for CT reconstruction that optimizes voxelization and rasterization pipelines, achieving 4-13x speedup over previous GS methods while enabling warm-start reconstruction from existing volumes.
Details
Motivation: While Gaussian Splatting shows promise for CT reconstruction, its benefits haven't been substantial enough to justify replacing established algorithms. The paper aims to address the speed and scalability limitations of GS-based approaches for CT reconstruction.Method: Introduces FaCT-GS with optimized voxelization and rasterization pipelines, enabling rapid fitting of Gaussians to existing volumes for warm-start reconstruction and compressed representation. The method scales well with projection and output volume size.
Result: FaCT-GS achieves over 4x speedup on standard 512x512 projections and over 13x speedup on 2k projections compared to state-of-the-art GS CT reconstruction methods.
Conclusion: FaCT-GS significantly improves the speed and scalability of Gaussian Splatting for CT reconstruction, making it more practical for real-world applications while maintaining reconstruction quality.
Abstract: Gaussian Splatting (GS) has emerged as a dominating technique for image rendering and has quickly been adapted for the X-ray Computed Tomography (CT) reconstruction task. However, despite being on par or better than many of its predecessors, the benefits of GS are typically not substantial enough to motivate a transition from well-established reconstruction algorithms. This paper addresses the most significant remaining limitations of the GS-based approach by introducing FaCT-GS, a framework for fast and flexible CT reconstruction. Enabled by an in-depth optimization of the voxelization and rasterization pipelines, our new method is significantly faster than its predecessors and scales well with projection and output volume size. Furthermore, the improved voxelization enables rapid fitting of Gaussians to pre-existing volumes, which can serve as a prior for warm-starting the reconstruction, or simply as an alternative, compressed representation. FaCT-GS is over 4X faster than the State of the Art GS CT reconstruction on standard 512x512 projections, and over 13X faster on 2k projections. Implementation available at: https://github.com/PaPieta/fact-gs.
[175] Semantic Richness or Geometric Reasoning? The Fragility of VLM’s Visual Invariance
Jason Qiu, Zachary Meurer, Xavier Thomas, Deepti Ghadiyaram
Main category: cs.CV
TL;DR: VLMs lack robust spatial invariance/equivariance under basic geometric transformations like rotations and scaling, failing to maintain object identity despite strong semantic understanding.
Details
Motivation: To investigate the fundamental fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations, revealing gaps between semantic understanding and spatial reasoning.Method: Systematic evaluation across diverse visual domains (symbolic sketches, natural photographs, abstract art) with geometric transformations (rotations, scaling, identity transformations) across different architectures, model capacities, and prompting strategies.
Result: VLMs exhibit systematic failures in spatial invariance/equivariance, with sharp performance drops as semantic content becomes sparse; this behavior persists across all tested architectures and configurations.
Conclusion: Current VLMs have a systematic gap between semantic understanding and spatial reasoning, highlighting the need for stronger geometric grounding in future multimodal systems.
Abstract: This work investigates the fundamental fragility of state-of-the-art Vision-Language Models (VLMs) under basic geometric transformations. While modern VLMs excel at semantic tasks such as recognizing objects in canonical orientations and describing complex scenes, they exhibit systematic failures at a more fundamental level: lack of robust spatial invariance and equivariance required to reliably determine object identity under simple rotations, scaling, and identity transformations. We demonstrate this limitation through a systematic evaluation across diverse visual domains, including symbolic sketches, natural photographs, and abstract art. Performance drops sharply as semantic content becomes sparse, and this behavior is observed across architectures, model capacities, and prompting strategies. Overall, our results reveal a systematic gap between semantic understanding and spatial reasoning in current VLMs, highlighting the need for stronger geometric grounding in future multimodal systems.
[176] HieraVid: Hierarchical Token Pruning for Fast Video Large Language Models
Yansong Guo, Chaoyang Zhu, Jiayi Ji, Jianghang Lin, Liujuan Cao
Main category: cs.CV
TL;DR: HieraVid is a hierarchical pruning framework that reduces computational burden in VideoLLMs by progressively pruning video tokens at segment, frame, and layer levels while maintaining performance.
Details
Motivation: VideoLLMs face significant computational burdens due to massive input video tokens. Existing methods prune tokens at input level but neglect the inherent information structure in videos and LLMs.Method: Hierarchical pruning framework with three levels: 1) segment-level: temporal segmentation and spatial merging, 2) frame-level: joint pruning of similar frames within segments to preserve diversity, 3) layer-level: redundancy reduction as LLM layer increases without compromising performance.
Result: With only 30% of tokens retained, HieraVid achieves new state-of-the-art performance on four video understanding benchmarks, maintaining over 98% and 99% of performance of LLaVA-Video-7B and LLaVA-OneVision-7B respectively.
Conclusion: HieraVid effectively reduces computational burden in VideoLLMs through hierarchical pruning while preserving performance, demonstrating the importance of considering video structure and LLM information flow in token pruning.
Abstract: Video Large Language Models (VideoLLMs) have demonstrated impressive capabilities in video understanding, yet the massive number of input video tokens incurs a significant computational burden for deployment. Existing methods mainly prune video tokens at input level while neglecting the inherent information structure embedded in videos and large language models (LLMs). To address this, we propose HieraVid, a hierarchical pruning framework that progressively and dynamically reduces visual redundancy. Based on two observations that videos possess the segment-frame structure and LLMs internally propagate multi-modal information unidirectionally, we decompose pruning into three levels: 1) segment-level, where video tokens are first temporally segmented and spatially merged; 2) frame-level, where similar frames within the same segment are jointly pruned to preserve diversity; 3) layer-level, redundancy gradually shrinks as LLM layer increases w/o compromising performance. We conduct extensive experiments on four widely used video understanding benchmarks to comprehensively evaluate the effectiveness of HieraVid. Remarkably, with only 30% of tokens retained, HieraVid achieves new state-of-the-art performance, while maintaining over 98% and 99% of the performance of LLaVA-Video-7B and LLaVA-OneVision-7B, respectively.
[177] Combining Boundary Supervision and Segment-Level Regularization for Fine-Grained Action Segmentation
Hinako Mitsuoka, Kazuhiro Hotta
Main category: cs.CV
TL;DR: Lightweight dual-loss training framework for Temporal Action Segmentation that improves segmentation quality with minimal architectural changes using boundary regression and CDF-based regularization losses.
Details
Motivation: Recent TAS methods rely on complex architectures that hinder practical deployment. The authors aim to improve fine-grained segmentation quality through simple training-time modifications rather than heavier architectures.Method: Proposes a dual-loss framework with: 1) boundary-regression loss for accurate temporal localization via single-channel boundary prediction, and 2) CDF-based segment-level regularization loss that matches cumulative distributions over predicted and ground-truth segments. The framework is architecture-agnostic and integrates into existing TAS models.
Result: Improves segment-level consistency and boundary quality across three benchmark datasets, yielding higher F1 and Edit scores across three different models (MS-TCN, C2F-TCN, FACT). Frame-wise accuracy remains largely unchanged.
Conclusion: Precise temporal action segmentation can be achieved through simple loss design rather than heavier architectures or inference-time refinements, offering a lightweight solution for practical deployment.
Abstract: Recent progress in Temporal Action Segmentation (TAS) has increasingly relied on complex architectures, which can hinder practical deployment. We present a lightweight dual-loss training framework that improves fine-grained segmentation quality with only one additional output channel and two auxiliary loss terms, requiring minimal architectural modification. Our approach combines a boundary-regression loss that promotes accurate temporal localization via a single-channel boundary prediction and a CDF-based segment-level regularization loss that encourages coherent within-segment structure by matching cumulative distributions over predicted and ground-truth segments. The framework is architecture-agnostic and can be integrated into existing TAS models (e.g., MS-TCN, C2F-TCN, FACT) as a training-time loss function. Across three benchmark datasets, the proposed method improves segment-level consistency and boundary quality, yielding higher F1 and Edit scores across three different models. Frame-wise accuracy remains largely unchanged, highlighting that precise segmentation can be achieved through simple loss design rather than heavier architectures or inference-time refinements.
[178] MAR-MAER: Metric-Aware and Ambiguity-Adaptive Autoregressive Image Generation
Kai Dong, Tingting Bai
Main category: cs.CV
TL;DR: MAR-MAER is a hierarchical autoregressive framework for text-to-image generation that improves image quality and handles ambiguous prompts through metric-aware embedding regularization and probabilistic latent modeling.
Details
Motivation: Address two major challenges in autoregressive text-to-image generation: 1) generated images often don't meet human quality standards, and 2) difficulty handling ambiguous prompts that could be interpreted in multiple valid ways.Method: Combines metric-aware embedding regularization (using lightweight projection head with adaptive kernel regression loss to align representations with human-preferred metrics like CLIPScore and HPSv2) with conditional variational module for controlled randomness in hierarchical token generation.
Result: Outperforms baseline Hi-MAR model with +1.6 improvement in CLIPScore and +5.3 in HPSv2, produces wider range of outputs for ambiguous inputs, confirmed through both human evaluation and automated metrics on COCO and new Ambiguous-Prompt Benchmark.
Conclusion: MAR-MAER achieves excellent performance in both metric consistency and semantic flexibility, effectively addressing quality and ambiguity challenges in autoregressive text-to-image generation.
Abstract: Autoregressive (AR) models have demonstrated significant success in the realm of text-to-image generation. However, they usually face two major challenges. Firstly, the generated images may not always meet the quality standards expected by humans. Furthermore, these models face difficulty when dealing with ambiguous prompts that could be interpreted in several valid ways. To address these issues, we introduce MAR-MAER, an innovative hierarchical autoregressive framework. It combines two main components. It is a metric-aware embedding regularization method. The other one is a probabilistic latent model used for handling ambiguous semantics. Our method utilizes a lightweight projection head, which is trained with an adaptive kernel regression loss function. This aligns the model’s internal representations with human-preferred quality metrics, such as CLIPScore and HPSv2. As a result, the embedding space that is learned more accurately reflects human judgment. We are also introducing a conditional variational module. This approach incorporates an aspect of controlled randomness within the hierarchical token generation process. This capability allows the model to produce a diverse array of coherent images based on ambiguous or open-ended prompts. We conducted extensive experiments using COCO and a newly developed Ambiguous-Prompt Benchmark. The results show that MAR-MAER achieves excellent performance in both metric consistency and semantic flexibility. It exceeds the baseline Hi-MAR model’s performance, showing an improvement of +1.6 in CLIPScore and +5.3 in HPSv2. For unclear inputs, it produces a notably wider range of outputs. These findings have been confirmed through both human evaluation and automated metrics.
[179] GeoAI Agency Primitives
Akram Zaytar, Rohan Sawahn, Caleb Robinson, Gilles Q. Hacheme, Girmaw A. Tadesse, Inbal Becker-Reshef, Rahul Dodhia, Juan Lavista Ferres
Main category: cs.CV
TL;DR: Proposes 9 agency primitives for GeoAI assistants to bridge foundation models with GIS workflows, focusing on human-in-the-loop collaboration rather than just model capabilities.
Details
Motivation: Despite advances in satellite image captioning, VQA, and segmentation, these capabilities haven't translated to productivity gains for GIS practitioners who work with vector layers, raster maps, and cartographic products. The gap is not just model capability but the absence of an agency layer for iterative collaboration.Method: Proposes a vocabulary of 9 agency primitives including navigation, perception, geo-referenced memory, and dual modeling. Also introduces a benchmark to measure human productivity in GIS workflows.
Result: Ongoing research - no specific results reported in the abstract, but the framework aims to make agentic assistance in GIS implementable, testable, and comparable.
Conclusion: The paper aims to develop agency primitives that connect foundation models to real-world GIS workflows, addressing the gap between AI capabilities and practical productivity gains for practitioners.
Abstract: We present ongoing research on agency primitives for GeoAI assistants – core capabilities that connect Foundation models to the artifact-centric, human-in-the-loop workflows where GIS practitioners actually work. Despite advances in satellite image captioning, visual question answering, and promptable segmentation, these capabilities have not translated into productivity gains for practitioners who spend most of their time producing vector layers, raster maps, and cartographic products. The gap is not model capability alone but the absence of an agency layer that supports iterative collaboration. We propose a vocabulary of $9$ primitives for such a layer – including navigation, perception, geo-referenced memory, and dual modeling – along with a benchmark that measures human productivity. Our goal is a vocabulary that makes agentic assistance in GIS implementable, testable, and comparable.
[180] A3R: Agentic Affordance Reasoning via Cross-Dimensional Evidence in 3D Gaussian Scenes
Di Li, Jie Feng, Guanbin Li, Ronghua Shang, Yuhui Zheng, Weisheng Dong, Guangming Shi
Main category: cs.CV
TL;DR: A3R: Agentic affordance reasoning framework that reformulates 3D affordance reasoning as sequential evidence acquisition using MLLM-based policy for cross-dimensional (3D geometric + 2D semantic) evidence gathering in complex 3D Gaussian scenes.
Details
Motivation: Existing one-shot affordance prediction methods fail in complex 3D scenes due to incomplete task-relevant evidence under fixed observations, not weak prediction capacity. Need sequential evidence acquisition to progressively reduce ambiguity.Method: Reformulates affordance reasoning as sequential evidence acquisition process. Uses MLLM-based policy to iteratively select evidence acquisition actions and update affordance belief through cross-dimensional evidence (3D geometric + 2D semantic). Introduces GRPO-based policy learning to optimize sequential decision making.
Result: Extensive experiments on scene-level benchmarks show A3R consistently surpasses static one-shot baselines, demonstrating advantage of agentic cross-dimensional evidence acquisition for fine-grained affordance reasoning in complex 3D Gaussian scenes.
Conclusion: Agentic cross-dimensional evidence acquisition framework effectively addresses limitations of static one-shot methods for fine-grained affordance reasoning in complex 3D environments.
Abstract: Affordance reasoning in 3D Gaussian scenes aims to identify the region that supports the action specified by a given text instruction in complex environments. Existing methods typically cast this problem as one-shot prediction from static scene observations, assuming sufficient evidence is already available for reasoning. However, in complex 3D scenes, many failure cases arise not from weak prediction capacity, but from incomplete task-relevant evidence under fixed observations. To address this limitation, we reformulate fine-grained affordance reasoning as a sequential evidence acquisition process, where ambiguity is progressively reduced through complementary 3D geometric and 2D semantic evidence. Building on this formulation, we propose A3R, an agentic affordance reasoning framework that enables an MLLM-based policy to iteratively select evidence acquisition actions and update the affordance belief through cross-dimensional evidence acquisition. To optimize such sequential decision making, we further introduce a GRPO-based policy learning strategy that improves evidence acquisition efficiency and reasoning accuracy. Extensive experiments on scene-level benchmarks show that A3R consistently surpasses static one-shot baselines, demonstrating the advantage of agentic cross-dimensional evidence acquisition for fine-grained affordance reasoning in complex 3D Gaussian scenes.
[181] GS^2: Graph-based Spatial Distribution Optimization for Compact 3D Gaussian Splatting
Xianben Yang, Tao Wang, Yuxuan Li, Yi Jin, Haibin Ling
Main category: cs.CV
TL;DR: GS^2 optimizes 3D Gaussian Splatting for memory efficiency by introducing adaptive densification, progressive pruning, and graph-based feature encoding to reduce Gaussian points while maintaining rendering quality.
Details
Motivation: 3D Gaussian Splatting (3DGS) has excellent rendering performance but suffers from high memory costs due to many Gaussian points. Existing pruning methods compromise spatial consistency and cause artifacts.Method: Proposes GS^2 with three components: 1) ELBO-based adaptive densification strategy for automatic control, 2) opacity-aware progressive pruning to remove low-opacity points, and 3) graph-based feature encoding module for feature-guided point shifting to optimize spatial distribution.
Result: Achieves higher PSNR than 3DGS with only ~12.5% of Gaussian points. Outperforms all baselines in both rendering quality and memory efficiency.
Conclusion: GS^2 provides a compact Gaussian representation with superior rendering quality, addressing the memory limitations of 3DGS while maintaining spatial consistency.
Abstract: 3D Gaussian Splatting (3DGS) has demonstrated breakthrough performance in novel view synthesis and real-time rendering. Nevertheless, its practicality is constrained by the high memory cost due to a huge number of Gaussian points. Many pruning-based 3DGS variants have been proposed for memory saving, but often compromise spatial consistency and may lead to rendering artifacts. To address this issue, we propose graph-based spatial distribution optimization for compact 3D Gaussian Splatting (GS\textasciicircum2), which enhances reconstruction quality by optimizing the spatial distribution of Gaussian points. Specifically, we introduce an evidence lower bound (ELBO)-based adaptive densification strategy that automatically controls the densification process. In addition, an opacity-aware progressive pruning strategy is proposed to further reduce memory consumption by dynamically removing low-opacity Gaussian points. Furthermore, we propose a graph-based feature encoding module to adjust the spatial distribution via feature-guided point shifting. Extensive experiments validate that GS\textasciicircum2 achieves a compact Gaussian representation while delivering superior rendering quality. Compared with 3DGS, it achieves higher PSNR with only about 12.5% Gaussian points. Furthermore, it outperforms all compared baselines in both rendering quality and memory efficiency.
[182] Low-Effort Jailbreak Attacks Against Text-to-Image Safety Filters
Ahmed B Mustafa, Zihan Ye, Yang Lu, Michael P Pound, Shreyank N Gowda
Main category: cs.CV
TL;DR: Text-to-image models remain vulnerable to simple linguistic jailbreak attacks that bypass safety filters using natural language prompts without model access or optimization.
Details
Motivation: To investigate vulnerabilities in text-to-image generative models' safety filters and moderation pipelines, showing that current systems can be easily bypassed using only natural language prompts despite deployment in creative tools and online platforms.Method: Systematic study of prompt-based jailbreak strategies without model access, optimization, or adversarial training. Introduces taxonomy of visual jailbreak techniques: artistic reframing, material substitution, pseudo-educational framing, lifestyle aesthetic camouflage, and ambiguous action substitution.
Result: Simple linguistic modifications can reliably evade existing safeguards and produce restricted imagery across several state-of-the-art text-to-image systems, with attack success rates up to 74.47% across all tested models and attack categories.
Conclusion: There’s a critical gap between surface-level prompt filtering and semantic understanding required to detect adversarial intent in generative media systems, highlighting vulnerabilities in current safety approaches.
Abstract: Text-to-image generative models are widely deployed in creative tools and online platforms. To mitigate misuse, these systems rely on safety filters and moderation pipelines that aim to block harmful or policy violating content. In this work we show that modern text-to-image models remain vulnerable to low-effort jailbreak attacks that require only natural language prompts. We present a systematic study of prompt-based strategies that bypass safety filters without model access, optimization, or adversarial training. We introduce a taxonomy of visual jailbreak techniques including artistic reframing, material substitution, pseudo-educational framing, lifestyle aesthetic camouflage, and ambiguous action substitution. These strategies exploit weaknesses in prompt moderation and visual safety filtering by masking unsafe intent within benign semantic contexts. We evaluate these attacks across several state-of-the-art text-to-image systems and demonstrate that simple linguistic modifications can reliably evade existing safeguards and produce restricted imagery. Our findings highlight a critical gap between surface-level prompt filtering and the semantic understanding required to detect adversarial intent in generative media systems. Across all tested models and attack categories we observe an attack success rate (ASR) of up to 74.47%.
[183] ProVG: Progressive Visual Grounding via Language Decoupling for Remote Sensing Imagery
Ke Li, Ting Wang, Di Wang, Yongshan Zhu, Yiming Zhang, Tao Lei, Quan Wang
Main category: cs.CV
TL;DR: ProVG is a progressive vision-language framework for remote sensing visual grounding that decouples language expressions into global context, spatial relations, and object attributes, using a survey-locate-verify scheme for coarse-to-fine alignment.
Details
Motivation: Previous remote sensing visual grounding methods rely on sentence-level vision-language alignment, which struggles to exploit fine-grained linguistic cues like spatial relations and object attributes that are crucial for distinguishing similar objects. These cues play distinct roles across different grounding stages and should be leveraged accordingly.Method: ProVG decouples language expressions into three components: global context, spatial relations, and object attributes. It employs a progressive cross-modal modulator with a survey-locate-verify scheme for dynamic visual attention modulation. Additional components include a cross-scale fusion module to handle large-scale variations in remote sensing imagery, a language-guided calibration decoder for refining alignment, and a unified multi-task head supporting both referring expression comprehension and segmentation.
Result: Extensive experiments on RRSIS-D and RISBench benchmarks show that ProVG consistently outperforms existing methods, achieving new state-of-the-art performance in remote sensing visual grounding.
Conclusion: ProVG demonstrates that decoupling language expressions and progressively integrating linguistic cues through a survey-locate-verify scheme significantly improves localization accuracy in remote sensing visual grounding tasks.
Abstract: Remote sensing visual grounding (RSVG) aims to localize objects in remote sensing imagery according to natural language expressions. Previous methods typically rely on sentence-level vision-language alignment, which struggles to exploit fine-grained linguistic cues, such as \textit{spatial relations} and \textit{object attributes}, that are crucial for distinguishing objects with similar characteristics. Importantly, these cues play distinct roles across different grounding stages and should be leveraged accordingly to provide more explicit guidance. In this work, we propose \textbf{ProVG}, a novel RSVG framework that improves localization accuracy by decoupling language expressions into global context, spatial relations, and object attributes. To integrate these linguistic cues, ProVG employs a simple yet effective progressive cross-modal modulator, which dynamically modulates visual attention through a \textit{survey-locate-verify} scheme, enabling coarse-to-fine vision-language alignment. In addition, ProVG incorporates a cross-scale fusion module to mitigate the large-scale variations in remote sensing imagery, along with a language-guided calibration decoder to refine cross-modal alignment during prediction. A unified multi-task head further enables ProVG to support both referring expression comprehension and segmentation tasks. Extensive experiments on two benchmarks, \textit{i.e.}, RRSIS-D and RISBench, demonstrate that ProVG consistently outperforms existing methods, achieving new state-of-the-art performance.
[184] SHARC: Reference point driven Spherical Harmonic Representation for Complex Shapes
Panagiotis Sapoutzoglou, George Terzakis, Maria Pateraki
Main category: cs.CV
TL;DR: SHARC is a novel shape synthesis framework that uses spherical harmonic representations of distance fields anchored at optimally placed reference points to generate arbitrary, genus-agnostic shapes with high geometric fidelity and efficiency.
Details
Motivation: The paper aims to address the challenge of synthesizing arbitrary shapes with high geometric fidelity while maintaining computational efficiency. Existing methods often struggle with balancing reconstruction accuracy, time efficiency, and model parsimony, especially for complex genus-agnostic shapes.Method: SHARC uses a collection of Spherical Harmonic (SH) representations of distance fields anchored at optimally placed reference points. A cost function maximizes sparsity, centrality, and surface visibility for reference point placement. For each point, distance fields are sampled via ray-casting, SH coefficients are computed using Fast Spherical Harmonic Transform, and a configurable low-pass filter enhances geometric fidelity with local consistency constraints.
Result: SHARC outperforms state-of-the-art methods in both reconstruction accuracy and time efficiency without sacrificing model parsimony. The framework demonstrates superior performance in generating arbitrary, genus-agnostic shapes with high geometric fidelity.
Conclusion: SHARC provides an effective framework for shape synthesis that balances accuracy, efficiency, and model simplicity through optimized spherical harmonic representations and reference point placement strategies.
Abstract: We propose SHARC, a novel framework that synthesizes arbitrary, genus-agnostic shapes by means of a collection of Spherical Harmonic (SH) representations of distance fields. These distance fields are anchored at optimally placed reference points in the interior volume of the surface in a way that maximizes learning of the finer details of the surface. To achieve this, we employ a cost function that jointly maximizes sparsity and centrality in terms of positioning, as well as visibility of the surface from their location. For each selected reference point, we sample the visible distance field to the surface geometry via ray-casting and compute the SH coefficients using the Fast Spherical Harmonic Transform (FSHT). To enhance geometric fidelity, we apply a configurable low-pass filter to the coefficients and refine the output using a local consistency constraint based on proximity. Evaluation of SHARC against state-of-the-art methods demonstrates that the proposed method outperforms existing approaches in both reconstruction accuracy and time efficiency without sacrificing model parsimony. The source code is available at https://github.com/POSE-Lab/SHARC.
[185] FTPFusion: Frequency-Aware Infrared and Visible Video Fusion with Temporal Perturbation
Xilai Li, Chusheng Fang, Xiaosong Li
Main category: cs.CV
TL;DR: FTPFusion: A frequency-aware infrared and visible video fusion method using temporal perturbation and sparse cross-modal interaction to maintain both spatial detail and temporal stability.
Details
Motivation: Existing infrared and visible video fusion methods struggle to balance temporal stability with spatial detail preservation. Current approaches either focus on frame-wise enhancement with limited temporal modeling or use heavy spatio-temporal aggregation that sacrifices high-frequency details.Method: Decomposes feature representations into high-frequency and low-frequency components. High-frequency branch performs sparse cross-modal spatio-temporal interaction to capture motion context and complementary details. Low-frequency branch uses temporal perturbation strategy to enhance robustness against video variations. Includes offset-aware temporal consistency constraint to stabilize cross-frame representations.
Result: Extensive experiments on multiple public benchmarks show FTPFusion consistently outperforms state-of-the-art methods across multiple metrics in both spatial fidelity and temporal consistency.
Conclusion: FTPFusion effectively addresses the challenge of maintaining temporal stability while preserving spatial detail in infrared and visible video fusion through frequency-aware decomposition and temporal perturbation strategies.
Abstract: Infrared and visible video fusion plays a critical role in intelligent surveillance and low-light monitoring. However, maintaining temporal stability while preserving spatial detail remains a fundamental challenge. Existing methods either focus on frame-wise enhancement with limited temporal modeling or rely on heavy spatio-temporal aggregation that often sacrifices high-frequency details. In this paper, we propose FTPFusion, a frequency-aware infrared and visible video fusion method based on temporal perturbation and sparse cross-modal interaction. Specifically, FTPFusion decomposes the feature representations into high-frequency and low-frequency components for collaborative modeling. The high-frequency branch performs sparse cross-modal spatio-temporal interaction to capture motion-related context and complementary details. The low-frequency branch introduces a temporal perturbation strategy to enhance robustness against complex video variations, such as flickering, jitter, and local misalignment. Furthermore, we design an offset-aware temporal consistency constraint to explicitly stabilize cross-frame representations under temporal disturbances. Extensive experiments on multiple public benchmarks demonstrate that FTPFusion consistently outperforms state-of-the-art methods across multiple metrics in both spatial fidelity and temporal consistency. The source code will be available at https://github.com/ixilai/FTPFusion.
[186] Light-ResKAN: A Parameter-Sharing Lightweight KAN with Gram Polynomials for Efficient SAR Image Recognition
Pan Yi, Weijie Li, Xiaodong Chen, Jiehua Zhang, Li Liu, Yongxiang Liu
Main category: cs.CV
TL;DR: Light-ResKAN: A lightweight SAR image recognition model using Kolmogorov-Arnold Networks with Gram Polynomial activations and parameter sharing for edge deployment.
Details
Motivation: SAR image recognition is important for various applications but faces challenges with large image sizes on resource-constrained edge devices. Existing lightweight models struggle to balance high-precision feature extraction with low computational requirements.Method: Proposes Light-ResKAN which: 1) modifies ResNet by replacing convolutions with KAN convolutions for adaptive feature extraction, 2) uses Gram Polynomials as activations suited for SAR data, and 3) employs parameter-sharing strategy where each kernel shares parameters per channel to reduce parameters and FLOPs.
Result: Achieves 99.09%, 93.01%, and 97.26% accuracy on MSTAR, FUSAR-Ship, and SAR-ACD datasets. On MSTAR resized to 1024×1024, reduces FLOPs by 82.90× and parameters by 163.78× compared to VGG16.
Conclusion: Light-ResKAN establishes an efficient solution for edge SAR image recognition by balancing precision and efficiency through KAN-based architecture with specialized activations and parameter optimization.
Abstract: Synthetic Aperture Radar (SAR) image recognition is vital for disaster monitoring, military reconnaissance, and ocean observation. However, large SAR image sizes hinder deep learning deployment on resource-constrained edge devices, and existing lightweight models struggle to balance high-precision feature extraction with low computational requirements. The emerging Kolmogorov-Arnold Network (KAN) enhances fitting by replacing fixed activations with learnable ones, reducing parameters and computation. Inspired by KAN, we propose Light-ResKAN to achieve a better balance between precision and efficiency. First, Light-ResKAN modifies ResNet by replacing convolutions with KAN convolutions, enabling adaptive feature extraction for SAR images. Second, we use Gram Polynomials as activations, which are well-suited for SAR data to capture complex non-linear relationships. Third, we employ a parameter-sharing strategy: each kernel shares parameters per channel, preserving unique features while reducing parameters and FLOPs. Our model achieves 99.09%, 93.01%, and 97.26% accuracy on MSTAR, FUSAR-Ship, and SAR-ACD datasets, respectively. Experiments on MSTAR resized to $1024 \times 1024$ show that compared to VGG16, our model reduces FLOPs by $82.90 \times$ and parameters by $163.78 \times$. This work establishes an efficient solution for edge SAR image recognition.
[187] Lifting Unlabeled Internet-level Data for 3D Scene Understanding
Yixin Chen, Yaowei Zhang, Huangyue Yu, Junchao He, Yan Wang, Jiangyong Huang, Hongyu Shen, Junfeng Ni, Shaofei Wang, Baoxiong Jia, Song-Chun Zhu, Siyuan Huang
Main category: cs.CV
TL;DR: Automated data generation from web videos for 3D scene understanding, enabling zero-shot performance on tasks from object detection to vision-language navigation.
Details
Motivation: 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available online. The paper aims to leverage web-curated videos to automatically generate training data for 3D scene understanding models.Method: Develops data engines that automatically generate training data from unlabeled web videos. Identifies and analyzes bottlenecks in automated data generation, determining critical factors for efficient learning from unlabeled data.
Result: Models trained on generated data show strong zero-shot performance on 3D object detection, instance segmentation, 3D spatial VQA, and Vision-Language Navigation. Further improvement observed after fine-tuning.
Conclusion: Demonstrates viability of leveraging readily available web data as a path toward more capable scene understanding systems, bridging the gap between scarce annotated 3D data and abundant unlabeled video resources.
Abstract: Annotated 3D scene data is scarce and expensive to acquire, while abundant unlabeled videos are readily available on the internet. In this paper, we demonstrate that carefully designed data engines can leverage web-curated, unlabeled videos to automatically generate training data, to facilitate end-to-end models in 3D scene understanding alongside human-annotated datasets. We identify and analyze bottlenecks in automated data generation, revealing critical factors that determine the efficiency and effectiveness of learning from unlabeled data. To validate our approach across different perception granularities, we evaluate on three tasks spanning low-level perception, i.e., 3D object detection and instance segmentation, to high-evel reasoning, i.e., 3D spatial Visual Question Answering (VQA) and Vision-Lanugage Navigation (VLN). Models trained on our generated data demonstrate strong zero-shot performance and show further improvement after finetuning. This demonstrates the viability of leveraging readily available web data as a path toward more capable scene understanding systems.
[188] Night Eyes: A Reproducible Framework for Constellation-Based Corneal Reflection Matching
Virmarie Maquiling, Yasmeen Abdrabou, Enkelejda Kasneci
Main category: cs.CV
TL;DR: A geometry-driven pipeline for detecting and matching corneal reflection glints in eye tracking systems, treating glints as structured constellations rather than independent blobs.
Details
Motivation: Current glint detection methods are often heuristic-based and embedded within larger eye tracking systems, making them difficult to reproduce across different hardware setups. There's a need for a more systematic, reproducible approach to multi-glint detection and matching.Method: Proposes a 2D geometry-driven, constellation-based pipeline inspired by lost-in-space star identification. Uses Similarity-Layout Alignment (SLA) procedure adapted for multi-LED eye tracking constraints. Combines controlled over-detection, adaptive candidate fallback, appearance-aware scoring, and optional semantic layout priors while keeping detection and correspondence explicitly separated.
Result: Evaluated on a public multi-LED dataset, the system provides stable identity-preserving correspondence under noisy conditions. The framework enables transparent replication, comparison, and dataset annotation.
Conclusion: The constellation-based approach offers a reproducible and systematic method for glint detection and matching in eye tracking systems, with released code and tools to facilitate adoption and comparison.
Abstract: Corneal reflection (glint) detection plays an important role in pupil-corneal reflection (P-CR) eye tracking, but in practice it is often handled as heuristics embedded within larger systems, making reproducibility difficult across hardware setups. We introduce a 2D geometry-driven, constellation-based pipeline for mulit-glint detection and matching, focusing on reproducibility and clear evaluation. Inspired by lost-in-space star identification, we treat glints as structured constellations rather than independent blobs. We propose a Similarity-Layout Alignment (SLA) procedure which adapts constellation matching to the specific constraints of multi-LED eye tracking. The framework brings together controlled over-detection, adaptive candidate fallback, appearance-aware scoring, and optional semantic layout priors while keeping detection and correspondence explicitly separated. Evaluated on a public multi-LED dataset, the system provides stable identity-preserving correspondence under noisy conditions. We release code, presets, and evaluation scripts to enable transparent replication, comparison, and dataset annotation.
[189] Enhancing Medical Visual Grounding via Knowledge-guided Spatial Prompts
Yifan Gao, Tao Zhou, Yi Zhou, Ke Zou, Yizhe Zhang, Huazhu Fu
Main category: cs.CV
TL;DR: KnowMVG enhances medical visual grounding in VLMs using knowledge-enhanced prompting and global-local attention to improve spatial precision for localizing diagnostically relevant regions in medical images.
Details
Motivation: Current Vision-Language Models (VLMs) lack sufficient spatial precision for medical visual grounding due to reliance on latent embeddings without explicit localization priors, limiting their ability to provide precise visual evidence for clinical decision-making.Method: Proposes KnowMVG with: 1) Knowledge-enhanced prompting that encodes phrase-related medical knowledge into compact embeddings, and 2) Global-local attention that jointly leverages coarse global information and refined local cues to guide precise region localization without extra textual reasoning overhead.
Result: Extensive experiments on four MVG benchmarks show KnowMVG consistently outperforms existing approaches, achieving gains of 3.0% in AP50 and 2.6% in mIoU over prior state-of-the-art methods.
Conclusion: KnowMVG effectively bridges high-level semantic understanding and fine-grained visual perception in medical visual grounding, providing more precise spatial localization for clinical applications.
Abstract: Medical Visual Grounding (MVG) aims to identify diagnostically relevant phrases from free-text radiology reports and localize their corresponding regions in medical images, providing interpretable visual evidence to support clinical decision-making. Although recent Vision-Language Models (VLMs) exhibit promising multimodal reasoning ability, their grounding remains insufficient spatial precision, largely due to a lack of explicit localization priors when relying solely on latent embeddings. In this work, we analyze this limitation from an attention perspective and propose KnowMVG, a Knowledge-prior and global-local attention enhancement framework for MVG in VLMs that explicitly strengthens spatial awareness during decoding. Specifically, we present a knowledge-enhanced prompting strategy that encodes phrase related medical knowledge into compact embeddings, together with a global-local attention that jointly leverages coarse global information and refined local cues to guide precise region localization. localization. This design bridges high-level semantic understanding and fine-grained visual perception without introducing extra textual reasoning overhead. Extensive experiments on four MVG benchmarks demonstrate that our KnowMVG consistently outperforms existing approaches, achieving gains of 3.0% in AP50 and 2.6% in mIoU over prior state-of-the-art methods. Qualitative and ablation studies further validate the effectiveness of each component.
[190] Learning Spatial Structure from Pre-Beamforming Per-Antenna Range-Doppler Radar Data via Visibility-Aware Cross-Modal Supervision
George Sebastian, Philipp Berthold, Bianca Forkel, Leon Pohl, Mirko Maehlisch
Main category: cs.CV
TL;DR: Learning spatial structure directly from pre-beamforming per-antenna radar measurements without explicit angle-domain construction or hand-crafted signal processing.
Details
Motivation: Current automotive radar perception pipelines rely on beamforming to create angle-domain representations before applying learning models. This work investigates whether meaningful spatial structure can be learned directly from raw per-antenna range-Doppler measurements, bypassing traditional signal processing stages.Method: Uses a 6-TX x 8-RX automotive radar with A/B chirp-sequence FMCW transmit scheme. Implements dual-chirp shared-weight encoder trained end-to-end on pre-beamforming per-antenna RD tensors. Supervision is cross-modal from LiDAR with visibility-aware modeling of radar FOV and occlusion-aware observability. Conducts chirp ablations (A-only, B-only, A+B) and range-band analysis.
Result: Spatial structure can be learned directly from pre-beamforming per-antenna RD tensors without explicit angle-domain construction or hand-crafted signal-processing stages. The approach demonstrates geometric recoverability using BEV occupancy as a probe.
Conclusion: End-to-end learning directly from raw radar measurements is feasible for spatial understanding, potentially simplifying radar perception pipelines by eliminating traditional beamforming and signal processing stages.
Abstract: Automotive radar perception pipelines commonly construct angle-domain representations via beamforming before applying learning-based models. This work instead investigates a representational question: can meaningful spatial structure be learned directly from pre-beamforming per-antenna range-Doppler (RD) measurements? Experiments are conducted on a 6-TX x 8-RX (48 virtual antennas) commodity automotive radar employing an A/B chirp-sequence frequency-modulated continuous-wave (CS-FMCW) transmit scheme, in which the effective transmit aperture varies between chirps (single-TX vs. multi-TX), enabling controlled analysis of chirp-dependent transmit configurations. We operate on pre-beamforming per-antenna RD tensors using a dual-chirp shared-weight encoder trained in an end-to-end, fully data-driven manner, and evaluate spatial recoverability using bird’s-eye-view (BEV) occupancy as a geometric probe rather than a performance-driven objective. Supervision is visibility-aware and cross-modal, derived from LiDAR with explicit modeling of the radar field-of-view and occlusion-aware LiDAR observability via ray-based visibility. Through chirp ablations (A-only, B-only, A+B), range-band analysis, and physics-aligned baselines, we assess how transmit configurations affect geometric recoverability. The results indicate that spatial structure can be learned directly from pre-beamforming per-antenna RD tensors without explicit angle-domain construction or hand-crafted signal-processing stages.
[191] Rethinking Representations for Cross-Domain Infrared Small Target Detection: A Generalizable Perspective from the Frequency Domain
Yimin Fu, Songbo Wang, Feiyan Wu, Jialin Lyu, Zhunga Liu, Michael K. Ng
Main category: cs.CV
TL;DR: S²CPNet: A spatial-spectral collaborative perception network for cross-domain infrared small target detection using frequency analysis and phase rectification to handle domain shifts.
Details
Motivation: Existing infrared small target detection methods fail to generalize to unseen domains due to distribution shifts caused by variations in observational conditions and environmental factors, combined with the intrinsic indistinctiveness of infrared small targets that leads to overfitting to domain-specific patterns.Method: Proposes S²CPNet with three key components: 1) Phase Rectification Module (PRM) that analyzes spectral phase inconsistencies to derive generalizable target awareness, 2) Orthogonal Attention Mechanism (OAM) in skip connections to preserve positional information while refining representations, and 3) Selective Style Recomposition (SSR) to mitigate bias toward domain-specific patterns.
Result: Extensive experiments on three IRSTD datasets show the method consistently achieves state-of-the-art performance under diverse cross-domain settings.
Conclusion: The proposed spatial-spectral collaborative approach effectively addresses domain generalization challenges in infrared small target detection by leveraging frequency analysis and phase rectification to create more robust representations.
Abstract: The accurate target-background separation in infrared small target detection (IRSTD) highly depends on the discriminability of extracted representations. However, most existing methods are confined to domain-consistent settings, while overlooking whether such discriminability can generalize to unseen domains. In practice, distribution shifts between training and testing data are inevitable due to variations in observational conditions and environmental factors. Meanwhile, the intrinsic indistinctiveness of infrared small targets aggravates overfitting to domain-specific patterns. Consequently, the detection performance of models trained on source domains can be severely degraded when deployed in unseen domains. To address this challenge, we propose a spatial-spectral collaborative perception network (S$^2$CPNet) for cross-domain IRSTD. Moving beyond conventional spatial learning pipelines, we rethink IRSTD representations from a frequency perspective and reveal inconsistencies in spectral phase as the primary manifestation of domain discrepancies. Based on this insight, we develop a phase rectification module (PRM) to derive generalizable target awareness. Then, we employ an orthogonal attention mechanism (OAM) in skip connections to preserve positional information while refining informative representations. Moreover, the bias toward domain-specific patterns is further mitigated through selective style recomposition (SSR). Extensive experiments have been conducted on three IRSTD datasets, and the proposed method consistently achieves state-of-the-art performance under diverse cross-domain settings.
[192] Captioning Daily Activity Images in Early Childhood Education: Benchmark and Algorithm
Sixing Li, Zhibin Gu, Ziqi Zhang, Weiguo Pan, Bing Li, Ying Wang, Hongzhe Liu
Main category: cs.CV
TL;DR: ECAC introduces a large-scale benchmark for Early Childhood Education image captioning with expert annotations and a domain-specific evaluation metric (TTS), plus RSRS training framework that dynamically switches between RL and supervised learning to improve fine-grained object recognition.
Details
Motivation: Existing image captioning methods for Early Childhood Education lack domain-specific datasets and struggle with fine-grained semantic concepts, producing generic descriptions. Current training paradigms (supervised learning and reinforcement learning) have limitations in enhancing professional object description capabilities.Method: 1) Created ECAC benchmark with 256,121 real-world ECE images with expert captions and fine-grained labels; 2) Introduced Teaching Toy Recognition Score (TTS) for domain-specific evaluation; 3) Proposed RSRS training framework that dynamically switches between reinforcement learning and supervised fine-tuning based on reward signals, rerouting hard samples with zero rewards to supervised optimization.
Result: Developed KinderMM-Cap-3B model that achieves TTS of 51.06, substantially outperforming state-of-the-art baselines while maintaining superior caption quality. The model demonstrates strong performance in specialized educational applications.
Conclusion: The ECAC benchmark and RSRS training framework effectively address domain-specific challenges in ECE image captioning, enabling development of specialized multimodal models with improved fine-grained object recognition capabilities for educational applications.
Abstract: Image captioning for Early Childhood Education (ECE) is essential for automated activity understanding and educational assessment. However, existing methods face two key challenges. First, the lack of large-scale, domain-specific datasets limits the model’s ability to capture fine-grained semantic concepts unique to ECE scenarios, resulting in generic and imprecise descriptions. Second, conventional training paradigms exhibit limitations in enhancing professional object description capability, as supervised learning tends to favor high-frequency expressions, while reinforcement learning may suffer from unstable optimization on difficult samples. To address these limitations, we introduce ECAC, a large-scale benchmark for ECE daily activity image captioning, comprising 256,121 real-world images annotated with expert-level captions and fine-grained labels. ECAC is further equipped with a domain-oriented evaluation protocol, the Teaching Toy Recognition Score (TTS), to explicitly measure professional object naming accuracy. Furthermore, we propose RSRS (Reward-Conditional Switch of Reinforcement Learning and Supervised Fine-Tuning), a hybrid training framework that dynamically alternates between RL and supervised optimization. By rerouting hard samples with zero rewards to supervised fine-tuning, RSRS effectively mitigates advantage collapse and enables stable optimization for fine-grained recognition. Leveraging ECAC and RSRS, we develop KinderMM-Cap-3B, a domain-adapted multimodal large language model. Extensive experiments demonstrate that our model achieves a TTS of 51.06, substantially outperforming state-of-the-art baselines while maintaining superior caption quality, highlighting its potential for specialized educational applications.
[193] A Self supervised learning framework for imbalanced medical imaging datasets
Yash Kumar Sharma, Charan Ramtej Kodi, Vineet Padmanabhan
Main category: cs.CV
TL;DR: AMIMV method extends MIMV with asymmetric multi-image multi-view pairs to address data scarcity and imbalance in medical image classification, showing improved performance on MedMNIST datasets.
Details
Motivation: Address two key problems in medical imaging: 1) Limited labeled training data, and 2) Imbalanced data distribution where rare classes have scarce data. While SSL methods help with data scarcity, their robustness to imbalanced data in medical image classification remains under-explored.Method: Extends MIMV method with new augmentation strategy to construct asymmetric multi-image, multi-view (AMIMV) pairs. Evaluates eight SSL methods on 11 MedMNIST datasets under long-tailed distributions and limited supervision, with data analysis on robustness to varying class imbalance.
Result: Experimental results show improvements: 4.25% on retinaMNIST, 1.88% on tissueMNIST, and 3.1% on DermaMNIST compared to baseline methods.
Conclusion: AMIMV effectively addresses both data scarcity and imbalance in medical image classification, demonstrating robustness across varying degrees of class imbalance and showing consistent improvements on multiple medical imaging datasets.
Abstract: Two problems often plague medical imaging analysis: 1) Non-availability of large quantities of labeled training data, and 2) Dealing with imbalanced data, i.e., abundant data are available for frequent classes, whereas data are highly limited for the rare class. Self supervised learning (SSL) methods have been proposed to deal with the first problem to a certain extent, but the issue of investigating the robustness of SSL to imbalanced data has rarely been addressed in the domain of medical image classification. In this work, we make the following contributions: 1) The MIMV method proposed by us in an earlier work is extended with a new augmentation strategy to construct asymmetric multi-image, multi-view (AMIMV) pairs to address both data scarcity and dataset imbalance in medical image classification. 2) We carry out a data analysis to evaluate the robustness of AMIMV under varying degrees of class imbalance in medical imaging . 3) We evaluate eight representative SSL methods in 11 medical imaging datasets (MedMNIST) under long-tailed distributions and limited supervision. Our experimental results on the MedMNIST dataset show an improvement of 4.25% on retinaMNIST, 1.88% on tissueMNIST, and 3.1% on DermaMNIST.
[194] MAVFusion: Efficient Infrared and Visible Video Fusion via Motion-Aware Sparse Interaction
Xilai Li, Weijun Jiang, Xiaosong Li, Yang Liu, Hongbin Wang, Tao Ye, Huafeng Li, Haishu Tan
Main category: cs.CV
TL;DR: MAVFusion is an efficient video fusion framework that uses motion-aware sparse interaction to combine infrared and visible videos, focusing computational attention on dynamic regions while using lightweight processing for static areas.
Details
Motivation: Existing infrared-visible fusion methods are designed for static images and cannot handle video motion effectively. Current video fusion methods improve temporal consistency but require high computational cost.Method: Proposes MAVFusion with motion-aware sparse interaction: uses optical flow to identify dynamic regions, applies intensive cross-modal attention to sparse dynamic areas, and uses lightweight weak interaction for static background regions.
Result: Achieves state-of-the-art performance on multiple infrared-visible video benchmarks with 14.16 FPS at 640×480 resolution, balancing fusion quality and efficiency.
Conclusion: MAVFusion effectively preserves temporal consistency and fine-grained details while accelerating inference through decoupled processing of dynamic and static regions.
Abstract: Infrared and visible video fusion combines the object saliency from infrared images with the texture details from visible images to produce semantically rich fusion results. However, most existing methods are designed for static image fusion and cannot effectively handle frame-to-frame motion in videos. Current video fusion methods improve temporal consistency by introducing interactions across frames, but they often require high computational cost. To mitigate these challenges, we propose MAVFusion, an end-to-end video fusion framework featuring a motion-aware sparse interaction mechanism that enhances efficiency while maintaining superior fusion quality. Specifically, we leverage optical flow to identify dynamic regions in multi-modal sequences, adaptively allocating computationally intensive cross-modal attention to these sparse areas to capture salient transitions and facilitate inter-modal information exchange. For static background regions, a lightweight weak interaction module is employed to maintain structural and appearance integrity. By decoupling the processing of dynamic and static regions, MAVFusion simultaneously preserves temporal consistency and fine-grained details while significantly accelerating inference. Extensive experiments demonstrate that MAVFusion achieves state-of-the-art performance on multiple infrared and visible video benchmarks, achieving a speed of 14.16,FPS at $640 \times 480$ resolution. The source code will be available at https://github.com/ixilai/MAVFusion.
[195] Automated Prostate Gland Segmentation in MRI Using nnU-Net
Pablo Rodriguez-Belenguer, Gloria Ribas, Javier Aquerreta Escribano, Rafael Moreno-Calatayud, Leonor Cerda-Alberich, Luis Marti-Bonmati
Main category: cs.CV
TL;DR: Deep learning model for automatic prostate gland segmentation in multiparametric MRI using nnU-Net v2 framework with multimodal data (T2-weighted, DWI, ADC maps)
Details
Motivation: Manual prostate segmentation in mpMRI is time-consuming and variable, while general-purpose tools lack accuracy for prostate-specific tasks, creating need for dedicated automated solutionMethod: Uses nnU-Net v2 framework with multimodal mpMRI data (T2-weighted, DWI, ADC maps) trained on 981 cases from PI-CAI dataset with whole-gland annotations, evaluated via 5-fold cross-validation and external validation
Result: Achieved mean Dice score of 0.96 in cross-validation and 0.82 on external test set, significantly outperforming general-purpose TotalSegmentator (Dice 0.15)
Conclusion: Task-specific multimodal segmentation strategies are crucial for accurate prostate segmentation, and the proposed approach shows strong generalization potential for clinical research workflows
Abstract: Accurate segmentation of the prostate gland in multiparametric MRI (mpMRI) is a fundamental step for a wide range of clinical and research applications, including image registration, volume estimation, and radiomic analysis. However, manual delineation is time-consuming and subject to inter-observer variability, while general-purpose segmentation tools often fail to provide sufficient accuracy for prostate-specific tasks. In this work, we propose a dedicated deep learning-based approach for automatic prostate gland segmentation using the nnU-Net v2 framework. The model leverages multimodal mpMRI data, including T2-weighted imaging, diffusion-weighted imaging (DWI), and apparent diffusion coefficient (ADC) maps, to exploit complementary tissue information. Training was performed on 981 cases from the PI-CAI dataset using whole-gland annotations, and model performance was assessed through 5-fold cross-validation and external validation on an independent cohort of 54 patients from Hospital La Fe. The proposed model achieved a mean Dice score of 0.96 +/- 0.00 in cross-validation and 0.82 on the external test set, demonstrating strong generalization despite domain shift. In comparison, a general-purpose approach (TotalSegmentator) showed substantially lower performance, with a Dice score of 0.15, primarily due to under-segmentation of the gland. These results highlight the importance of task-specific, multimodal segmentation strategies and demonstrate the potential of the proposed approach for reliable integration into clinical research workflows. To facilitate reproducibility and deployment, the model has been fully containerized and is available as a ready-to-use inference tool.
[196] Ego-Grounding for Personalized Question-Answering in Egocentric Videos
Junbin Xiao, Shenglang Zhang, Pengxiang Zhu, Angela Yao
Main category: cs.CV
TL;DR: MyEgo: First systematic analysis of MLLMs for ego-grounding in personalized QA from egocentric videos, revealing significant performance gaps vs humans
Details
Motivation: To evaluate MLLMs' ability to understand, remember, and reason about the camera-wearer in egocentric videos for personalized question-answering, addressing the gap in ego-grounding capabilitiesMethod: Introduces MyEgo dataset with 541 long videos and 5K personalized questions about “my things”, “my activities”, and “my past”; benchmarks various MLLM variants including open-source vs proprietary, thinking vs non-thinking, small vs large scales
Result: Top models achieve only ~46% (GPT-5) and 36% (Qwen3-VL) accuracy, trailing human performance by 40-50%; neither explicit reasoning nor model scaling yield consistent improvements; models struggle with long-range memory and tracking “me” over time
Conclusion: Highlights crucial role of ego-grounding and long-range memory for personalized QA in egocentric videos; MyEgo dataset and analyses aim to catalyze progress in egocentric personalized assistance
Abstract: We present the first systematic analysis of multimodal large language models (MLLMs) in personalized question-answering requiring ego-grounding - the ability to understand the camera-wearer in egocentric videos. To this end, we introduce MyEgo, the first egocentric VideoQA dataset designed to evaluate MLLMs’ ability to understand, remember, and reason about the camera wearer. MyEgo comprises 541 long videos and 5K personalized questions asking about “my things”, “my activities”, and “my past”. Benchmarking reveals that competitive MLLMs across variants, including open-source vs. proprietary, thinking vs. non-thinking, small vs. large scales all struggle on MyEgo. Top closed- and open-source models (e.g., GPT-5 and Qwen3-VL) achieve only~46% and 36% accuracy, trailing human performance by near 40% and 50% respectively. Surprisingly, neither explicit reasoning nor model scaling yield consistent improvements. Models improve when relevant evidence is explicitly provided, but gains drop over time, indicating limitations in tracking and remembering “me” and “my past”. These findings collectively highlight the crucial role of ego-grounding and long-range memory in enabling personalized QA in egocentric videos. We hope MyEgo and our analyses catalyze further progress in these areas for egocentric personalized assistance. Data and code are available at https://github.com/Ryougetsu3606/MyEgo
[197] SDesc3D: Towards Layout-Aware 3D Indoor Scene Generation from Short Descriptions
Jie Feng, Jiawei Shen, Junjia Huang, Junpeng Zhang, Mingtao Feng, Weisheng Dong, Guanbin Li
Main category: cs.CV
TL;DR: SDesc3D: A framework for 3D indoor scene generation from short text descriptions using multi-view structural priors and functionality-aware layout grounding to address poor physical plausibility in existing methods.
Details
Motivation: Existing text-conditioned 3D scene generation methods suffer from poor physical plausibility and insufficient detail richness when using short textual descriptions, due to reliance on explicit semantic cues about objects and spatial relationships. There's a need for enhanced 3D reasoning capabilities with better prior integration and spatial anchoring.Method: 1) Multi-view scene prior augmentation to enrich underspecified text inputs with aggregated structural knowledge; 2) Functionality-aware layout grounding using regional functionality for implicit spatial anchors and hierarchical layout reasoning; 3) Iterative reflection-rectification scheme for progressive structural plausibility refinement via self-rectification.
Result: Extensive experiments show the method outperforms existing approaches on short-text conditioned 3D indoor scene generation.
Conclusion: SDesc3D successfully addresses limitations in short-text conditioned 3D scene generation by leveraging multi-view structural priors and functionality implications, enabling better 3D layout reasoning under sparse textual guidance.
Abstract: 3D indoor scene generation conditioned on short textual descriptions provides a promising avenue for interactive 3D environment construction without the need for labor-intensive layout specification. Despite recent progress in text-conditioned 3D scene generation, existing works suffer from poor physical plausibility and insufficient detail richness in such semantic condensation cases, largely due to their reliance on explicit semantic cues about compositional objects and their spatial relationships. This limitation highlights the need for enhanced 3D reasoning capabilities, particularly in terms of prior integration and spatial anchoring.Motivated by this, we propose SDesc3D, a short-text conditioned 3D indoor scene generation framework, that leverages multi-view structural priors and regional functionality implications to enable 3D layout reasoning under sparse textual guidance.Specifically, we introduce a Multi-view scene prior augmentation that enriches underspecified textual inputs with aggregated multi-view structural knowledge, shifting from inaccessible semantic relation cues to multi-view relational prior aggregation. Building on this, we design a Functionality-aware layout grounding, employing regional functionality grounding for implicit spatial anchors and conducting hierarchical layout reasoning to enhance scene organization and semantic plausibility.Furthermore, an Iterative reflection-rectification scheme is employed for progressive structural plausibility refinement via self-rectification.Extensive experiments show that our method outperforms existing approaches on short-text conditioned 3D indoor scene generation.Code will be publicly available.
[198] NearID: Identity Representation Learning via Near-identity Distractors
Aleksandar Cvejic, Rameen Abdal, Abdelrahman Eldesokey, Bernard Ghanem, Peter Wonka
Main category: cs.CV
TL;DR: A framework using Near-identity distractors to evaluate and improve identity-focused vision tasks by eliminating background context shortcuts, with a dataset and contrastive learning approach that significantly improves identity discrimination.
Details
Motivation: Existing vision encoders entangle object identity with background context, leading to unreliable representations and metrics for identity-focused tasks like personalized generation and image editing.Method: Introduces Near-identity (NearID) distractors - semantically similar but distinct instances on the same background as reference images. Creates NearID dataset (19K identities, 316K matched-context distractors) with strict margin-based evaluation protocol. Uses two-tier contrastive learning on frozen backbone to enforce hierarchy: same identity > NearID distractor > random negative.
Result: Pre-trained encoders perform poorly (SSR as low as 30.7%). The proposed method improves SSR to 99.2%, enhances part-level discrimination by 28.0%, and yields stronger alignment with human judgments on DreamBench++ benchmark.
Conclusion: The NearID framework effectively isolates identity from background context, enabling more reliable evaluation and improved performance on identity-focused vision tasks through principled dataset construction and contrastive learning.
Abstract: When evaluating identity-focused tasks such as personalized generation and image editing, existing vision encoders entangle object identity with background context, leading to unreliable representations and metrics. We introduce the first principled framework to address this vulnerability using Near-identity (NearID) distractors, where semantically similar but distinct instances are placed on the exact same background as a reference image, eliminating contextual shortcuts and isolating identity as the sole discriminative signal. Based on this principle, we present the NearID dataset (19K identities, 316K matched-context distractors) together with a strict margin-based evaluation protocol. Under this setting, pre-trained encoders perform poorly, achieving Sample Success Rates (SSR), a strict margin-based identity discrimination metric, as low as 30.7% and often ranking distractors above true cross-view matches. We address this by learning identity-aware representations on a frozen backbone using a two-tier contrastive objective enforcing the hierarchy: same identity > NearID distractor > random negative. This improves SSR to 99.2%, enhances part-level discrimination by 28.0%, and yields stronger alignment with human judgments on DreamBench++, a human-aligned benchmark for personalization. Project page: https://gorluxor.github.io/NearID/
[199] Interactive Tracking: A Human-in-the-Loop Paradigm with Memory-Augmented Adaptation
Yuqing Huang, Guotian Zeng, Zhenqiao Yuan, Zhenyu He, Xin Li, Yaowei Wang, Ming-Hsuan Yang
Main category: cs.CV
TL;DR: Interactive visual tracking benchmark with language guidance, showing current trackers fail in interactive scenarios and proposing a memory-augmented baseline.
Details
Motivation: Current visual trackers operate in a non-interactive, fire-and-forget manner, making them impractical for real-world scenarios requiring human-in-the-loop adaptation. There's a need for systems that can accept natural language guidance during tracking.Method: Three main contributions: 1) InteractTrack benchmark with 150 videos, dense bounding box annotations, and timestamped language instructions; 2) Comprehensive evaluation protocol testing 25 representative trackers; 3) Interactive Memory-Augmented Tracking (IMAT) baseline using dynamic memory to learn from user feedback.
Result: State-of-the-art trackers fail in interactive scenarios - strong performance on conventional benchmarks doesn’t transfer. The proposed IMAT baseline shows improved adaptation to user guidance.
Conclusion: Establishes foundation for more intelligent, adaptive tracking systems that bridge automated perception and human guidance through natural language interaction.
Abstract: Existing visual trackers mainly operate in a non-interactive, fire-and-forget manner, making them impractical for real-world scenarios that require human-in-the-loop adaptation. To overcome this limitation, we introduce Interactive Tracking, a new paradigm that allows users to guide the tracker at any time using natural language commands. To support research in this direction, we make three main contributions. First, we present InteractTrack, the first large-scale benchmark for interactive tracking, containing 150 videos with dense bounding box annotations and timestamped language instructions. Second, we propose a comprehensive evaluation protocol and evaluate 25 representative trackers, showing that state-of-the-art methods fail in interactive scenarios; strong performance on conventional benchmarks does not transfer. Third, we introduce Interactive Memory-Augmented Tracking (IMAT), a new baseline that employs a dynamic memory mechanism to learn from user feedback and update tracking behavior accordingly. Our benchmark, protocol, and baseline establish a foundation for developing more intelligent, adaptive, and collaborative tracking systems, bridging the gap between automated perception and human guidance. The full benchmark, tracking results, and analysis are available at https://github.com/NorahGreen/InteractTrack.git.
[200] Curia-2: Scaling Self-Supervised Learning for Radiology Foundation Models
Antoine Saporta, Baptiste Callard, Corentin Dancette, Julien Khlaut, Charles Corbière, Leo Butsanets, Amaury Prat, Pierre Manceron
Main category: cs.CV
TL;DR: Curia-2 improves medical imaging foundation models with better pre-training for CT/MRI, scaling to billion-parameter vision transformers, and introduces 2D/3D evaluation benchmarks.
Details
Motivation: To address the unsustainable workload on radiologists by developing improved foundation models for medical imaging that can better capture the complexities of radiological data (CT and MRI volumes).Method: Builds upon Curia framework with enhanced pre-training strategy and representation quality, enabling scaling to billion-parameter Vision Transformers for multi-modal CT and MRI analysis. Introduces CuriaBench with 2D (slice-based) and 3D (volumetric) evaluation tracks.
Result: Curia-2 outperforms all foundation models on vision-focused tasks and performs competitively with vision-language models on clinically complex tasks like finding detection.
Conclusion: Curia-2 represents a significant advancement in medical imaging foundation models, with improved pre-training strategies and evaluation frameworks that enable better analysis of complex radiological data.
Abstract: The rapid growth of medical imaging has fueled the development of Foundation Models (FMs) to reduce the growing, unsustainable workload on radiologists. While recent FMs have shown the power of large-scale pre-training to CT and MRI analysis, there remains significant room to optimize how these models learn from complex radiological volumes. Building upon the Curia framework, this work introduces Curia-2, which significantly improves the original pre-training strategy and representation quality to better capture the specificities of radiological data. The proposed methodology enables scaling the architecture up to billion-parameter Vision Transformers, marking a first for multi-modal CT and MRI FMs. Furthermore, we formalize the evaluation of these models by extending and restructuring CuriaBench into two distinct tracks: a 2D track tailored for slice-based vision models and a 3D track for volumetric benchmarking. Our results demonstrate that Curia-2 outperforms all FMs on vision-focused tasks and fairs competitively to vision-language models on clinically complex tasks such as finding detection. Weights will be made publicly available to foster further research.
[201] Attention at Rest Stays at Rest: Breaking Visual Inertia for Cognitive Hallucination Mitigation
Boyang Gong, Yu Zheng, Fanye Kong, Jie Zhou, Jiwen Lu
Main category: cs.CV
TL;DR: IVE addresses cognitive hallucinations in MLLMs by breaking visual attention inertia through dynamic excitation of emerging visual tokens and penalizing persistent attention patterns.
Details
Motivation: Current MLLMs exhibit visual attention inertia where attention remains static after early decoding, failing to support compositional understanding and cognitive inference. Existing hallucination mitigation methods focus on perceptual hallucinations but are inadequate for cognitive hallucinations requiring inter-object relational deduction.Method: Proposes Inertia-aware Visual Excitation (IVE), a training-free method that models cognitive inference as dynamic responsiveness of visual attention. It selects visual tokens dynamically emerging relative to historical attention trends while distinguishing inertial tokens, and introduces an inertia-aware penalty to discourage over-concentration and limit attention persistence in localized regions.
Result: Extensive experiments show IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.
Conclusion: IVE successfully addresses cognitive hallucinations in MLLMs by breaking visual attention inertia patterns, enabling better compositional understanding and relational inference through dynamic visual attention mechanisms.
Abstract: Like a body at rest that stays at rest, we find that visual attention in multimodal large language models (MLLMs) exhibits pronounced inertia, remaining largely static once settled during early decoding steps and failing to support the compositional understanding required for cognitive inference. While existing hallucination mitigation methods mainly target perceptual hallucinations concerning object existence or attributes, they remain inadequate for such cognitive hallucinations that require inter-object relational deduction. Through token-wise attention analysis, we identify this visual inertia as a key factor: attention to semantically critical regions remains persistently focused and fails to dynamically support relational inference. We thereby propose a training-free Inertia-aware Visual Excitation (IVE) method that breaks this inertial pattern by modeling cognitive inference as the dynamic responsiveness of visual attention. Specifically, IVE selects visual tokens that are dynamically emerging relative to historical attention trends while distinguishing tokens exhibiting inertial behavior. To further facilitate compositional inference, IVE introduces an inertia-aware penalty that discourages over-concentration and limits the persistence of attention within localized regions. Extensive experiments show that IVE is effective across various base MLLMs and multiple hallucination benchmarks, particularly for cognitive hallucinations.
[202] Resonance4D: Frequency-Domain Motion Supervision for Preset-Free Physical Parameter Learning in 4D Dynamic Physical Scene Simulation
Changshe Zhang, Jie Feng, Siyu Chen, Guanbin Li, Ronghua Shang, Junpeng Zhang
Main category: cs.CV
TL;DR: Resonance4D is a physics-driven 4D dynamic simulation framework that couples 3D Gaussian Splatting with Material Point Method using lightweight dual-domain motion supervision to enable efficient physics-driven 4D simulation on consumer GPUs.
Details
Motivation: Existing physics-driven 4D dynamic simulation methods face two main issues: 1) they rely on computationally expensive online video diffusion or optical-flow pipelines for motion supervision, and 2) they simplify inverse physical modeling by optimizing only partial material parameters, limiting realism in complex scenes.Method: Resonance4D combines 3D Gaussian Splatting with Material Point Method using Dual-domain Motion Supervision (DMS) that enforces spatial structural consistency for local deformation and frequency-domain spectral consistency for oscillatory/global dynamics. It also uses zero-shot text-prompted segmentation with simulation-guided initialization to decompose Gaussians into object-part-level regions for full material parameter optimization.
Result: The method achieves strong physical fidelity and motion consistency while reducing peak GPU memory from over 35GB to around 20GB, enabling high-fidelity physics-driven 4D simulation on a single consumer-grade GPU. Experiments on synthetic and real scenes demonstrate effectiveness.
Conclusion: Resonance4D provides an efficient framework for physics-driven 4D dynamic simulation that addresses computational cost and parameter optimization limitations of existing methods through lightweight dual-domain supervision and full material parameter recovery.
Abstract: Physics-driven 4D dynamic simulation from static 3D scenes remains constrained by an overlooked contradiction: reliable motion supervision often relies on online video diffusion or optical-flow pipelines whose computational cost exceeds that of the simulator itself. Existing methods further simplify inverse physical modeling by optimizing only partial material parameters, limiting realism in scenes with complex materials and dynamics. We present Resonance4D, a physics-driven 4D dynamic simulation framework that couples 3D Gaussian Splatting with the Material Point Method through lightweight yet physically expressive supervision. Our key insight is that dynamic consistency can be enforced without dense temporal generation by jointly constraining motion in complementary domains. To this end, we introduce Dual-domain Motion Supervision (DMS), which combines spatial structural consistency for local deformation with frequency-domain spectral consistency for oscillatory and global dynamic patterns, substantially reducing training cost and memory overhead while preserving physically meaningful motion cues. To enable stable full-parameter physical recovery, we further combine zero-shot text-prompted segmentation with simulation-guided initialization to automatically decompose Gaussians into object-part-level regions and support joint optimization of full material parameters. Experiments on both synthetic and real scenes show that Resonance4D achieves strong physical fidelity and motion consistency while reducing peak GPU memory from over 35,GB to around 20,GB, enabling high-fidelity physics-driven 4D simulation on a single consumer-grade GPU.
[203] MTLSI-Net: A Linear Semantic Interaction Network for Parameter-Efficient Multi-Task Dense Prediction
Chen Liu, Hengyu Man, Xiaopeng Fan, Debin Zhao
Main category: cs.CV
TL;DR: MTLSI-Net is a multi-task dense prediction model that uses linear attention mechanisms to efficiently capture cross-task interactions at linear complexity, achieving SOTA performance on NYUDv2 and PASCAL-Context datasets.
Details
Motivation: Standard self-attention for capturing global cross-task interactions in multi-task dense prediction suffers from quadratic complexity on high-resolution features, making it computationally expensive and inefficient.Method: Proposes MTLSI-Net with three key components: 1) Multi-Task Multi-scale Query Linear Fusion Block for cross-task dependencies across scales with linear complexity, 2) Semantic Token Distiller to compress features into compact semantic tokens, and 3) Cross-Window Integrated attention Block that injects global semantics into local features via dual-branch architecture.
Result: Extensive experiments on NYUDv2 and PASCAL-Context datasets demonstrate state-of-the-art performance, validating the effectiveness and efficiency of the proposed approach in multi-task learning.
Conclusion: MTLSI-Net successfully addresses the computational limitations of standard self-attention for cross-task interactions in multi-task dense prediction, achieving both high performance and efficiency through linear attention mechanisms.
Abstract: Multi-task dense prediction aims to perform multiple pixel-level tasks simultaneously. However, capturing global cross-task interactions remains non-trivial due to the quadratic complexity of standard self-attention on high-resolution features. To address this limitation, we propose a Multi-Task Linear Semantic Interaction Network (MTLSI-Net), which facilitates cross-task interaction through linear attention. Specifically, MTLSI-Net incorporates three key components: a Multi-Task Multi-scale Query Linear Fusion Block, which captures cross-task dependencies across multiple scales with linear complexity using a shared global context matrix; a Semantic Token Distiller that compresses redundant features into compact semantic tokens, distilling essential cross-task knowledge; and a Cross-Window Integrated attention Block that injects global semantics into local features via a dual-branch architecture, preserving both global consistency and spatial precision. These components collectively enable the network to capture comprehensive cross-task interactions at linear complexity with reduced parameters. Extensive experiments on NYUDv2 and PASCAL-Context demonstrate that MTLSI-Net achieves state-of-the-art performance, validating its effectiveness and efficiency in multi-task learning.
[204] ProDiG: Progressive Diffusion-Guided Gaussian Splatting for Aerial to Ground Reconstruction
Sirshapan Mitra, Yogesh S. Rawat
Main category: cs.CV
TL;DR: ProDiG: A diffusion-guided framework that progressively transforms aerial 3D Gaussian representations to ground-level fidelity using geometry-aware attention and distance-adaptive Gaussian modules.
Details
Motivation: Existing methods for generating ground-level views from aerial imagery struggle with extreme viewpoint changes, missing intermediate observations, and large scale variations. Current approaches either produce geometrically inconsistent results or require multi-altitude ground-truth data that is rarely available.Method: ProDiG uses a progressive diffusion-guided framework with two key components: 1) Geometry-aware causal attention module that injects epipolar structure into reference-view diffusion, and 2) Distance-adaptive Gaussian module that dynamically adjusts Gaussian scale and opacity based on camera distance.
Result: Extensive experiments on synthetic and real-world datasets show ProDiG produces visually realistic ground-level renderings and coherent 3D geometry, significantly outperforming existing approaches in visual quality, geometric consistency, and robustness to extreme viewpoint changes.
Conclusion: ProDiG enables progressive, geometrically grounded refinement of aerial 3D representations to ground-level fidelity without requiring additional ground-truth viewpoints, addressing fundamental challenges in cross-view synthesis.
Abstract: Generating ground-level views and coherent 3D site models from aerial-only imagery is challenging due to extreme viewpoint changes, missing intermediate observations, and large scale variations. Existing methods either refine renderings post-hoc, often producing geometrically inconsistent results, or rely on multi-altitude ground-truth, which is rarely available. Gaussian Splatting and diffusion-based refinements improve fidelity under small variations but fail under wide aerial-to-ground gaps. To address these limitations, we introduce ProDiG (Progressive Altitude Gaussian Splatting), a diffusion-guided framework that progressively transforms aerial 3D representations toward ground-level fidelity. ProDiG synthesizes intermediate-altitude views and refines the Gaussian representation at each stage using a geometry-aware causal attention module that injects epipolar structure into reference-view diffusion. A distance-adaptive Gaussian module dynamically adjusts Gaussian scale and opacity based on camera distance, ensuring stable reconstruction across large viewpoint gaps. Together, these components enable progressive, geometrically grounded refinement without requiring additional ground-truth viewpoints. Extensive experiments on synthetic and real-world datasets demonstrate that ProDiG produces visually realistic ground-level renderings and coherent 3D geometry, significantly outperforming existing approaches in terms of visual quality, geometric consistency, and robustness to extreme viewpoint changes.
[205] Test-Time Adaptation for Height Completion via Self-Supervised ViT Features and Monocular Foundation Models
Osher Rafaeli, Tal Svoray, Ariel Nahlieli
Main category: cs.CV
TL;DR: Prior2DSM is a training-free framework for metric digital surface model completion using foundation models and test-time adaptation, achieving significant error reduction compared to existing methods.
Details
Motivation: Large-scale digital surface models often have incomplete or outdated regions due to acquisition limitations, reconstruction artifacts, or environmental changes. Traditional interpolation methods fail with missing objects, while learning-based approaches require supervised training on sensor-specific data, limiting generalization across domains.Method: The proposed method combines self-supervised Vision Transformer features from DINOv3 with monocular depth foundation models to propagate metric information through semantic feature-space correspondence. It uses test-time adaptation with parameter-efficient low-rank adaptation (LoRA) and a lightweight MLP to predict spatially varying scale and shift parameters for converting relative depth estimates into metric heights.
Result: Experiments show consistent improvements over interpolation methods, prior-based rescaling approaches, and state-of-the-art monocular depth estimation models. Prior2DSM achieves up to 46% reduction in RMSE compared to linear fitting of MDE, while preserving structural fidelity and enabling DSM updating and coupled RGB-DSM generation.
Conclusion: Prior2DSM provides an effective training-free framework for metric DSM completion that leverages foundation models and test-time adaptation, offering better generalization across domains and sensing conditions compared to supervised approaches.
Abstract: Accurate digital surface models (DSMs) are essential for many geospatial applications, including urban monitoring, environmental analyses, infrastructure management, and change detection. However, large-scale DSMs frequently contain incomplete or outdated regions due to acquisition limitations, reconstruction artifacts, or changes in the built environment. Traditional height completion approaches primarily rely on spatial interpolation or which assume spatial continuity and therefore fail when objects are missing. Recent learning-based approaches improve reconstruction quality but typically require supervised training on sensor-specific datasets, limiting their generalization across domains and sensing conditions. We propose Prior2DSM, a training-free framework for metric DSM completion that operates entirely at test time by leveraging foundation models. Unlike previous height completion approaches that require task-specific training, the proposed method combines self-supervised Vision Transformer (ViT) features from DINOv3 with monocular depth foundation models to propagate metric information from incomplete height priors through semantic feature-space correspondence. Test-time adaptation (TTA) is performed using parameter-efficient low-rank adaptation (LoRA) together with a lightweight multilayer perceptron (MLP), which predicts spatially varying scale and shift parameters to convert relative depth estimates into metric heights. Experiments demonstrate consistent improvements over interpolation based methods, prior-based rescaling height approaches, and state-of-the-art monocular depth estimation models. Prior2DSM reduces reconstruction error while preserving structural fidelity, achieving up to a 46% reduction in RMSE compared to linear fitting of MDE, and further enables DSM updating and coupled RGB-DSM generation.
[206] Decouple and Rectify: Semantics-Preserving Structural Enhancement for Open-Vocabulary Remote Sensing Segmentation
Jie Feng, Fengze Li, Junpeng Zhang, Siyu Chen, Yuping Liang, Junying Chen, Ronghua Shang
Main category: cs.CV
TL;DR: DR-Seg is a decouple-and-rectify framework for open-vocabulary semantic segmentation in remote sensing that separates CLIP features into semantic and structural subspaces for targeted enhancement without disrupting language alignment.
Details
Motivation: Current methods for remote sensing semantic segmentation struggle to balance CLIP's language-aligned recognition with fine-grained spatial delineation. They treat CLIP features as a uniform semantic space and cannot localize where structural enhancement is needed, risking disruption of semantic integrity while failing to effectively delineate boundaries.Method: DR-Seg decouples CLIP features into semantics-dominated and structure-dominated subspaces, enabling targeted structural enhancement using DINO features. It uses a prior-driven graph rectification module to inject high-fidelity structural priors under DINO guidance, and an uncertainty-guided adaptive fusion module to dynamically integrate the refined branch with original CLIP features.
Result: Comprehensive experiments across eight benchmarks demonstrate that DR-Seg establishes a new state-of-the-art for open-vocabulary semantic segmentation in remote sensing.
Conclusion: DR-Seg effectively addresses the limitations of previous methods by leveraging CLIP’s functional heterogeneity to enable targeted structural enhancement without compromising language-aligned semantics, achieving superior performance in remote sensing segmentation tasks.
Abstract: Open-vocabulary semantic segmentation in the remote sensing (RS) field requires both language-aligned recognition and fine-grained spatial delineation. Although CLIP offers robust semantic generalization, its global-aligned visual representations inherently struggle to capture structural details. Recent methods attempt to compensate for this by introducing RS-pretrained DINO features. However, these methods treat CLIP representations as a monolithic semantic space and cannot localize where structural enhancement is required, failing to effectively delineate boundaries while risking the disruption of CLIP’s semantic integrity. To address this limitation, we propose DR-Seg, a novel decouple-and-rectify framework in this paper. Our method is motivated by the key observation that CLIP feature channels exhibit distinct functional heterogeneity rather than forming a uniform semantic space. Building on this insight, DR-Seg decouples CLIP features into semantics-dominated and structure-dominated subspaces, enabling targeted structural enhancement by DINO without distorting language-aligned semantics. Subsequently, a prior-driven graph rectification module injects high-fidelity structural priors under DINO guidance to form a refined branch, while an uncertainty-guided adaptive fusion module dynamically integrates this refined branch with the original CLIP branch for final prediction. Comprehensive experiments across eight benchmarks demonstrate that DR-Seg establishes a new state-of-the-art.
[207] Are VLMs Lost Between Sky and Space? LinkS$^2$Bench for UAV-Satellite Dynamic Cross-View Spatial Intelligence
Dian Liu, Jie Feng, Di Li, Yuhui Zheng, Guanbin Li, Weisheng Dong, Guangming Shi
Main category: cs.CV
TL;DR: LinkS²Bench: A benchmark for evaluating Vision-Language Models’ cross-view spatial intelligence linking dynamic UAV footage with satellite imagery for emergency response applications.
Details
Motivation: Existing benchmarks are limited to isolated UAV videos or static satellite imagery, failing to evaluate the complex interplay between macro-scale global coverage and dynamic local perception needed for emergency response and security operations.Method: Created LinkS²Bench with 1,022 minutes of UAV footage linked to high-resolution satellite imagery covering 200+ km², using LMM-assisted pipeline and human annotation to generate 17.9k QA pairs across 12 tasks in four dimensions: perception, localization, relation, and reasoning.
Result: Evaluation of 18 VLMs shows substantial gap compared to human baselines, identifying cross-view dynamic alignment as the critical bottleneck. Proposed Cross-View Alignment Adapter significantly improves model performance, and fine-tuning experiments demonstrate the benchmark’s utility for advancing VLM adaptation.
Conclusion: LinkS²Bench addresses a critical gap in evaluating VLMs’ spatial intelligence for emergency applications, reveals current model limitations, and provides a pathway for improvement through explicit cross-view alignment techniques.
Abstract: Synergistic spatial intelligence between UAVs and satellites is indispensable for emergency response and security operations, as it uniquely integrates macro-scale global coverage with dynamic, real-time local perception. However, the capacity of Vision-Language Models (VLMs) to master this complex interplay remains largely unexplored. This gap persists primarily because existing benchmarks are confined to isolated Unmanned Aerial Vehicle (UAV) videos or static satellite imagery, failing to evaluate the dynamic local-to-global spatial mapping essential for comprehensive cross-view reasoning. To bridge this gap, we introduce LinkS$^2$Bench, the first comprehensive benchmark designed to evaluate VLMs’ wide-area, dynamic cross-view spatial intelligence. LinkS$^2$Bench links 1,022 minutes of dynamic UAV footage with high-resolution satellite imagery covering over 200 km$^2$. Through an LMM-assisted pipeline and rigorous human annotation, we constructed 17.9k high-quality question-answer pairs comprising 12 fine-grained tasks across four dimensions: perception, localization, relation, and reasoning. Evaluations of 18 representative VLMs reveal a substantial gap compared to human baselines, identifying accurate cross-view dynamic alignment as the critical bottleneck. To alleviate this, we design a Cross-View Alignment Adapter, demonstrating that explicit alignment significantly improves model performance. Furthermore, fine-tuning experiments underscore the potential of LinkS$^2$Bench in advancing VLM adaptation for complex spatial reasoning.
[208] Rare-Aware Autoencoding: Reconstructing Spatially Imbalanced Data
Alejandro Castañeda Garcia, Jan van Gemert, Daan Brinks, Nergis Tömen
Main category: cs.CV
TL;DR: Autoencoder method addressing spatial imbalance in image reconstruction via self-entropy loss and sample propagation replay mechanism.
Details
Motivation: Autoencoders struggle with spatially non-uniform sampling where informative patterns occur rarely at specific coordinates, causing bias toward dominant patterns and loss of fine-grained detail, especially in medical imaging, biology, and physics domains.Method: Two complementary components: (1) self-entropy-based loss that upweights statistically uncommon spatial locations, and (2) Sample Propagation - a replay mechanism that selectively re-exposes the model to hard-to-reconstruct samples across batches during training.
Result: Outperforms baselines on various reconstruction metrics, particularly under spatial imbalance distributions, validated in simulated dataset with controlled spatial imbalance and three diverse real-world datasets spanning physical, biological, and astronomical domains.
Conclusion: The approach effectively addresses spatial imbalance in unsupervised image reconstruction by emphasizing rare samples and highlighting the importance of data representation in batches.
Abstract: Autoencoders can be challenged by spatially non-uniform sampling of image content. This is common in medical imaging, biology, and physics, where informative patterns occur rarely at specific image coordinates, as background dominates these locations in most samples, biasing reconstructions toward the majority appearance. In practice, autoencoders are biased toward dominant patterns resulting in the loss of fine-grained detail and causing blurred reconstructions for rare spatial inputs especially under spatial data imbalance. We address spatial imbalance by two complementary components: (i) self-entropy-based loss that upweights statistically uncommon spatial locations and (ii) Sample Propagation, a replay mechanism that selectively re-exposes the model to hard to reconstruct samples across batches during training. We benchmark existing data balancing strategies, originally developed for supervised classification, in the unsupervised reconstruction setting. Drawing on the limitations of these approaches, our method specifically targets spatial imbalance by encouraging models to focus on statistically rare locations, improving reconstruction consistency compared to existing baselines. We validate in a simulated dataset with controlled spatial imbalance conditions, and in three, uncontrolled, diverse real-world datasets spanning physical, biological, and astronomical domains. Our approach outperforms baselines on various reconstruction metrics, particularly under spatial imbalance distributions. These results highlight the importance of data representation in a batch and emphasize rare samples in unsupervised image reconstruction. We will make all code and related data available.
[209] IndoorCrowd: A Multi-Scene Dataset for Human Detection, Segmentation, and Tracking with an Automated Annotation Pipeline
Sebastian-Ion Nae, Radu Moldoveanu, Alexandra Stefania Ghita, Adina Magda Florea
Main category: cs.CV
TL;DR: IndoorCrowd dataset for indoor human detection, segmentation, and tracking across 4 campus locations with 31 videos and 9,913 frames, featuring human-verified masks and benchmarking of foundation models.
Details
Motivation: Existing datasets lack real-world indoor complexity at scale for understanding human behavior in crowded indoor environments, which is crucial for surveillance, smart buildings, and human-robot interaction.Method: Created IndoorCrowd dataset with 31 videos across 4 campus locations, collected 9,913 frames at 5fps with human-verified per-instance segmentation masks. Benchmarked three foundation-model auto-annotators (SAM3, GroundingSAM, EfficientGroundingSAM) against human labels using multiple metrics. Established baselines using YOLOv8n, YOLOv26n, RT-DETR-L paired with ByteTrack, BoT-SORT, and OC-SORT.
Result: Dataset includes 620-frame control subset for benchmarking and 2,552-frame subset for multi-object tracking. Per-scene analysis shows substantial difficulty variation driven by crowd density, scale, and occlusion, with ACS-EC being the most challenging scene (79.3% dense frames, mean instance scale 60.8px).
Conclusion: IndoorCrowd provides a comprehensive benchmark dataset for indoor human understanding tasks, revealing significant scene-dependent challenges and enabling evaluation of foundation models and tracking algorithms in complex real-world indoor settings.
Abstract: Understanding human behaviour in crowded indoor environments is central to surveillance, smart buildings, and human-robot interaction, yet existing datasets rarely capture real-world indoor complexity at scale. We introduce IndoorCrowd, a multi-scene dataset for indoor human detection, instance segmentation, and multi-object tracking, collected across four campus locations (ACS-EC, ACS-EG, IE-Central, R-Central). It comprises $31$ videos ($9{,}913$ frames at $5$fps) with human-verified, per-instance segmentation masks. A $620$-frame control subset benchmarks three foundation-model auto-annotators: SAM3, GroundingSAM, and EfficientGroundingSAM, against human labels using Cohen’s $κ$, AP, precision, recall, and mask IoU. A further $2{,}552$-frame subset supports multi-object tracking with continuous identity tracks in MOTChallenge format. We establish detection, segmentation, and tracking baselines using YOLOv8n, YOLOv26n, and RT-DETR-L paired with ByteTrack, BoT-SORT, and OC-SORT. Per-scene analysis reveals substantial difficulty variation driven by crowd density, scale, and occlusion: ACS-EC, with $79.3%$ dense frames and a mean instance scale of $60.8$px, is the most challenging scene. The project page is available at https://sheepseb.github.io/IndoorCrowd/.
[210] Efficient Reasoning via Thought Compression for Language Segmentation
Qing Zhou, Shiyu Zhang, Yuyu Jia, Junyu Gao, Weiping Ni, Junzheng Wu, Qi Wang
Main category: cs.CV
TL;DR: WISE introduces an efficient reasoning paradigm for multimodal models that generates concise rationales first, then detailed explanations, enabling 5x shorter reasoning while maintaining state-of-the-art segmentation performance.
Details
Motivation: Chain-of-thought reasoning improves multimodal model performance but generates verbose rationales with high computational cost, limiting real-world applicability. There's a need for efficient reasoning that maintains performance while reducing computational overhead.Method: WISE trains models to generate structured sequences: concise rationale → final answer → detailed explanation. Uses autoregressive conditioning to make concise rationale sufficient for generating detailed explanation. Employs self-distillation objective for semantic fidelity and conciseness. At inference (WISE-S), uses brevity-focused prompting to activate learned concise policy.
Result: Achieves state-of-the-art zero-shot performance on ReasonSeg benchmark with 58.3 cIoU while reducing average reasoning length by 5x (from 112 to 23 tokens).
Conclusion: WISE enables efficient multimodal reasoning by internalizing detailed reasoning into compact forms, maintaining performance while dramatically reducing computational cost through structured generation and inference optimization.
Abstract: Chain-of-thought (CoT) reasoning has significantly improved the performance of large multimodal models in language-guided segmentation, yet its prohibitive computational cost, stemming from generating verbose rationales, limits real-world applicability. We introduce WISE (Wisdom from Internal Self-Exploration), a novel paradigm for efficient reasoning guided by the principle of \textit{thinking twice – once for learning, once for speed}. WISE trains a model to generate a structured sequence: a concise rationale, the final answer, and then a detailed explanation. By placing the concise rationale first, our method leverages autoregressive conditioning to enforce that the concise rationale acts as a sufficient summary for generating the detailed explanation. This structure is reinforced by a self-distillation objective that jointly rewards semantic fidelity and conciseness, compelling the model to internalize its detailed reasoning into a compact form. At inference, the detailed explanation is omitted. To address the resulting conditional distribution shift, our inference strategy, WISE-S, employs a simple prompting technique that injects a brevity-focused instruction into the user’s query. This final adjustment facilitates the robust activation of the learned concise policy, unlocking the full benefits of our framework. Extensive experiments show that WISE-S achieves state-of-the-art zero-shot performance on the ReasonSeg benchmark with 58.3 cIoU, while reducing the average reasoning length by nearly \textbf{5$\times$} – from 112 to just 23 tokens. Code is available at \href{https://github.com/mrazhou/WISE}{WISE}.
[211] Jagle: Building a Large-Scale Japanese Multimodal Post-Training Dataset for Vision-Language Models
Issa Sugiura, Keito Sasagawa, Keisuke Nakao, Koki Maeda, Ziqi Yin, Zhishen Yang, Shuhei Kurita, Yusuke Oda, Ryoko Tokuhisa, Daisuke Kawahara, Naoaki Okazaki
Main category: cs.CV
TL;DR: Jagle: A large-scale Japanese multimodal dataset created by generating VQA pairs from diverse sources (images, image-text pairs, PDFs) using VLM-based generation, translation, and text rendering, enabling strong Japanese VLM performance without degrading English capabilities.
Details
Motivation: Building high-quality multilingual and non-English vision-language models is challenging due to limited VQA datasets in languages other than English. Japanese specifically lacks large-scale, diverse multimodal training data, creating obstacles for developing Japanese VLMs.Method: Collect heterogeneous source data (images, image-text pairs, PDF documents) and generate VQA pairs through multiple strategies: VLM-based QA generation, translation from English datasets, and text rendering from PDFs. Create Jagle dataset with ~9.2M instances across diverse tasks.
Result: A 2.2B model trained with Jagle achieves strong Japanese performance, surpassing InternVL3.5-2B on ten Japanese evaluation tasks and approaching Qwen3-VL-2B-Instruct within 5 points. Combining Jagle with FineVision improves English performance compared to FineVision alone.
Conclusion: Jagle successfully addresses the Japanese multimodal data scarcity problem through creative data generation strategies, enabling development of high-quality Japanese VLMs without compromising English performance, with released dataset/models/code for reproducibility.
Abstract: Developing vision-language models (VLMs) that generalize across diverse tasks requires large-scale training datasets with diverse content. In English, such datasets are typically constructed by aggregating and curating numerous existing visual question answering (VQA) resources. However, this strategy does not readily extend to other languages, where VQA datasets remain limited in both scale and domain coverage, posing a major obstacle to building high-quality multilingual and non-English VLMs. In this work, we introduce Jagle, the largest Japanese multimodal post-training dataset to date, comprising approximately 9.2 million instances across diverse tasks. Rather than relying on existing VQA datasets, we collect heterogeneous source data, including images, image-text pairs, and PDF documents, and generate VQA pairs through multiple strategies such as VLM-based QA generation, translation, and text rendering. Experiments demonstrate that a 2.2B model trained with Jagle achieves strong performance on Japanese tasks, surpassing InternVL3.5-2B in average score across ten Japanese evaluation tasks and approaching within five points of Qwen3-VL-2B-Instruct. Furthermore, combining Jagle with FineVision does not degrade English performance; instead, it improves English performance compared to training with FineVision alone. To facilitate reproducibility and future research, we release the dataset, trained models, and code.
[212] Mining Instance-Centric Vision-Language Contexts for Human-Object Interaction Detection
Soo Won Seo, KyungChae Lee, Hyungchan Cho, Taein Son, Nam Ik Cho, Jun Won Choi
Main category: cs.CV
TL;DR: InCoM-Net is a novel framework for Human-Object Interaction detection that integrates Vision-Language Model semantic knowledge with object detector features through instance-centric context mining and progressive aggregation.
Details
Motivation: Existing HOI detection methods using Vision-Language Models often fail to fully utilize diverse contextual cues across the entire scene, limiting their ability to perform nuanced interaction reasoning.Method: Proposes Instance-centric Context Mining Network with two components: Instance-centric Context Refinement (ICR) extracts intra-instance, inter-instance, and global contextual cues from VLM features; Progressive Context Aggregation (ProCA) iteratively fuses these multi-context features with instance-level detector features.
Result: Achieves state-of-the-art performance on HICO-DET and V-COCO benchmarks, surpassing previous HOI detection methods.
Conclusion: InCoM-Net effectively integrates VLM semantic knowledge with object detector features through comprehensive context mining, enabling deeper interaction reasoning for improved HOI detection.
Abstract: Human-Object Interaction (HOI) detection aims to localize human-object pairs and classify their interactions from a single image, a task that demands strong visual understanding and nuanced contextual reasoning. Recent approaches have leveraged Vision-Language Models (VLMs) to introduce semantic priors, significantly improving HOI detection performance. However, existing methods often fail to fully capitalize on the diverse contextual cues distributed across the entire scene. To overcome these limitations, we propose the Instance-centric Context Mining Network (InCoM-Net)-a novel framework that effectively integrates rich semantic knowledge extracted from VLMs with instance-specific features produced by an object detector. This design enables deeper interaction reasoning by modeling relationships not only within each detected instance but also across instances and their surrounding scene context. InCoM-Net comprises two core components: Instancecentric Context Refinement (ICR), which separately extracts intra-instance, inter-instance, and global contextual cues from VLM-derived features, and Progressive Context Aggregation (ProCA), which iteratively fuses these multicontext features with instance-level detector features to support high-level HOI reasoning. Extensive experiments on the HICO-DET and V-COCO benchmarks show that InCoM-Net achieves state-of-the-art performance, surpassing previous HOI detection methods. Code is available at https://github.com/nowuss/InCoM-Net.
[213] True to Tone? Quantifying Skin Tone Fidelity and Bias in Photographic-to-Virtual Human Pipelines
Gabriel Ferri Schneider, Erick Menezes, Rafael Mecenas, Paulo Knob, Victor Araujo, Soraia Raupp Musse
Main category: cs.CV
TL;DR: Automatic methodology for evaluating skin tone fidelity in Virtual Human rendering pipelines using colorimetric analysis and illumination compensation.
Details
Motivation: Accurate facial skin tone reproduction is crucial for realism, identity preservation, and fairness in Virtual Human rendering, but current avatar creation pipelines lack colorimetric calibration, introducing inconsistencies and bias.Method: Proposes a fully automatic workflow integrating skin color/illumination extraction, texture recolorization, real-time rendering, and quantitative color analysis. Uses facial images from Chicago Face Database with two extraction strategies: cheek-region sampling and multidimensional masking. Employs TRUST framework for lighting isolation without training. Evaluates skin tone consistency in CIELAB color space using ΔE metric and Individual Typology Angle.
Result: Analyzed approximately 19,848 rendered instances, showing phenotype-dependent behavior of extraction strategies and consistently higher colorimetric errors for darker skin tones.
Conclusion: The methodology enables systematic, large-scale evaluation of skin tone fidelity in VH pipelines without manual intervention, revealing biases in current approaches and providing tools for more equitable virtual human rendering.
Abstract: Accurate reproduction of facial skin tone is essential for realism, identity preservation, and fairness in Virtual Human (VH) rendering. However, most accessible avatar creation pipelines rely on photographic inputs that lack colorimetric calibration, which can introduce inconsistencies and bias. We propose a fully automatic and scalable methodology to systematically evaluate skin tone fidelity across the VH generation pipeline. Our approach defines a full workflow that integrates skin color and illumination extraction, texture recolorization, real-time rendering, and quantitative color analysis. Using facial images from the Chicago Face Database (CFD), we compare skin tone extraction strategies based on cheek-region sampling, following the literature, and multidimensional masking derived from full-face analysis. Additionally, we test both strategies with lighting isolation, using the pre-trained TRUST framework, employed without any training or optimization within our pipeline. Extracted skin tones are applied to MetaHuman textures and rendered under multiple lighting configurations. Skin tone consistency is evaluated objectively in the CIELAB color space using the $ΔE$ metric and the Individual Typology Angle (ITA). The proposed methodology operates without manual intervention and, with the exception of pre-trained illumination compensation modules, the pipeline does not include learning or training stages, enabling low computational cost and large-scale evaluation. Using this framework, we generate and analyze approximately 19,848 rendered instances. Our results show phenotype-dependent behavior of extraction strategies and consistently higher colorimetric errors for darker skin tones.
[214] LatentUM: Unleashing the Potential of Interleaved Cross-Modal Reasoning via a Latent-Space Unified Model
Jiachun Jin, Zetong Zhou, Xiao Yang, Hao Zhang, Pengfei Liu, Jun Zhu, Zhijie Deng
Main category: cs.CV
TL;DR: LatentUM is a unified model that represents all modalities in a shared semantic latent space, enabling efficient interleaved cross-modal reasoning and generation without pixel-space mediation.
Details
Motivation: Existing unified models require pixel decoding as a bridge between visual understanding and generation due to disjoint representations, which is inefficient and ineffective. There's a need for models that can perform interleaved cross-modal reasoning for tasks like visual thinking, self-reflection, and world modeling.Method: Introduces LatentUM with a shared semantic latent space that represents all modalities, eliminating pixel-space mediation. This allows flexible interleaved cross-modal reasoning and generation directly in the latent space.
Result: Achieves state-of-the-art on Visual Spatial Planning benchmark, improves visual generation through self-reflection, supports world modeling by predicting future visual states in latent space, and shows improved computational efficiency with reduced codec bias.
Conclusion: Shared semantic latent space representation enables more effective and efficient unified models for cross-modal reasoning and generation, advancing capabilities in visual understanding, generation, and world modeling.
Abstract: Unified models (UMs) hold promise for their ability to understand and generate content across heterogeneous modalities. Compared to merely generating visual content, the use of UMs for interleaved cross-modal reasoning is more promising and valuable, e.g., for solving understanding problems that require dense visual thinking, improving visual generation through self-reflection, or modeling visual dynamics of the physical world guided by stepwise action interventions. However, existing UMs necessitate pixel decoding as a bridge due to their disjoint visual representations for understanding and generation, which is both ineffective and inefficient. In this paper, we introduce LatentUM, a novel unified model that represents all modalities within a shared semantic latent space, eliminating the need for pixel-space mediation between visual understanding and generation. This design naturally enables flexible interleaved cross-modal reasoning and generation. Beyond improved computational efficiency, the shared representation substantially alleviates codec bias and strengthens cross-modal alignment, allowing LatentUM to achieve state-of-the-art performance on the Visual Spatial Planning benchmark, push the limits of visual generation through self-reflection, and support world modeling by predicting future visual states within the shared semantic latent space.
[215] COMPASS: Complete Multimodal Fusion via Proxy Tokens and Shared Spaces for Ubiquitous Sensing
Hao Wang, Yanyu Qian, Pengcheng Weng, Zixuan Xia, William Dan, Yangxin Xu, Fei Wang
Main category: cs.CV
TL;DR: COMPASS: A missing-modality fusion framework that maintains a fixed N-slot multimodal input by synthesizing proxy tokens for missing modalities, ensuring fusion completeness and robust cross-modal interaction.
Details
Motivation: Existing multimodal fusion methods struggle with missing modalities, leading to incomplete fusion and degraded cross-modal interaction when modalities are absent during inference.Method: COMPASS uses pairwise source-to-target generators in a shared latent space to synthesize proxy tokens for missing modalities, maintaining a fixed N-slot input structure. Combines proxy alignment, shared-space regularization, and per-proxy discriminative supervision.
Result: Outperforms prior methods on XRF55, MM-Fi, and OctoNet datasets under diverse single- and multiple-missing modality settings in the majority of scenarios.
Conclusion: Preserving a modality-complete fusion interface is a simple and effective design principle for robust multimodal sensing with missing modalities.
Abstract: Missing modalities remain a major challenge for multimodal sensing, because most existing methods adapt the fusion process to the observed subset by dropping absent branches, using subset-specific fusion, or reconstructing missing features. As a result, the fusion head often receives an input structure different from the one seen during training, leading to incomplete fusion and degraded cross-modal interaction. We propose COMPASS, a missing-modality fusion framework built on the principle of fusion completeness: the fusion head always receives a fixed N-slot multimodal input, with one token per modality slot. For each missing modality, COMPASS synthesizes a target-specific proxy token from the observed modalities using pairwise source-to-target generators in a shared latent space, and aggregates them into a single replacement token. To make these proxies both representation-compatible and task-informative, we combine proxy alignment, shared-space regularization, and per-proxy discriminative supervision. Experiments on XRF55, MM-Fi, and OctoNet under diverse single- and multiple-missing settings show that COMPASS outperforms prior methods on the large majority of scenarios. Our results suggest that preserving a modality-complete fusion interface is a simple and effective design principle for robust multimodal sensing.
[216] CASHG: Context-Aware Stylized Online Handwriting Generation
Jinsu Shin, Sungeun Hong, Jin Yeong Bak
Main category: cs.CV
TL;DR: CASHG is a context-aware stylized online handwriting generator that explicitly models inter-character connectivity for style-consistent sentence-level trajectory synthesis, addressing limitations in prior sequence modeling approaches.
Details
Motivation: Generating natural sentence-level online handwriting that faithfully reflects a writer's style is challenging because sentence synthesis demands context-dependent characters with stroke continuity and spacing. Prior methods treat these boundary properties as implicit outcomes of sequence modeling, which becomes unreliable at sentence scale and under limited compositional diversity.Method: CASHG uses a Character Context Encoder to obtain character identity and sentence-dependent context memory, fuses them in a bigram-aware sliding-window Transformer decoder that emphasizes local predecessor-current transitions, complemented by gated context fusion for sentence-level context. Training proceeds through a three-stage curriculum from isolated glyphs to full sentences.
Result: CASHG consistently improves Connectivity and Spacing Metrics (CSM) over comparison methods while remaining competitive in DTW-based trajectory similarity, with gains corroborated by human evaluation.
Conclusion: The proposed CASHG framework effectively addresses sentence-level online handwriting generation by explicitly modeling inter-character connectivity and using curriculum learning, outperforming prior methods in boundary-aware metrics.
Abstract: Online handwriting represents strokes as time-ordered trajectories, which makes handwritten content easier to transform and reuse in a wide range of applications. However, generating natural sentence-level online handwriting that faithfully reflects a writer’s style remains challenging, since sentence synthesis demands context-dependent characters with stroke continuity and spacing. Prior methods treat these boundary properties as implicit outcomes of sequence modeling, which becomes unreliable at the sentence scale and under limited compositional diversity. We propose CASHG, a context-aware stylized online handwriting generator that explicitly models inter-character connectivity for style-consistent sentence-level trajectory synthesis. CASHG uses a Character Context Encoder to obtain character identity and sentence-dependent context memory and fuses them in a bigram-aware sliding-window Transformer decoder that emphasizes local predecessor–current transitions, complemented by gated context fusion for sentence-level context.Training proceeds through a three-stage curriculum from isolated glyphs to full sentences, improving robustness under sparse transition coverage. We further introduce Connectivity and Spacing Metrics (CSM), a boundary-aware evaluation suite that quantifies cursive connectivity and spacing similarity. Under benchmark-matched evaluation protocols, CASHG consistently improves CSM over comparison methods while remaining competitive in DTW-based trajectory similarity, with gains corroborated by a human evaluation.
[217] CompassAD: Intent-Driven 3D Affordance Grounding in Functionally Competing Objects
Jingliang Li, Jindou Jia, Tuo An, Chuhao Zhou, Xiangyu Chen, Shilin Shan, Boyu Ma, Bofan Lyu, Gen Li, Jianfei Yang
Main category: cs.CV
TL;DR: A new 3D affordance grounding benchmark and method for selecting correct objects in confusing multi-object scenes based on implicit natural language intent, addressing the challenge where multiple objects share identical affordances but only one is task-appropriate.
Details
Motivation: Existing 3D affordance methods focus on isolated single objects with explicit category names, but real-world scenes contain confusing pairs where multiple objects share identical affordances (e.g., knife vs scissors for "cut the apple"). Current approaches sidestep this challenge by evaluating single objects with explicit queries.Method: Proposes CompassNet with two key modules: 1) Instance-bounded Cross Injection (ICI) constrains language-geometry alignment within object boundaries to prevent cross-object semantic leakage, and 2) Bi-level Contrastive Refinement (BCR) enforces discrimination at both geometric-group and point levels to sharpen distinctions between target and confusable surfaces.
Result: State-of-the-art results on both seen and unseen queries in the CompassAD benchmark (30 confusing object pairs, 16 affordance types, 6,422 scenes, 88K+ query-answer pairs). Successful deployment on a robotic manipulator confirms effective transfer to real-world grasping in confusing multi-object scenes.
Conclusion: The work formalizes Multi-Object Affordance Grounding under Intent-Driven Instructions, addresses the critical challenge of confusing pairs in real-world scenes, and demonstrates practical applicability to robotics through successful real-world deployment.
Abstract: When told to “cut the apple,” a robot must choose the knife over nearby scissors, despite both objects affording the same cutting function. In real-world scenes, multiple objects may share identical affordances, yet only one is appropriate under the given task context. We call such cases confusing pairs. However, existing 3D affordance methods largely sidestep this challenge by evaluating isolated single objects, often with explicit category names provided in the query. We formalize Multi-Object Affordance Grounding under Intent-Driven Instructions, a new 3D affordance setting that requires predicting a per-point affordance mask on the correct object within a cluttered multi-object point cloud, conditioned on implicit natural language intent. To study this problem, we construct CompassAD, the first benchmark centered on implicit intent in confusable multi-object scenes. It comprises 30 confusing object pairs spanning 16 affordance types, 6,422 scenes, and 88K+ query-answer pairs. Furthermore, we propose CompassNet, a framework that incorporates two dedicated modules tailored to this task. Instance-bounded Cross Injection (ICI) constrains language-geometry alignment within object boundaries to prevent cross-object semantic leakage. Bi-level Contrastive Refinement (BCR) enforces discrimination at both geometric-group and point levels, sharpening distinctions between target and confusable surfaces. Extensive experiments demonstrate state-of-the-art results on both seen and unseen queries, and deployment on a robotic manipulator confirms effective transfer to real-world grasping in confusing multi-object scenes.
[218] KARL: Knowledge-Aware Reasoning and Reinforcement Learning for Knowledge-Intensive Visual Grounding
Xinyu Ma, Ziyang Ding, Zhicong Luo, Chi Chen, Zonghao Guo, Derek F. Wong, Zhen Zhao, Xiaoyi Feng, Maosong Sun
Main category: cs.CV
TL;DR: A knowledge-aware training paradigm for Knowledge-Intensive Visual Grounding (KVG) that addresses the gap between MLLMs’ internal knowledge and grounding predictions through knowledge-guided reasoning data and reinforcement learning.
Details
Motivation: Multimodal Large Language Models (MLLMs) have rich entity knowledge and strong generic grounding capabilities, but fail to effectively utilize this knowledge when grounding specialized concepts, revealing a knowledge-grounding gap between internal knowledge and grounding predictions.Method: Proposes a knowledge-aware training paradigm: 1) constructs knowledge-guided reasoning data to encourage models to activate domain-relevant entity knowledge during grounding, and 2) introduces KARL, a Knowledge-Aware Reinforcement Learning framework that adaptively modulates reward signals according to the model’s estimated knowledge mastery of different entities.
Result: The approach consistently outperforms a wide range of baseline models and achieves substantially stronger cross-domain generalization on unseen categories. Introduces KVG-Bench, a benchmark spanning 10 domains with 1.3K curated test cases covering 531 images and 882 entities.
Conclusion: The proposed knowledge-aware training paradigm effectively addresses the knowledge-grounding gap in MLLMs for Knowledge-Intensive Visual Grounding tasks, enabling better utilization of internal knowledge for specialized concept grounding.
Abstract: Knowledge-Intensive Visual Grounding (KVG) requires models to localize objects using fine-grained, domain-specific entity names rather than generic referring expressions. Although Multimodal Large Language Models (MLLMs) possess rich entity knowledge and strong generic grounding capabilities, they often fail to effectively utilize such knowledge when grounding specialized concepts, revealing a knowledge-grounding gap between internal knowledge and grounding predictions. To address this challenge, we propose a knowledge-aware training paradigm for KVG. Our approach first constructs knowledge-guided reasoning data to encourage models to activate domain-relevant entity knowledge during grounding, and then introduces KARL, a Knowledge-Aware Reinforcement Learning framework that adaptively modulates reward signals according to the model’s estimated knowledge mastery of different entities. To facilitate systematic evaluation, we introduce KVG-Bench, a benchmark spanning 10 domains with 1.3K curated test cases covering 531 images and 882 entities. Extensive experiments show that our approach consistently outperforms a wide range of baseline models and achieves substantially stronger cross-domain generalization on unseen categories. The data, codes, and models are released at https://github.com/thunlp/KARL.
[219] Network Structure in UK Payment Flows: Evidence on Economic Interdependencies and Implications for Real-Time Measurement
Aditya Humnabadkar
Main category: cs.CV
TL;DR: Network analysis of UK inter-industry payment flows (2017-2024) shows graph features improve payment forecasting by 8.8%, especially during economic disruptions like COVID-19.
Details
Motivation: Traditional bilateral measurement approaches fail to capture structural economic relationships in payment flows. The paper aims to demonstrate that network analysis can reveal invisible economic structures and improve real-time economic monitoring, particularly during disruptions when traditional methods fail.Method: Analyzed 532,346 UK payment records across 89 industry sectors from 2017-2024. Used graph-theoretic features including centrality measures and clustering coefficients to enhance payment flow forecasting models. Compared network-enhanced models against traditional time-series methods.
Result: Network features improved payment flow forecasting by 8.8 percentage points beyond traditional methods. During COVID-19, when traditional forecasting accuracy collapsed (R² falling from 0.38 to 0.19), network-enhanced models maintained substantially better performance with network contributions reaching +13.8 percentage points. Identified Financial Services, Wholesale Trade, and Professional Services as structurally central industries. Network density increased 12.5% over the sample period.
Conclusion: Payment network monitoring could enhance official statistics by providing leading indicators of structural economic change and improving nowcasting accuracy during periods when traditional temporal patterns prove unreliable. Network features are particularly valuable during economic disruptions.
Abstract: Network analysis of inter-industry payment flows reveals structural economic relationships invisible to traditional bilateral measurement approaches, with significant implications for real-time economic monitoring. Analysing 532,346 UK payment records (2017–2024) across 89 industry sectors, we demonstrate that graph-theoretic features which include centrality measures and clustering coefficients improve payment flow forecasting by 8.8 percentage points beyond traditional time-series methods. Critically, network features prove most valuable during economic disruptions: during the COVID-19 pandemic, when traditional forecasting accuracy collapsed (R2} falling from 0.38 to 0.19), network-enhanced models maintained substantially better performance, with network contributions reaching +13.8 percentage points. The analysis identifies Financial Services, Wholesale Trade, and Professional Services as structurally central industries whose network positions indicate systemic importance beyond their transaction volumes. Network density increased 12.5% over the sample period, with visible disruption during 2020 followed by recovery exceeding pre-pandemic integration levels. These findings suggest payment network monitoring could enhance official statistics production by providing leading indicators of structural economic change and improving nowcasting accuracy during periods when traditional temporal patterns prove unreliable.
[220] PLUME: Latent Reasoning Based Universal Multimodal Embedding
Chenwei He, Xiangzhao Hao, Tianyu Yang, Yuxiang Ma, Yuheng Jia, Lingxiang Wu, Chaoyang Zhao, Haiyun Guo, Jinqiao Wang
Main category: cs.CV
TL;DR: PLUME introduces a latent reasoning framework for universal multimodal embedding that replaces explicit chain-of-thought rationales with continuous latent state rollouts, achieving 30x faster inference while maintaining or improving performance on multimodal retrieval tasks.
Details
Motivation: Current UME approaches that use explicit CoT rationales suffer from substantial inference overhead and compress rich multimodal evidence into a narrow textual bottleneck, limiting efficiency and effectiveness for complex multimodal queries.Method: PLUME replaces verbalized CoT with short autoregressive rollouts of continuous latent states, introduces semantic-anchor-guided transition adapters to steer reasoning trajectories, and uses progressive explicit-to-latent curriculum training to eliminate explicit CoT at inference.
Result: On the 78-task MMEB-v2 benchmark, PLUME outperforms explicit-CoT UME baselines while reducing reasoning from hundreds of generated tokens to fewer than 10 latent steps, delivering over 30x faster inference, especially effective for video and visual document retrieval.
Conclusion: Structured latent computation preserves the benefits of intermediate reasoning without explicit rationale generation overhead, providing a stronger and more efficient paradigm for practical multimodal retrieval systems.
Abstract: Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by generating explicit chain-of-thought (CoT) rationales before extracting embeddings, enabling multimodal large language models to better infer complex query intent. However, explicit CoT incurs substantial inference overhead and can compress rich multimodal evidence into a narrow textual bottleneck. We propose PLUME, a latent reasoning framework that advances UME by replacing verbalized CoT with a short autoregressive rollout of continuous latent states. To support diverse multimodal queries, PLUME further introduces a semantic-anchor-guided transition adapter that steers latent rollout along different reasoning trajectories under the same fixed computation budget. To stabilize training, PLUME adopts a progressive explicit-to-latent curriculum that uses verbalized reasoning only as a temporary training scaffold and gradually transfers this behavior into hidden-state computation, eliminating explicit CoT at inference. On the 78-task MMEB-v2 benchmark, PLUME outperforms strong explicit-CoT UME baselines while reducing reasoning from hundreds of generated tokens to fewer than 10 latent steps, delivering over 30x faster inference. PLUME is especially well suited to retrieval settings where relevant evidence is dense, structurally complex, and difficult to organize through verbalized intermediate rationales, such as video and visual document retrieval. These results show that structured latent computation can preserve the benefits of intermediate reasoning without the overhead of explicit rationale generation, providing a stronger and more efficient paradigm for practical retrieval systems.
[221] FlowSlider: Training-Free Continuous Image Editing via Fidelity-Steering Decomposition
Taichi Endo, Guoqing Hao, Kazuhiko Sumi
Main category: cs.CV
TL;DR: FlowSlider is a training-free method for continuous image editing using Rectified Flow that provides slider-style control by decomposing edits into orthogonal fidelity and steering terms.
Details
Motivation: Existing learning-based slider methods require auxiliary modules with synthetic/proxy supervision, introducing training overhead and coupling behavior to training distributions, reducing reliability under distribution shifts.Method: Decomposes FlowEdit’s update into: (1) fidelity term (source-conditioned stabilizer preserving identity/structure), and (2) steering term (driving semantic transition toward target). These orthogonal terms enable strength control by scaling only steering while keeping fidelity unchanged.
Result: Provides smooth and reliable control without post-training, improving continuous editing quality across diverse tasks through geometric analysis and empirical measurements showing term orthogonality.
Conclusion: FlowSlider offers training-free continuous image editing with stable strength control by leveraging orthogonal decomposition in Rectified Flow, addressing limitations of learning-based methods.
Abstract: Continuous image editing aims to provide slider-style control of edit strength while preserving source-image fidelity and maintaining a consistent edit direction. Existing learning-based slider methods typically rely on auxiliary modules trained with synthetic or proxy supervision. This introduces additional training overhead and couples slider behavior to the training distribution, which can reduce reliability under distribution shifts in edits or domains. We propose \textit{FlowSlider}, a training-free method for continuous editing in Rectified Flow that requires no post-training. \textit{FlowSlider} decomposes FlowEdit’s update into (i) a fidelity term, which acts as a source-conditioned stabilizer that preserves identity and structure, and (ii) a steering term that drives semantic transition toward the target edit. Geometric analysis and empirical measurements show that these terms are approximately orthogonal, enabling stable strength control by scaling only the steering term while keeping the fidelity term unchanged. As a result, \textit{FlowSlider} provides smooth and reliable control without post-training, improving continuous editing quality across diverse tasks.
[222] Center-Aware Detection with Swin-based Co-DETR Framework for Cervical Cytology
Yan Kong, Yuan Yin, Hongan Chen, Yuqi Fang, Caifeng Shan
Main category: cs.CV
TL;DR: Winning solution for cervical cytology challenge using Co-DINO with Swin-Large backbone, center-point prediction formulation, and specialized data augmentation for cell detection in Pap smear images.
Details
Motivation: Automated analysis of Pap smear images is critical for cervical cancer screening but challenging due to dense cell distribution and complex morphology. The paper addresses the RIVA Cervical Cytology Challenge to develop effective detection methods for cytology image analysis.Method: Uses Co-DINO framework with Swin-Large backbone for multi-scale feature extraction. Formulates detection as center-point prediction problem due to fixed-size bounding box annotations. Introduces center-preserving data augmentation and analytical geometric box optimization to handle localization jitter. Applies track-specific loss tuning to adapt loss weights for each task.
Result: Achieved 1st place in Track B and 2nd place in Track A of the RIVA Cervical Cytology Challenge. Targeted optimizations improved detection performance, providing an effective pipeline for cytology image analysis.
Conclusion: The proposed approach with Co-DINO framework, center-point prediction formulation, and specialized optimizations provides an effective solution for automated cervical cytology analysis, demonstrating strong performance in the challenge.
Abstract: Automated analysis of Pap smear images is critical for cervical cancer screening but remains challenging due to dense cell distribution and complex morphology. In this paper, we present our winning solution for the RIVA Cervical Cytology Challenge, achieving 1st place in Track B and 2nd place in Track A. Our approach leverages a powerful baseline, integrating the Co-DINO framework with a Swin-Large backbone for robust multi-scale feature extraction. To address the dataset’s unique fixed-size bounding box annotations, we formulate the detection task as a center-point prediction problem. Tailoring our approach to this formulation, we introduce a center-preserving data augmentation strategy and an analytical geometric box optimization to effectively absorb localization jitter. Finally, we apply track-specific loss tuning to adapt the loss weights for each task. Experiments demonstrate that our targeted optimizations improve detection performance, providing an effective pipeline for cytology image analysis. Our code is available at https://github.com/YanKong0408/Center-DETR.
[223] GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding
Rong Fan, Kaiyan Xiao, Minghao Zhu, Liuyi Wang, Kai Dai, Zhao Yang
Main category: cs.CV
TL;DR: GroundVTS: A video large language model architecture that uses query-guided visual token sampling to focus on informative temporal segments for better video temporal grounding.
Details
Motivation: Existing video LLMs use uniform frame sampling which loses crucial temporal cues and results in sparse key frame distribution, limiting their ability for precise video temporal grounding tasks.Method: Proposes Grounded Visual Token Sampling (GroundVTS) with query-guided mechanism to filter visual tokens before LLM processing, plus progressive optimization strategy to adapt LLM to non-uniform visual feature distribution.
Result: Outperforms existing methods on three VTG benchmarks with 7.7-point mIoU improvement for moment retrieval and 12.0-point mAP improvement for highlight detection.
Conclusion: GroundVTS effectively addresses temporal information loss in video LLMs and enables precise video localization through focused attention on informative temporal segments.
Abstract: Video temporal grounding (VTG) is a critical task in video understanding and a key capability for extending video large language models (Vid-LLMs) to broader applications. However, existing Vid-LLMs rely on uniform frame sampling to extract video information, resulting in a sparse distribution of key frames and the loss of crucial temporal cues. To address this limitation, we propose Grounded Visual Token Sampling (GroundVTS), a Vid-LLM architecture that focuses on the most informative temporal segments. GroundVTS employs a fine-grained, query-guided mechanism to filter visual tokens before feeding them into the LLM, thereby preserving essential spatio-temporal information and maintaining temporal coherence. Futhermore, we introduce a progressive optimization strategy that enables the LLM to effectively adapt to the non-uniform distribution of visual features, enhancing its ability to model temporal dependencies and achieve precise video localization. We comprehensively evaluate GroundVTS on three standard VTG benchmarks, where it outperforms existing methods, achieving a 7.7-point improvement in mIoU for moment retrieval and 12.0-point improvement in mAP for highlight detection. Code is available at https://github.com/Florence365/GroundVTS.
[224] Which Way Does Time Flow? A Psychophysics-Grounded Evaluation for Vision-Language Models
Shiho Matta, Lis Kanashiro Pereira, Peitao Han, Fei Cheng, Shigeru Kitazawa
Main category: cs.CV
TL;DR: VLMs struggle with temporal reasoning, performing near chance on judging video playback direction (forward vs backward), revealing a fundamental gap in temporal understanding despite strong visual-semantic capabilities.
Details
Motivation: While VLMs excel at many multimodal tasks, their grasp of temporal information in video remains weak and inadequately evaluated. The authors aim to probe this gap using a simple but revealing challenge: judging the arrow of time in short video clips.Method: Introduces AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Comprehensive evaluation of open-weight and proprietary VLMs, both reasoning and non-reasoning models.
Result: Most models perform near chance, and even the best model lags far behind human accuracy on physically irreversible processes (free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly.
Conclusion: Current multimodal systems lack the inductive biases required for temporal continuity and causal understanding, despite capturing rich visual-semantic correlations. The benchmark is released to encourage progress in physical and temporal reasoning capabilities of VLMs.
Abstract: Modern vision-language models (VLMs) excel at many multimodal tasks, yet their grasp of temporal information in video remains weak and has not been adequately evaluated. We probe this gap with a deceptively simple but revealing challenge: judging the arrow of time (AoT)-whether a short clip is played forward or backward. We introduce AoT-PsyPhyBENCH, a psychophysically validated benchmark that tests whether VLMs can infer temporal direction in natural videos using the same stimuli and behavioral baselines established for humans. Our comprehensive evaluation of open-weight and proprietary, reasoning and non-reasoning VLMs reveals that most models perform near chance, and even the best model lags far behind human accuracy on physically irreversible processes (e.g., free fall, diffusion/explosion) and causal manual actions (division/addition) that humans recognize almost instantly. These results highlight a fundamental gap in current multimodal systems: while they capture rich visual-semantic correlations, they lack the inductive biases required for temporal continuity and causal understanding. We release the code and data for AoT-PsyPhyBENCH to encourage further progress in the physical and temporal reasoning capabilities of VLMs.
[225] CoRegOVCD: Consistency-Regularized Open-Vocabulary Change Detection
Weidong Tang, Hanbin Sun, Zihan Li, Yikai Wang, Feifan Zhang
Main category: cs.CV
TL;DR: Training-free open-vocabulary change detection framework using competitive posterior calibration and geometry consistency to improve semantic change evidence quality.
Details
Motivation: Existing change detection methods assume fixed label spaces and cannot handle arbitrary user queries. Open-vocabulary change detection addresses this but faces challenges in training-free settings where appearance variations and weak concept competition produce noisy, fragmented change evidence.Method: Proposes CoRegOVCD with Competitive Posterior Calibration (CPC) to convert raw concept responses into competition-aware posteriors, Semantic Posterior Delta (SPD) to quantify cross-temporal discrepancy, Geometry-Token Consistency Gate (GeoGate) for structural verification, and Regional Consensus Discrepancy (RCD) for spatial coherence.
Result: Improves over strongest previous training-free baseline by 2.24 to 4.98 F1_C points across four benchmarks, reaching 47.50% F1_C on SECOND dataset for six-class average.
Conclusion: CoRegOVCD effectively addresses challenges in training-free open-vocabulary change detection through posterior calibration and geometry consistency mechanisms, enabling reliable semantic change evidence without explicit instance matching.
Abstract: Remote sensing change detection (CD) aims to identify where land-cover semantics change across time, but most existing methods still assume a fixed label space and therefore cannot answer arbitrary user-defined queries. Open-vocabulary change detection (OVCD) instead asks for the change mask of a queried concept. In the fully training-free setting, however, dense concept responses are difficult to compare directly across dates: appearance variation, weak cross-concept competition, and the spatial continuity of many land-cover categories often produce noisy, fragmented, and semantically unreliable change evidence. We propose Consistency-Regularized Open-Vocabulary Change Detection (CoRegOVCD), a training-free dense inference framework that reformulates concept-specific change as calibrated posterior discrepancy. Competitive Posterior Calibration (CPC) and the Semantic Posterior Delta (SPD) convert raw concept responses into competition-aware queried-concept posteriors and quantify their cross-temporal discrepancy, making semantic change evidence more comparable without explicit instance matching. Geometry-Token Consistency Gate (GeoGate) and Regional Consensus Discrepancy (RCD) further suppress unsupported responses and improve spatial coherence through geometry-aware structural verification and regional consensus. Across four benchmarks spanning building-oriented and multi-class settings, CoRegOVCD consistently improves over the strongest previous training-free baseline by 2.24 to 4.98 F1$_C$ points and reaches a six-class average of 47.50% F1$_C$ on SECOND.
[226] Omni123: Exploring 3D Native Foundation Models with Limited 3D Data by Unifying Text to 2D and 3D Generation
Chongjie Ye, Cheng Cao, Chuanyu Pan, Yiming Hao, Yihao Zhi, Yuanming Hu, Xiaoguang Han
Main category: cs.CV
TL;DR: Omni123 is a 3D-native foundation model that unifies text-to-2D and text-to-3D generation in a single autoregressive framework, using cross-modal consistency between images and 3D as implicit structural constraints.
Details
Motivation: Multimodal LLMs excel at text-image understanding/generation but struggle with 3D due to limited 3D data. Existing methods use indirect 2D-to-3D pipelines that sacrifice geometric consistency. There's a need for native 3D generation with better geometric fidelity.Method: Represent text, images, and 3D as discrete tokens in shared sequence space. Use interleaved X-to-X training paradigm coordinating diverse cross-modal tasks over heterogeneous datasets without requiring fully aligned triplets. Leverage abundant 2D data as geometric prior through semantic-visual-geometric cycles in autoregressive sequences.
Result: Omni123 significantly improves text-guided 3D generation and editing, demonstrating better geometric consistency and appearance fidelity compared to existing methods.
Conclusion: The approach shows a scalable path toward multimodal 3D world models by unifying 2D and 3D generation in a single framework that enforces cross-modal consistency through autoregressive modeling.
Abstract: Recent multimodal large language models have achieved strong performance in unified text and image understanding and generation, yet extending such native capability to 3D remains challenging due to limited data. Compared to abundant 2D imagery, high-quality 3D assets are scarce, making 3D synthesis under-constrained. Existing methods often rely on indirect pipelines that edit in 2D and lift results into 3D via optimization, sacrificing geometric consistency. We present Omni123, a 3D-native foundation model that unifies text-to-2D and text-to-3D generation within a single autoregressive framework. Our key insight is that cross-modal consistency between images and 3D can serve as an implicit structural constraint. By representing text, images, and 3D as discrete tokens in a shared sequence space, the model leverages abundant 2D data as a geometric prior to improve 3D representations. We introduce an interleaved X-to-X training paradigm that coordinates diverse cross-modal tasks over heterogeneous paired datasets without requiring fully aligned text-image-3D triplets. By traversing semantic-visual-geometric cycles (e.g., text to image to 3D to image) within autoregressive sequences, the model jointly enforces semantic alignment, appearance fidelity, and multi-view geometric consistency. Experiments show that Omni123 significantly improves text-guided 3D generation and editing, demonstrating a scalable path toward multimodal 3D world models.
[227] Beyond the Fold: Quantifying Split-Level Noise and the Case for Leave-One-Dataset-Out AU Evaluation
Saurabh Hinduja, Gurmeet Kaur, Maneesh Bilalpur, Jeffrey Cohn, Shaun Canavan
Main category: cs.CV
TL;DR: Cross-validation in facial Action Unit detection introduces stochastic variance, with empirical noise floor of ±0.065 F1 on BP4D+. Leave-One-Dataset-Out protocol reveals domain-level instability not visible in single-dataset cross-validation.
Details
Motivation: To demonstrate that standard subject-exclusive cross-validation for facial Action Unit detection introduces measurable stochastic variance, potentially making reported small improvements fall within protocol variance rather than representing true model improvements.Method: Used repeated 3-fold subject-exclusive splits on BP4D+ dataset to measure empirical noise floor. Also implemented Leave-One-Dataset-Out (LODO) protocol across five AU datasets to evaluate cross-dataset robustness and domain-level instability.
Result: Found empirical noise floor of ±0.065 in average F1 on BP4D+, with larger variation for low-prevalence AUs. F1 fluctuates more than AUC. LODO protocol reveals domain-level instability not visible in single-dataset cross-validation and provides more stable model rankings.
Conclusion: Reported improvements in cross-fold validation may fall within protocol variance. Leave-one-dataset-out cross-validation yields more stable and interpretable findings for facial AU detection research.
Abstract: Subject-exclusive cross-validation is the standard evaluation protocol for facial Action Unit (AU) detection, yet reported improvements are often small. We show that cross-validation itself introduces measurable stochastic variance. On BP4D+, repeated 3-fold subject-exclusive splits produce an empirical noise floor of $\pm 0.065$ in average F1, with substantially larger variation for low-prevalence AUs. Operating-point metrics such as F1 fluctuate more than threshold-independent measures such as AUC, and model ranking can change under different fold assignments. We further evaluate cross-dataset robustness using a Leave-One-Dataset-Out (LODO) protocol across five AU datasets. LODO removes partition randomness and exposes domain-level instability that is not visible under single-dataset cross-validation. Together, these results suggest that gains often reported in cross-fold validation may fall within protocol variance. Leave-one-dataset-out cross-validation yields more stable and interpretable findings
[228] VOID: Video Object and Interaction Deletion
Saman Motamed, William Harvey, Benjamin Klein, Luc Van Gool, Zhuoning Yuan, Ta-Ying Cheng
Main category: cs.CV
TL;DR: VOID is a video object removal framework that performs physically-plausible inpainting for complex scenarios where removed objects have significant physical interactions like collisions, going beyond appearance-level corrections.
Details
Motivation: Current video object removal methods fail when removed objects have significant physical interactions (like collisions) with other objects, producing implausible results. There's a need for physically-plausible inpainting that corrects downstream physical consequences.Method: 1) Generate paired dataset of counterfactual object removals using Kubric and HUMOTO where removal requires altering downstream physical interactions. 2) Use vision-language model to identify affected scene regions. 3) Guide video diffusion model with these regions to generate physically consistent counterfactual outcomes.
Result: Experiments on synthetic and real data show VOID better preserves consistent scene dynamics after object removal compared to prior methods, producing more physically plausible results.
Conclusion: VOID demonstrates how video editing models can become better world simulators through high-level causal reasoning, addressing physical plausibility in complex object removal scenarios.
Abstract: Existing video object removal methods excel at inpainting content “behind” the object and correcting appearance-level artifacts such as shadows and reflections. However, when the removed object has more significant interactions, such as collisions with other objects, current models fail to correct them and produce implausible results. We present VOID, a video object removal framework designed to perform physically-plausible inpainting in these complex scenarios. To train the model, we generate a new paired dataset of counterfactual object removals using Kubric and HUMOTO, where removing an object requires altering downstream physical interactions. During inference, a vision-language model identifies regions of the scene affected by the removed object. These regions are then used to guide a video diffusion model that generates physically consistent counterfactual outcomes. Experiments on both synthetic and real data show that our approach better preserves consistent scene dynamics after object removal compared to prior video object removal methods. We hope this framework sheds light on how to make video editing models better simulators of the world through high-level causal reasoning.
[229] Reflection Generation for Composite Image Using Diffusion Model
Haonan Zhao, Qingyang Liu, Jiaxuan Chen, Li Niu
Main category: cs.CV
TL;DR: A method for generating physically coherent reflections when compositing foreground objects into backgrounds using foundation diffusion models with reflection placement and appearance priors, supported by a new large-scale dataset.
Details
Motivation: While shadow generation in image composition has been well-studied, reflection generation remains largely unexplored. The paper aims to address this gap by developing methods to generate realistic, physically coherent reflections when inserting objects into backgrounds.Method: Inject prior information about reflection placement and appearance into foundation diffusion models, use type-aware model design that distinguishes between different reflection types, and train on DEROBA - the first large-scale object reflection dataset.
Result: The method generates reflections that are physically coherent and visually realistic, establishing a new benchmark for reflection generation in image composition tasks.
Conclusion: The work successfully addresses the underexplored problem of reflection generation in image composition, providing both a novel method and a comprehensive dataset to advance the field.
Abstract: Image composition involves inserting a foreground object into the background while synthesizing environment-consistent effects such as shadows and reflections. Although shadow generation has been extensively studied, reflection generation remains largely underexplored. In this work, we focus on reflection generation. We inject the prior information of reflection placement and reflection appearance into foundation diffusion model. We also divide reflections into two types and adopt type-aware model design. To support training, we construct the first large-scale object reflection dataset DEROBA. Experiments demonstrate that our method generates reflections that are physically coherent and visually realistic, establishing a new benchmark for reflection generation.
[230] Steerable Visual Representations
Jona Ruthardt, Manu Gaur, Deva Ramanan, Makarand Tapaswi, Yuki M. Asano
Main category: cs.CV
TL;DR: Steerable Visual Representations enable language-guided control over visual features while maintaining generic visual representation quality, using early fusion via lightweight cross-attention in vision transformers.
Details
Motivation: Current pretrained vision transformers focus on salient visual cues without ability to direct attention to less prominent concepts, while multimodal LLMs produce language-centric representations that lose effectiveness for generic visual tasks. There's a need for representations that can be steered with natural language while preserving visual quality.Method: Introduces steerable visual representations using early fusion via lightweight cross-attention layers injected into visual encoder, allowing text to directly influence visual features at multiple levels rather than late fusion approaches like CLIP.
Result: Features can focus on any desired objects while preserving representation quality, matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibits zero-shot generalization to out-of-distribution tasks.
Conclusion: Steerable visual representations bridge the gap between generic visual features and language-guided attention, enabling flexible control over visual representations while maintaining their utility for downstream visual tasks.
Abstract: Pretrained Vision Transformers (ViTs) such as DINOv2 and MAE provide generic image features that can be applied to a variety of downstream tasks such as retrieval, classification, and segmentation. However, such representations tend to focus on the most salient visual cues in the image, with no way to direct them toward less prominent concepts of interest. In contrast, Multimodal LLMs can be guided with textual prompts, but the resulting representations tend to be language-centric and lose their effectiveness for generic visual tasks. To address this, we introduce Steerable Visual Representations, a new class of visual representations, whose global and local features can be steered with natural language. While most vision-language models (e.g., CLIP) fuse text with visual features after encoding (late fusion), we inject text directly into the layers of the visual encoder (early fusion) via lightweight cross-attention. We introduce benchmarks for measuring representational steerability, and demonstrate that our steerable visual features can focus on any desired objects in an image while preserving the underlying representation quality. Our method also matches or outperforms dedicated approaches on anomaly detection and personalized object discrimination, exhibiting zero-shot generalization to out-of-distribution tasks.
[231] Bias Is a Subspace, Not a Coordinate: A Geometric Rethinking of Post-hoc Debiasing in Vision-Language Models
Dachuan Zhao, Weiyue Li, Zhenda Shen, Yushu Qiu, Bowen Xu, Haoyu Chen, Yongchao Chen
Main category: cs.CV
TL;DR: SPD is a subspace projection debiasing method for Vision-Language Models that removes entire bias subspaces rather than just coordinate-wise features, achieving better fairness while preserving task performance.
Details
Motivation: VLMs encode and amplify demographic biases in their representations, leading to unfair associations and misaligned predictions. Current post-hoc debiasing methods that replace attribute-correlated coordinates have limitations: feature entanglement, poor generalization, and incomplete bias removal.Method: Proposes Subspace Projection Debiasing (SPD) - a geometrically principled framework that identifies and removes entire subspaces of linearly decodable bias while reinserting a neutral mean component to preserve semantic fidelity.
Result: Extensive experiments across zero-shot classification, text-to-image retrieval, and image generation show SPD achieves more robust debiasing with 18.5% average improvement across four fairness metrics while maintaining minimal task performance loss.
Conclusion: Bias in VLMs is distributed across linear subspaces rather than isolated coordinates. SPD provides an effective subspace-level debiasing approach that addresses limitations of coordinate-wise methods while preserving model utility.
Abstract: Vision-Language Models (VLMs) have become indispensable for multimodal reasoning, yet their representations often encode and amplify demographic biases, resulting in biased associations and misaligned predictions in downstream tasks. Such behavior undermines fairness and distorts the intended alignment between vision and language. Recent post-hoc approaches attempt to mitigate bias by replacing the most attribute-correlated embedding coordinates with neutral values. However, our systematic analysis reveals three critical limitations of this coordinate-wise approach: feature entanglement, poor cross-dataset generalization, and incomplete bias removal. We find that bias is not localized to a few coordinates but is instead distributed across a few linear subspaces. To address these limitations, we propose $\textbf{S}$ubspace $\textbf{P}$rojection $\textbf{D}$ebiasing ($\textbf{SPD}$), a geometrically principled framework that identifies and removes the entire subspace of linearly decodable bias while reinserting a neutral mean component to preserve semantic fidelity. Extensive experiments across zero-shot classification, text-to-image retrieval, and image generation validate the effectiveness of SPD: our method achieves more robust debiasing with an average improvement of $18.5%$ across four fairness metrics, while maintaining minimal loss in task performance compared to the best debiasing baseline.
[232] ViT-Explainer: An Interactive Walkthrough of the Vision Transformer Pipeline
Juan Manuel Hernandez, Mariana Fernandez-Espinosa, Denis Parra, Diego Gomez-Zara
Main category: cs.CV
TL;DR: ViT-Explainer: An interactive web-based visualization system for understanding Vision Transformer inference from patch tokenization to final classification.
Details
Motivation: Transformer architectures dominate NLP and CV but remain challenging to interpret, especially for vision where images are processed as patch token sequences. Existing tools focus on isolated components or expert analysis, lacking guided end-to-end understanding of the full inference pipeline.Method: Developed ViT-Explainer, a web-based interactive system with integrated visualization of Vision Transformer inference. Combines animated walkthroughs, patch-level attention overlays, and vision-adapted Logit Lens in both guided and free exploration modes.
Result: User study with six participants suggests ViT-Explainer is easy to learn and use, helping users interpret and understand Vision Transformer behavior.
Conclusion: ViT-Explainer bridges the gap in guided, end-to-end understanding of Vision Transformers through interactive visualization tools that make model behavior more interpretable.
Abstract: Transformer-based architectures have become the shared backbone of natural language processing and computer vision. However, understanding how these models operate remains challenging, particularly in vision settings, where images are processed as sequences of patch tokens. Existing interpretability tools often focus on isolated components or expert-oriented analysis, leaving a gap in guided, end-to-end understanding of the full inference pipeline. To bridge this gap, we present ViT-Explainer, a web-based interactive system that provides an integrated visualization of Vision Transformer inference, from patch tokenization to final classification. The system combines animated walkthroughs, patch-level attention overlays, and a vision-adapted Logit Lens within both guided and free exploration modes. A user study with six participants suggests that ViT-Explainer is easy to learn and use, helping users interpret and understand Vision Transformer behavior.
[233] ActionParty: Multi-Subject Action Binding in Generative Video Games
Alexander Pondaven, Ziyi Wu, Igor Gilitschenski, Philip Torr, Sergey Tulyakov, Fabio Pizzati, Aliaksandr Siarohin
Main category: cs.CV
TL;DR: ActionParty: A video diffusion world model that enables multi-agent control in generative video games by introducing subject state tokens to bind actions to specific agents.
Details
Motivation: Current video diffusion models for world simulation are limited to single-agent settings and struggle with action binding - associating specific actions with their corresponding subjects in multi-agent scenarios.Method: Introduces subject state tokens as latent variables that persistently capture each subject’s state. Uses joint modeling of state tokens and video latents with spatial biasing mechanism to disentangle global frame rendering from individual action-controlled subject updates.
Result: First video world model capable of controlling up to seven players simultaneously across 46 diverse Melting Pot environments. Shows significant improvements in action-following accuracy and identity consistency, enabling robust autoregressive tracking through complex interactions.
Conclusion: ActionParty successfully addresses the action binding problem in video diffusion models, enabling multi-agent control in generative video games with improved action-subject association and identity preservation.
Abstract: Recent advances in video diffusion have enabled the development of “world models” capable of simulating interactive environments. However, these models are largely restricted to single-agent settings, failing to control multiple agents simultaneously in a scene. In this work, we tackle a fundamental issue of action binding in existing video diffusion models, which struggle to associate specific actions with their corresponding subjects. For this purpose, we propose ActionParty, an action controllable multi-subject world model for generative video games. It introduces subject state tokens, i.e. latent variables that persistently capture the state of each subject in the scene. By jointly modeling state tokens and video latents with a spatial biasing mechanism, we disentangle global video frame rendering from individual action-controlled subject updates. We evaluate ActionParty on the Melting Pot benchmark, demonstrating the first video world model capable of controlling up to seven players simultaneously across 46 diverse environments. Our results show significant improvements in action-following accuracy and identity consistency, while enabling robust autoregressive tracking of subjects through complex interactions.
[234] CXR-LT 2026 Challenge: Projection-Aware Multi-Label and Zero-Shot Chest X-Ray Classification
Juno Cho, Dohui Kim, Mingeon Kim, Hyunseo Jang, Chang Sun Lee, Jong Chul Ye
Main category: cs.CV
TL;DR: Multi-label classification framework for chest X-ray lesions with projection-specific models and zero-shot classification using dual-branch architecture with contrastive learning and LLM-generated prompts.
Details
Motivation: To address the challenges of multi-label classification for known chest X-ray lesions and zero-shot classification for unseen lesions, while handling diverse CXR projections and severe long-tail imbalances in medical imaging data.Method: 1) Integrates projection-specific models via classification network for unified multi-label classification; 2) Extends CheXzero with dual-branch architecture combining contrastive learning, Asymmetric Loss (ASL), and LLM-generated descriptive prompts for zero-shot classification; 3) Uses strong data and test-time augmentations for robustness.
Result: The framework effectively mitigates severe long-tail imbalances and maximizes zero-shot generalization across both tasks, with robustness ensured through data and test-time augmentations.
Conclusion: The proposed unified framework successfully addresses multi-label and zero-shot classification challenges in chest X-ray analysis through projection-specific modeling and advanced dual-branch architecture with LLM integration.
Abstract: This challenge tackles multi-label classification for known chest X-ray (CXR) lesions and zero-shot classification for unseen ones. To handle diverse CXR projections, we integrate projection-specific models via a classification network into a unified framework. For zero-shot classification (Task 2), we extend CheXzero with a novel dual-branch architecture that combines contrastive learning, Asymmetric Loss (ASL), and LLM-generated descriptive prompts. This effectively mitigates severe long-tail imbalances and maximizes zero-shot generalization. Additionally, strong data and test-time augmentations (TTA) ensure robustness across both tasks.
[235] Lightweight Spatiotemporal Highway Lane Detection via 3D-ResNet and PINet with ROI-Aware Attention
Sorna Shanmuga Raja, Abdelhafid Zenati
Main category: cs.CV
TL;DR: Lightweight 3D CNN architecture for highway lane detection combining spatial-temporal information with instance segmentation, achieving 93.40% accuracy on TuSimple dataset with reduced computational complexity.
Details
Motivation: Need for robust, real-time lane detection in autonomous driving systems that can handle complex real-world scenarios with temporal consistency and computational efficiency.Method: Two models combining 3D-ResNet encoder with Point Instance Network decoder: 1) FPN + Self-Attention for multi-scale feature refinement, 2) ROI detection head for lane-relevant region focus.
Result: 93.40% accuracy on TuSimple dataset, reduced false negatives, fewer parameters, lower latency compared to 2D/3D baselines, validated in real-time inference.
Conclusion: Proposed models are suitable for ADAS integration with potential for Lane Assist Systems, offering robust performance with computational efficiency.
Abstract: This paper presents a lightweight, end-to-end highway lane detection architecture that jointly captures spatial and temporal information for robust performance in real-world driving scenarios. Building on the strengths of 3D convolutional neural networks and instance segmentation, we propose two models that integrate a 3D-ResNet encoder with a Point Instance Network (PINet) decoder. The first model enhances multi-scale feature representation using a Feature Pyramid Network (FPN) and Self-Attention mechanism to refine spatial dependencies. The second model introduces a Region of Interest (ROI) detection head to selectively focus on lane-relevant regions, thereby improving precision and reducing computational complexity. Experiments conducted on the TuSimple dataset (highway driving scenarios) demonstrate that the proposed second model achieves 93.40% accuracy while significantly reducing false negatives. Compared to existing 2D and 3D baselines, our approach achieves improved performance with fewer parameters and reduced latency. The architecture has been validated through offline training and real-time inference in the Autonomous Systems Laboratory at City, St George’s University of London. These results suggest that the proposed models are well-suited for integration into Advanced Driver Assistance Systems (ADAS), with potential scalability toward full Lane Assist Systems (LAS).
[236] UniDriveVLA: Unifying Understanding, Perception, and Action Planning for Autonomous Driving
Yongkang Li, Lijun Zhou, Sixu Yan, Bencheng Liao, Tianyi Yan, Kaixin Xiong, Long Chen, Hongwei Xie, Bing Wang, Guang Chen, Hangjun Ye, Wenyu Liu, Haiyang Sun, Xinggang Wang
Main category: cs.CV
TL;DR: UniDriveVLA is a unified driving vision-language-action model that addresses the perception-reasoning conflict in autonomous driving through expert decoupling and Mixture-of-Transformers architecture.
Details
Motivation: Existing VLA models for autonomous driving face a critical dilemma between spatial perception and semantic reasoning, forcing suboptimal compromises where enhancing spatial perception impairs reasoning capacity or vice versa.Method: Proposes UniDriveVLA with three experts (driving understanding, scene perception, action planning) coordinated through masked joint attention, combined with sparse perception paradigm and three-stage progressive training strategy.
Result: Achieves state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive, with strong performance across perception, prediction, and understanding tasks.
Conclusion: UniDriveVLA successfully addresses the perception-reasoning conflict and demonstrates broad applicability as a unified model for autonomous driving with strong multimodal capabilities.
Abstract: Vision-Language-Action (VLA) models have recently emerged in autonomous driving, with the promise of leveraging rich world knowledge to improve the cognitive capabilities of driving systems. However, adapting such models for driving tasks currently faces a critical dilemma between spatial perception and semantic reasoning. Consequently, existing VLA systems are forced into suboptimal compromises: directly adopting 2D Vision-Language Models yields limited spatial perception, whereas enhancing them with 3D spatial representations often impairs the native reasoning capacity of VLMs. We argue that this dilemma largely stems from the coupled optimization of spatial perception and semantic reasoning within shared model parameters. To overcome this, we propose UniDriveVLA, a Unified Driving Vision-Language-Action model based on Mixture-of-Transformers that addresses the perception-reasoning conflict via expert decoupling. Specifically, it comprises three experts for driving understanding, scene perception, and action planning, which are coordinated through masked joint attention. In addition, we combine a sparse perception paradigm with a three-stage progressive training strategy to improve spatial perception while maintaining semantic reasoning capability. Extensive experiments show that UniDriveVLA achieves state-of-the-art performance in open-loop evaluation on nuScenes and closed-loop evaluation on Bench2Drive. Moreover, it demonstrates strong performance across a broad range of perception, prediction, and understanding tasks, including 3D detection, online mapping, motion forecasting, and driving-oriented VQA, highlighting its broad applicability as a unified model for autonomous driving. Code and model have been released at https://github.com/xiaomi-research/unidrivevla
[237] SCALE: Semantic- and Confidence-Aware Conditional Variational Autoencoder for Zero-shot Skeleton-based Action Recognition
Soroush Oraki, Feng Ding, Jie Liang
Main category: cs.CV
TL;DR: SCALE: A lightweight deterministic framework for zero-shot skeleton-based action recognition using semantic- and confidence-aware listwise energy ranking with conditional VAEs.
Details
Motivation: Existing zero-shot skeleton-based action recognition methods rely on explicit skeleton-text alignment, which is brittle when action names underspecify fine-grained dynamics and when unseen classes are semantically confusable.Method: Proposes SCALE framework with text-conditioned Conditional VAE where frozen text representations parameterize latent prior and decoder. Introduces semantic- and confidence-aware listwise energy loss emphasizing hard negatives and incorporating posterior uncertainty. Uses latent prototype contrast objective to align posterior means with text-derived prototypes.
Result: Experiments on NTU-60 and NTU-120 datasets show SCALE consistently improves over prior VAE- and alignment-based baselines while remaining competitive with diffusion-based methods.
Conclusion: SCALE provides an effective deterministic framework for zero-shot skeleton-based action recognition that addresses limitations of explicit alignment methods through energy-based ranking and semantic-aware objectives.
Abstract: Zero-shot skeleton-based action recognition (ZSAR) aims to recognize action classes without any training skeletons from those classes, relying instead on auxiliary semantics from text. Existing approaches frequently depend on explicit skeleton-text alignment, which can be brittle when action names underspecify fine-grained dynamics and when unseen classes are semantically confusable. We propose SCALE, a lightweight and deterministic Semantic- and Confidence-Aware Listwise Energy-based framework that formulates ZSAR as class-conditional energy ranking. SCALE builds a text-conditioned Conditional Variational Autoencoder where frozen text representations parameterize both the latent prior and the decoder, enabling likelihood-based evaluation for unseen classes without generating samples at test time. To separate competing hypotheses, we introduce a semantic- and confidence-aware listwise energy loss that emphasizes semantically similar hard negatives and incorporates posterior uncertainty to adapt decision margins and reweight ambiguous training instances. Additionally, we utilize a latent prototype contrast objective to align posterior means with text-derived latent prototypes, improving semantic organization and class separability without direct feature matching. Experiments on NTU-60 and NTU-120 datasets show that SCALE consistently improves over prior VAE- and alignment-based baselines while remaining competitive with diffusion-based methods.
[238] UAV-Track VLA: Embodied Aerial Tracking via Vision-Language-Action Models
Qiyao Zhang, Shuhua Zheng, Jianli Sun, Chengxiang Li, Xianke Wu, Zihan Song, Zhiyong Cui, Yisheng Lv, Yonglin Tian
Main category: cs.CV
TL;DR: UAV-Track VLA: An improved Vision-Language-Action model for embodied visual tracking in UAVs with temporal compression and spatial-aware decoders for real-time control.
Details
Motivation: Embodied visual tracking is crucial for UAVs in complex urban scenarios, but existing VLA models suffer from temporal feature redundancy and lack spatial geometric priors needed for precise tracking.Method: Proposes UAV-Track VLA built on π₀.₅ architecture with: 1) temporal compression net to capture inter-frame dynamics efficiently, 2) parallel dual-branch decoder with spatial-aware auxiliary grounding head and flow matching action expert to decouple cross-modal features and generate fine-grained continuous actions.
Result: Achieves 61.76% success rate and 269.65 average tracking frames in long-distance pedestrian tracking, outperforms baselines, shows robust zero-shot generalization, reduces inference latency by 33.4% to 0.0571s, enabling real-time UAV control.
Conclusion: UAV-Track VLA demonstrates superior end-to-end performance for embodied visual tracking with efficient real-time capabilities and strong generalization, validated through systematic experiments in CARLA simulator.
Abstract: Embodied visual tracking is crucial for Unmanned Aerial Vehicles (UAVs) executing complex real-world tasks. In dynamic urban scenarios with complex semantic requirements, Vision-Language-Action (VLA) models show great promise due to their cross-modal fusion and continuous action generation capabilities. To benchmark multimodal tracking in such environments, we construct a dedicated evaluation benchmark and a large-scale dataset encompassing over 890K frames, 176 tasks, and 85 diverse objects. Furthermore, to address temporal feature redundancy and the lack of spatial geometric priors in existing VLA models, we propose an improved VLA tracking model, UAV-Track VLA. Built upon the $π_{0.5}$ architecture, our model introduces a temporal compression net to efficiently capture inter-frame dynamics. Additionally, a parallel dual-branch decoder comprising a spatial-aware auxiliary grounding head and a flow matching action expert is designed to decouple cross-modal features and generate fine-grained continuous actions. Systematic experiments in the CARLA simulator validate the superior end-to-end performance of our method. Notably, in challenging long-distance pedestrian tracking tasks, UAV-Track VLA achieves a 61.76% success rate and 269.65 average tracking frames, significantly outperforming existing baselines. Furthermore, it demonstrates robust zero-shot generalization in unseen environments and reduces single-step inference latency by 33.4% (to 0.0571s) compared to the original $π_{0.5}$, enabling highly efficient, real-time UAV control. Data samples and demonstration videos are available at: https://github.com/Hub-Tian/UAV-Track_VLA.
[239] SPAR: Single-Pass Any-Resolution ViT for Open-vocabulary Segmentation
Naomi Kombol, Ivan Martinović, Siniša Šegvić, Giorgos Tolias
Main category: cs.CV
TL;DR: SPAR is a resolution-agnostic ViT that enables efficient high-resolution dense prediction by distilling sliding-window teacher knowledge into a single-pass student model.
Details
Motivation: Standard Vision Transformers struggle with fine-grained spatial understanding due to fixed pre-training resolution and coarse patch representations, especially in dense prediction tasks like open-vocabulary segmentation where high-resolution inputs are crucial but computationally expensive with sliding-window approaches.Method: SPAR uses knowledge distillation to transfer spatial reasoning capabilities from a finely-strided sliding-window teacher to a single-pass student ViT using feature regression loss, without architectural changes or pixel-level supervision.
Result: SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher model in open-vocabulary segmentation, demonstrating effective high-resolution reasoning with computational efficiency.
Conclusion: SPAR provides an effective solution for resolution-agnostic dense feature extraction in ViTs, enabling efficient high-resolution inference for vision-language tasks without architectural modifications.
Abstract: Foundational Vision Transformers (ViTs) have limited effectiveness in tasks requiring fine-grained spatial understanding, due to their fixed pre-training resolution and inherently coarse patch-level representations. These challenges are especially pronounced in dense prediction scenarios, such as open-vocabulary segmentation with ViT-based vision-language models, where high-resolution inputs are essential for accurate pixel-level reasoning. Existing approaches typically process large-resolution images using a sliding-window strategy at the pre-training resolution. While this improves accuracy through finer strides, it comes at a significant computational cost. We introduce SPAR: Single-Pass Any-Resolution ViT, a resolution-agnostic dense feature extractor designed for efficient high-resolution inference. We distill the spatial reasoning capabilities of a finely-strided, sliding-window teacher into a single-pass student using a feature regression loss, without requiring architectural changes or pixel-level supervision. Applied to open-vocabulary segmentation, SPAR improves single-pass baselines by up to 10.5 mIoU and even surpasses the teacher, demonstrating effectiveness in efficient, high-resolution reasoning. Code: https://github.com/naomikombol/SPAR
[240] Modular Energy Steering for Safe Text-to-Image Generation with Foundation Models
Yaoteng Tan, Zikui Cai, M. Salman Asif
Main category: cs.CV
TL;DR: Training-free inference-time steering framework for text-to-image generation safety using gradient feedback from frozen vision-language foundation models as semantic energy estimators.
Details
Motivation: Existing safety approaches for text-to-image models rely on fine-tuning or curated datasets, which degrade generation quality or limit scalability. Need for modular, training-free safety control that preserves generation quality while being robust.Method: Uses gradient feedback from frozen pretrained vision-language foundation models to guide generation without modifying the underlying generator. Formulates safety steering as energy-based sampling by injecting feedback through clean latent estimates at each sampling step. Compatible with both diffusion and flow-matching models.
Result: State-of-the-art robustness against NSFW red-teaming benchmarks, effective multi-target steering, while preserving high generation quality on benign non-targeted prompts. Generalizes across diverse visual concepts.
Conclusion: Provides principled approach for using foundation models as semantic energy estimators, enabling reliable and scalable safety control for text-to-image generation without training or quality degradation.
Abstract: Controlling the behavior of text-to-image generative models is critical for safe and practical deployment. Existing safety approaches typically rely on model fine-tuning or curated datasets, which can degrade generation quality or limit scalability. We propose an inference-time steering framework that leverages gradient feedback from frozen pretrained foundation models to guide the generation process without modifying the underlying generator. Our key observation is that vision-language foundation models encode rich semantic representations that can be repurposed as off-the-shelf supervisory signals during generation. By injecting such feedback through clean latent estimates at each sampling step, our method formulates safety steering as an energy-based sampling problem. This design enables modular, training-free safety control that is compatible with both diffusion and flow-matching models and can generalize across diverse visual concepts. Experiments demonstrate state-of-the-art robustness against NSFW red-teaming benchmarks and effective multi-target steering, while preserving high generation quality on benign non-targeted prompts. Our framework provides a principled approach for utilizing foundation models as semantic energy estimators, enabling reliable and scalable safety control for text-to-image generation.
[241] AdamFlow: Adam-based Wasserstein Gradient Flows for Surface Registration in Medical Imaging
Qiang Ma, Qingjie Meng, Xin Hu, Yicheng Wu, Wenjia Bai
Main category: cs.CV
TL;DR: Fast surface registration method for medical imaging using probability measures and sliced Wasserstein distance with AdamFlow optimization
Details
Motivation: Existing surface registration methods face trade-offs between efficiency and robustness - local methods are efficient but vulnerable to noise, while global methods are robust but computationally expensiveMethod: Formulates surface meshes as probability measures and registration as distributional optimization using sliced Wasserstein distance with log-linear complexity; proposes AdamFlow optimization method that generalizes Adam from Euclidean to probability space
Result: Theoretical analysis of AdamFlow convergence and empirical demonstration of superior performance in both affine and non-rigid surface registration across various anatomical structures
Conclusion: Proposed method addresses efficiency-robustness trade-off in surface registration through probability measure formulation and novel optimization approach
Abstract: Surface registration plays an important role for anatomical shape analysis in medical imaging. Existing surface registration methods often face a trade-off between efficiency and robustness. Local point matching methods are computationally efficient, but vulnerable to noise and initialisation. Methods designed for global point set alignment tend to incur a high computational cost. To address the challenge, here we present a fast surface registration method, which formulates surface meshes as probability measures and surface registration as a distributional optimisation problem. The discrepancy between two meshes is measured using an efficient sliced Wasserstein distance with log-linear computational complexity. We propose a novel optimisation method, AdamFlow, which generalises the well-known Adam optimisation method from the Euclidean space to the probability space for minimising the sliced Wasserstein distance. We theoretically analyse the asymptotic convergence of AdamFlow and empirically demonstrate its superior performance in both affine and non-rigid surface registration across various anatomical structures.
[242] A Simple Baseline for Streaming Video Understanding
Yujiao Shen, Shulin Tian, Jingkang Yang, Ziwei Liu
Main category: cs.CV
TL;DR: A sliding-window baseline using only recent frames matches or beats complex streaming video models, challenging current trends in video understanding research.
Details
Motivation: To challenge the trend of increasingly complex memory mechanisms in streaming video understanding by showing that a simple sliding-window approach can match or surpass published streaming models.Method: Proposes SimpleStream - a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf vision-language model (VLM). Evaluated against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench.
Result: SimpleStream achieves 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench with only 4 recent frames. Shows that longer context value is backbone-dependent, and reveals a perception-memory trade-off where more historical context improves recall but weakens real-time perception.
Conclusion: Future streaming benchmarks should separate recent-scene perception from long-range memory to properly evaluate performance improvements from added complexity. Stronger memory/retrieval modules shouldn’t be considered progress unless they clearly outperform SimpleStream.
Abstract: Recent streaming video understanding methods increasingly rely on complex memory mechanisms to handle long video streams. We challenge this trend with a simple finding: a sliding-window baseline that feeds only the most recent N frames to an off-the-shelf VLM already matches or surpasses published streaming models. We formalize this baseline as SimpleStream and evaluate it against 13 major offline and online video LLM baselines on OVO-Bench and StreamingBench. Despite its simplicity, SimpleStream delivers consistently strong performance. With only 4 recent frames, it reaches 67.7% average accuracy on OVO-Bench and 80.59% on StreamingBench. Controlled ablations further show that the value of longer context is backbone-dependent rather than uniformly increasing with model scale, and reveal a consistent perception-memory trade-off: adding more historical context can improve recall, but often weakens real-time perception. This suggests that stronger memory, retrieval, or compression modules should not be taken as evidence of progress unless they clearly outperform SimpleStream under the same protocol. We therefore argue that future streaming benchmarks should separate recent-scene perception from long-range memory, so that performance improvements from added complexity can be evaluated more clearly.
[243] Large-scale Codec Avatars: The Unreasonable Effectiveness of Large-scale Avatar Pretraining
Junxuan Li, Rawal Khirodkar, Chengan He, Zhongshi Jiang, Giljoo Nam, Lingchen Yang, Jihyun Lee, Egor Zakharov, Zhaoen Su, Rinat Abdrashitov, Yuan Dong, Julieta Martinez, Kai Li, Qingyang Tan, Takaaki Shiratori, Matthew Hu, Peihong Guo, Xuhua Huang, Ariyan Zarei, Marco Pesavento, Yichen Xu, He Wen, Teng Deng, Wyatt Borsos, Anjali Thakrar, Jean-Charles Bazin, Carsten Stoll, Ginés Hidalgo, James Booth, Lucy Wang, Xiaowen Ma, Yu Rong, Sairanjith Thalanki, Chen Cao, Christian Häne, Abhishek Kar, Sofien Bouaziz, Jason Saragih, Yaser Sheikh, Shunsuke Saito
Main category: cs.CV
TL;DR: LCA is a 3D avatar model that combines large-scale pretraining on in-the-wild videos with high-quality post-training to achieve both generalization and fidelity, enabling feedforward inference for world-scale populations.
Details
Motivation: Addressing the trade-off between fidelity (from multi-view studio data) and generalization (from large-scale in-the-wild training) in 3D avatar modeling, where current approaches either produce high-fidelity but non-generalizable avatars or generalizable but low-quality ones.Method: Proposes a pre/post-training paradigm: pretrain on 1M in-the-wild videos to learn broad appearance/geometry priors, then post-train on high-quality curated data to enhance expressivity and fidelity. Uses a feedforward approach for efficient inference.
Result: LCA generalizes across hair styles, clothing, and demographics while providing precise facial expressions and finger-level articulation control with strong identity preservation. Shows emergent generalization to relightability, loose garment support, and zero-shot robustness to stylized imagery.
Conclusion: The pre/post-training paradigm successfully bridges the fidelity-generalization trade-off in 3D avatar modeling, enabling high-fidelity avatars that generalize to world-scale populations with emergent capabilities beyond direct supervision.
Abstract: High-quality 3D avatar modeling faces a critical trade-off between fidelity and generalization. On the one hand, multi-view studio data enables high-fidelity modeling of humans with precise control over expressions and poses, but it struggles to generalize to real-world data due to limited scale and the domain gap between the studio environment and the real world. On the other hand, recent large-scale avatar models trained on millions of in-the-wild samples show promise for generalization across a wide range of identities, yet the resulting avatars are often of low-quality due to inherent 3D ambiguities. To address this, we present Large-Scale Codec Avatars (LCA), a high-fidelity, full-body 3D avatar model that generalizes to world-scale populations in a feedforward manner, enabling efficient inference. Inspired by the success of large language models and vision foundation models, we present, for the first time, a pre/post-training paradigm for 3D avatar modeling at scale: we pretrain on 1M in-the-wild videos to learn broad priors over appearance and geometry, then post-train on high-quality curated data to enhance expressivity and fidelity. LCA generalizes across hair styles, clothing, and demographics while providing precise, fine-grained facial expressions and finger-level articulation control, with strong identity preservation. Notably, we observe emergent generalization to relightability and loose garment support to unconstrained inputs, and zero-shot robustness to stylized imagery, despite the absence of direct supervision.
[244] Beyond Referring Expressions: Scenario Comprehension Visual Grounding
Ruozhen He, Nisarg A. Shah, Qihua Dong, Zilin Xiao, Jaywon Koo, Vicente Ordonez
Main category: cs.CV
TL;DR: RSC benchmark introduces scenario-based visual grounding where targets must be inferred from roles, intentions, and relational context rather than explicit naming, with ScenGround curriculum method showing improved performance.
Details
Motivation: Existing visual grounding benchmarks focus on literal referring expressions where models can succeed by matching named categories. The authors want to explore more challenging scenario-based grounding requiring inference from roles, intentions, and relational context.Method: Introduces Referring Scenario Comprehension (RSC) benchmark with paragraph-length queries describing object roles, user goals, and contextual cues. Proposes ScenGround curriculum reasoning method combining supervised warm-starting with difficulty-aware reinforcement learning.
Result: RSC contains ~31k training examples, 4k in-domain test examples, and 3k out-of-distribution split. Scenario-based queries expose systematic failures in current models that standard benchmarks don’t reveal. Curriculum training improves performance on challenging slices and transfers to standard benchmarks.
Conclusion: Scenario-based visual grounding is a complementary and more challenging setting that requires deeper understanding beyond literal naming. The RSC benchmark and ScenGround method provide tools for advancing visual grounding capabilities.
Abstract: Existing visual grounding benchmarks primarily evaluate alignment between image regions and literal referring expressions, where models can often succeed by matching a prominent named category. We explore a complementary and more challenging setting of scenario-based visual grounding, where the target must be inferred from roles, intentions, and relational context rather than explicit naming. We introduce Referring Scenario Comprehension (RSC), a benchmark designed for this setting. The queries in this benchmark are paragraph-length texts that describe object roles, user goals, and contextual cues, including deliberate references to distractor objects that often require deep understanding to resolve. Each instance is annotated with interpretable difficulty tags for uniqueness, clutter, size, overlap, and position which expose distinct failure modes and support fine-grained analysis. RSC contains approximately 31k training examples, 4k in-domain test examples, and a 3k out-of-distribution split with unseen object categories. We further propose ScenGround, a curriculum reasoning method serving as a reference point for this setting, combining supervised warm-starting with difficulty-aware reinforcement learning. Experiments show that scenario-based queries expose systematic failures in current models that standard benchmarks do not reveal, and that curriculum training improves performance on challenging slices and transfers to standard benchmarks.
[245] Modulate-and-Map: Crossmodal Feature Mapping with Cross-View Modulation for 3D Anomaly Detection
Alex Costanzino, Pierluigi Zama Ramirez, Giuseppe Lisanti, Luigi Di Stefano
Main category: cs.CV
TL;DR: ModMap is a multiview multimodal framework for 3D anomaly detection that learns crossmodal feature mapping across views and modalities using feature-wise modulation, achieving SOTA on the SiM3D benchmark.
Details
Motivation: Existing methods process views independently for 3D anomaly detection, lacking effective cross-view and cross-modal feature integration. The authors aim to develop a natively multiview and multimodal framework that can better model relationships across different views and modalities for improved anomaly detection.Method: ModMap uses crossmodal feature mapping paradigm to learn feature mapping across both modalities and views. It employs feature-wise modulation to explicitly model view-dependent relationships. A cross-view training strategy leverages all possible view combinations, enabling multiview ensembling and aggregation for anomaly scoring. The method also includes training a foundational depth encoder for high-resolution 3D industrial data.
Result: ModMap achieves state-of-the-art performance on the SiM3D benchmark, surpassing previous methods by wide margins. The authors also publicly release a foundational depth encoder tailored to industrial datasets.
Conclusion: ModMap demonstrates the effectiveness of natively multiview and multimodal approaches for 3D anomaly detection, showing that explicit modeling of cross-view and cross-modal relationships through feature mapping and modulation leads to superior performance compared to independent view processing methods.
Abstract: We present ModMap, a natively multiview and multimodal framework for 3D anomaly detection and segmentation. Unlike existing methods that process views independently, our method draws inspiration from the crossmodal feature mapping paradigm to learn to map features across both modalities and views, while explicitly modelling view-dependent relationships through feature-wise modulation. We introduce a cross-view training strategy that leverages all possible view combinations, enabling effective anomaly scoring through multiview ensembling and aggregation. To process high-resolution 3D data, we train and publicly release a foundational depth encoder tailored to industrial datasets. Experiments on SiM3D, a recent benchmark that introduces the first multiview and multimodal setup for 3D anomaly detection and segmentation, demonstrate that ModMap attains state-of-the-art performance by surpassing previous methods by wide margins.
[246] Generative World Renderer
Zheng-Hui Huang, Zhixiang Wang, Jiaming Tan, Ruihan Yu, Yidan Zhang, Bo Zheng, Yu-Lun Liu, Yung-Yu Chuang, Kaipeng Zhang
Main category: cs.CV
TL;DR: Large-scale dynamic dataset from AAA games with synchronized RGB and G-buffer channels enables robust inverse rendering and G-buffer-guided video generation, with VLM-based evaluation protocol.
Details
Motivation: Existing synthetic datasets lack realism and temporal coherence, creating a domain gap for scaling generative inverse and forward rendering to real-world scenarios.Method: Created dataset using dual-screen stitched capture from AAA games (4M frames, 720p/30 FPS) with synchronized RGB and five G-buffer channels. Proposed VLM-based assessment protocol for evaluating inverse rendering without ground truth.
Result: Inverse renderers fine-tuned on this data achieve superior cross-dataset generalization and controllable generation. VLM evaluation strongly correlates with human judgment. Toolkit enables text-prompt editing of AAA game styles from G-buffers.
Conclusion: The dataset bridges the domain gap for bidirectional rendering, enabling robust geometry/material decomposition and high-fidelity G-buffer-guided video generation with practical applications in game style editing.
Abstract: Scaling generative inverse and forward rendering to real-world scenarios is bottlenecked by the limited realism and temporal coherence of existing synthetic datasets. To bridge this persistent domain gap, we introduce a large-scale, dynamic dataset curated from visually complex AAA games. Using a novel dual-screen stitched capture method, we extracted 4M continuous frames (720p/30 FPS) of synchronized RGB and five G-buffer channels across diverse scenes, visual effects, and environments, including adverse weather and motion-blur variants. This dataset uniquely advances bidirectional rendering: enabling robust in-the-wild geometry and material decomposition, and facilitating high-fidelity G-buffer-guided video generation. Furthermore, to evaluate the real-world performance of inverse rendering without ground truth, we propose a novel VLM-based assessment protocol measuring semantic, spatial, and temporal consistency. Experiments demonstrate that inverse renderers fine-tuned on our data achieve superior cross-dataset generalization and controllable generation, while our VLM evaluation strongly correlates with human judgment. Combined with our toolkit, our forward renderer enables users to edit styles of AAA games from G-buffers using text prompts.
[247] EventHub: Data Factory for Generalizable Event-Based Stereo Networks without Active Sensors
Luca Bartolomei, Fabio Tosi, Matteo Poggi, Stefano Mattoccia, Guillermo Gallego
Main category: cs.CV
TL;DR: EventHub trains event stereo networks without ground truth by using proxy annotations from color images via novel view synthesis, enabling repurposing of RGB stereo models for event data with strong generalization.
Details
Motivation: Ground truth annotations for event stereo networks are costly to obtain from active sensors. The paper aims to enable training of deep-event stereo networks without such expensive annotations by leveraging readily available color images.Method: Uses standard color images to derive proxy annotations and proxy events through novel view synthesis techniques. When images are already paired with event data, only proxy annotations are needed. This creates a training set that allows repurposing state-of-the-art RGB stereo models to process event data.
Result: Experiments on widely used event stereo datasets show EventHub’s effectiveness. The framework produces event stereo models with unprecedented generalization capabilities. The same data distillation mechanism also improves RGB stereo foundation models in challenging conditions like nighttime scenes.
Conclusion: EventHub provides a practical framework for training event stereo networks without costly ground truth annotations, enabling better generalization and improved performance in challenging conditions through data distillation from color images.
Abstract: We propose EventHub, a novel framework for training deep-event stereo networks without ground truth annotations from costly active sensors, relying instead on standard color images. From these images, we derive either proxy annotations and proxy events through state-of-the-art novel view synthesis techniques, or simply proxy annotations when images are already paired with event data. Using the training set generated by our data factory, we repurpose state-of-the-art stereo models from RGB literature to process event data, obtaining new event stereo models with unprecedented generalization capabilities. Experiments on widely used event stereo datasets support the effectiveness of EventHub and show how the same data distillation mechanism can improve the accuracy of RGB stereo foundation models in challenging conditions such as nighttime scenes.
[248] Robust Adaptation of Foundation Models with Black-Box Visual Prompting
Changdae Oh, Gyeongdeok Seo, Geunyoung Jung, Zhi-Qi Cheng, Hosik Choi, Jiyoung Jung, Kyungwoo Song
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2407.17491: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2407.17491&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[249] Human Insights Driven Latent Space for Different Driving Perspectives: A Unified Encoder for Efficient Multi-Task Inference
Huy-Dung Nguyen, Anass Bairouk, Mirjana Maras, Wei Xiao, Tsun-Hsuan Wang, Patrick Chareyre, Ramin Hasani, Marc Blanchon, Daniela Rus
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2409.10095: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2409.10095&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[250] OOD-SEG: Exploiting out-of-distribution detection techniques for learning image segmentation from sparse multi-class positive-only annotations
Junwen Wang, Zhonghao Wang, Oscar MacCormac, Jonathan Shapey, Tom Vercauteren
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation as paper content could not be retrievedMethod: Unable to determine method as paper content could not be retrieved
Result: Unable to determine results as paper content could not be retrieved
Conclusion: Unable to determine conclusion as paper content could not be retrieved
Abstract: Failed to fetch summary for 2411.09553: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.09553&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[251] GCond: Gradient Conflict Resolution via Accumulation-based Stabilization for Large-Scale Multi-Task Learning
Evgeny Alves Limarenko, Anastasiia Studenikina, Svetlana Illarionova, Maxim Sharaev
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2509.07252: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.07252&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[252] InTraGen: Trajectory-controlled Video Generation for Object Interactions
Zuhao Liu, Aleksandar Yanev, Ahmad Mahmood, Ivan Nikolov, Saman Motamed, Wei-Shi Zheng, Xi Wang, Lei Sun, Luc Van Gool, Danda Pani Paudel
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to access errorMethod: Cannot determine method due to access error
Result: Cannot determine results due to access error
Conclusion: Cannot determine conclusion due to access error
Abstract: Failed to fetch summary for 2411.16804: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2411.16804&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[253] CARE: Confidence-aware Ratio Estimation for Medical Biomarkers
Jiameng Li, Teodora Popordanoska, Aleksei Tiulpin, Sebastian G. Gruber, Frederik Maes, Matthew B. Blaschko
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2505.19585: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.19585&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[254] Learning Fine-Grained Geometry for Sparse-View Splatting via Cascade Depth Loss
Wenjun Lu, Haodong Chen, Anqi Yi, Guoxi Huang, Yuk Ying Chung, Kun Hu, Zhiyong Wang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2505.22279: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.22279&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[255] ThinkGeo: Evaluating Tool-Augmented Agents for Remote Sensing Tasks
Akashah Shabbir, Muhammad Akhtar Munir, Akshay Dudhane, Muhammad Umer Sheikh, Muhammad Haris Khan, Paolo Fraccaro, Juan Bernabe Moreno, Fahad Shahbaz Khan, Salman Khan
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error in accessing paper contentMethod: No method information available - paper content inaccessible
Result: No results available - technical error prevented paper retrieval
Conclusion: Cannot analyze paper due to HTTP 429 error from arXiv API
Abstract: Failed to fetch summary for 2505.23752: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.23752&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[256] Long-Tailed Distribution-Aware Router For Mixture-of-Experts in Large Vision-Language Model
Chaoxiang Cai, Longrong Yang, Minghe Weng, Xuewei Li, Zequn Qin, Xi Li
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API requestMethod: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to draw conclusions due to failed API request
Abstract: Failed to fetch summary for 2507.01351: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.01351&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[257] Aleatoric Uncertainty Medical Image Segmentation Estimation via Flow Matching
Phi Van Nguyen, Ngoc Huynh Trinh, Duy Minh Lam Nguyen, Phu Loc Nguyen, Quoc Long Tran
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to missing paper contentMethod: Unable to determine method due to missing paper content
Result: Unable to determine results due to missing paper content
Conclusion: Unable to determine conclusion due to missing paper content
Abstract: Failed to fetch summary for 2507.22418: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.22418&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[258] Understanding the Risks of Asphalt Art to the Reliability of Vision-Based Perception Systems
Jin Ma, Abyad Enan, Long Cheng, Mashrur Chowdhury
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error in fetching paper contentMethod: Unable to determine method due to technical error in fetching paper content
Result: Unable to determine results due to technical error in fetching paper content
Conclusion: Unable to draw conclusions due to technical error in fetching paper content
Abstract: Failed to fetch summary for 2508.02530: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.02530&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[259] Adaptive Reinforcement for Open-ended Medical Reasoning via Semantic-Guided Reward Collapse Mitigation
Yizhou Liu, Dingkang Yang, Zizhi Chen, Minghao Han, Xukun Zhang, Keliang Liu, Jingwei Wei, Lihua Zhang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2508.12957: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.12957&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[260] Maximally Useful and Minimally Redundant: The Key to Self Supervised Learning for Imbalanced Data
Yash Kumar Sharma, Vineet Padmanabhan
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2509.08469: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.08469&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[261] Demystifying Transition Matching: When and Why It Can Beat Flow Matching
Jaihoon Kim, Rajarshi Saha, Minhyuk Sung, Youngsuk Park
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2510.17991: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.17991&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[262] Be Tangential to Manifold: Discovering Riemannian Metric for Diffusion Models
Shinnosuke Saito, Takashi Matsubara
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.05509: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.05509&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[263] MS-Mix: Sentiment-Guided Adaptive Augmentation for Multimodal Sentiment Analysis
Hongyu Zhu, Lin Chen, Xin Jin, Mingsheng Shang
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2510.11579: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.11579&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[264] A Luminance-Aware Multi-Scale Network for Polarization Image Fusion with a Multi-Scene Dataset
Zhuangfan Huang, Xiaosong Li, Gao Wang, Tao Ye, Haishu Tan, Huafeng Li
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to inability to access paper contentMethod: Cannot determine method due to inability to access paper content
Result: Cannot determine results due to inability to access paper content
Conclusion: Cannot determine conclusion due to inability to access paper content
Abstract: Failed to fetch summary for 2510.24379: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.24379&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[265] FastSurfer-CC: A robust, accurate, and comprehensive framework for corpus callosum morphometry
Clemens Pollak, Kersten Diers, Santiago Estrada, David Kügler, Martin Reuter
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2511.16471: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.16471&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[266] SPAGS: Sparse-View Articulated Object Reconstruction from Single State via Planar Gaussian Splatting
Di Wu, Liu Liu, Xueyu Yuan, Wenxiao Chen, Lijun Yue, Liuzhu Chen, Yiming Tang, Meng Wang
Main category: cs.CV
TL;DR: Paper analysis unavailable due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Unable to determine motivation as the paper abstract could not be retrieved due to rate limitingMethod: Method unknown - arXiv API returned HTTP 429 (Too Many Requests) error preventing access to paper details
Result: No results available - technical issue with data retrieval prevents analysis
Conclusion: Cannot provide analysis due to technical limitations in accessing the paper information
Abstract: Failed to fetch summary for 2511.17092: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.17092&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[267] Seeing without Pixels: Perception from Camera Trajectories
Zihui Xue, Kristen Grauman, Dima Damen, Andrew Zisserman, Tengda Han
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about the paper due to unavailability of content
Abstract: Failed to fetch summary for 2511.21681: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.21681&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[268] ICM-SR: Image-Conditioned Manifold Regularization for Image Super-Resolution
Junoh Kang, Donghun Ryou, Bohyung Han
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2511.22048: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.22048&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[269] Structure is Supervision: Multiview Masked Autoencoders for Radiology
Sonia Laguna, Andrea Agostini, Alain Ryser, Samuel Ruiperez-Campillo, Irene Cannistraci, Moritz Vandenhirtz, Stephan Mandt, Nicolas Deperrois, Farhad Nooralahzadeh, Michael Krauthammer, Thomas M. Sutter, Julia E. Vogt
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2511.22294: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.22294&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[270] FRAMER: Frequency-Aligned Self-Distillation with Adaptive Modulation Leveraging Diffusion Priors for Real-World Image Super-Resolution
Seungho Choi, Jeahun Sung, Jihyong Oh
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2512.01390: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.01390&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[271] A multi-weight self-matching visual explanation for cnns on sar images
Siyuan Sun, Yongping Zhang, Hongcheng Zeng, Yamin Wang, Wei Yang, Wanting Yang, Jie Chen
Main category: cs.CV
TL;DR: Unable to analyze paper 2512.02344 due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Cannot determine motivation due to inability to access paper contentMethod: Cannot determine method due to inability to access paper content
Result: Cannot determine results due to inability to access paper content
Conclusion: Cannot draw conclusions due to inability to access paper content
Abstract: Failed to fetch summary for 2512.02344: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.02344&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[272] Improvise, Adapt, Overcome – Telescopic Adapters for Efficient Fine-tuning of Vision Language Models in Medical Imaging
Ujjwal Mishra, Vinita Shukla, Praful Hambarde, Amit Shukla
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2512.13855: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.13855&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[273] Deterministic World Models for Verification of Closed-loop Vision-based Systems
Yuang Geng, Zhuoyang Zhou, Zhongzheng Zhang, Siyuan Pan, Hoang-Dung Tran, Ivan Ruchkin
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) when trying to access arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2512.08991: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.08991&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[274] Towards Faithful Reasoning in Comics for Small MLLMs
Chengcheng Feng, Haojie Yin, Yucheng Jin, Kaizhu Huang
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2601.02991: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.02991&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[275] Grounding Everything in Tokens for Multimodal Large Language Models
Xiangxuan Ren, Zhongdao Wang, Liping Hou, Pin Tang, Guoqing Wang, Chao Ma
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2512.10554: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.10554&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[276] VOIC: Visible-Occluded Integrated Guidance for 3D Semantic Scene Completion
Zaidao Han, Risa Higashita, Jiang Liu
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2512.18954: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.18954&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[277] Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, Vincent Shao, Yue Yang, Weikai Huang, Ziqi Gao, Taira Anderson, Jianrui Zhang, Jitesh Jain, George Stoica, Winson Han, Ali Farhadi, Ranjay Krishna
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2601.10611: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.10611&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[278] ReScene4D: Temporally Consistent Semantic Instance Segmentation of Evolving Indoor 3D Scenes
Emily Steiner, Jianhao Zheng, Henry Howard-Jenkins, Chris Xie, Iro Armeni
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2601.11508: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.11508&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[279] LaVR: Scene Latent Conditioned Generative Video Trajectory Re-Rendering using Large 4D Reconstruction Models
Mingyang Xie, Numair Khan, Tianfu Wang, Naina Dhingra, Seonghyeon Nam, Haitao Yang, Zhuo Hui, Christopher Metzler, Andrea Vedaldi, Hamed Pirsiavash, Lei Luo
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2601.14674: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.14674&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[280] Seeing through Light and Darkness: Sensor-Physics Grounded Deblurring HDR NeRF from Single-Exposure Images and Events
Yunshan Qi, Lin Zhu, Nan Bao, Yifan Zhao, Jia Li
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to analyze paper due to technical access issues
Abstract: Failed to fetch summary for 2601.15475: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.15475&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[281] SALAD: Achieve High-Sparsity Attention via Efficient Linear Attention Tuning for Video Diffusion Transformer
Tongcheng Fang, Hanling Zhang, Ruiqi Xie, Zhuo Han, Xin Tao, Tianchen Zhao, Pengfei Wan, Wenbo Ding, Wanli Ouyang, Xuefei Ning, Yu Wang
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) - need to try again later or use different approach
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2601.16515: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.16515&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[282] GPA-VGGT:Adapting VGGT to Large Scale Localization by Self-Supervised Learning with Geometry and Physics Aware Loss
Yangfan Xu, Lilian Zhang, Xiaofeng He, Pengdong Wu, Wenqi Wu, Jun Mao
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2601.16885: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.16885&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[283] GVGS: Gaussian Visibility-Aware Multi-View Geometry for Accurate Surface Reconstruction
Mai Su, Qihan Yu, Zhongtao Wang, Yilong Li, Chengwei Pan, Yisong Chen, Guoping Wang, Fei Zhu
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2601.20331 could not be retrieved from arXiv API.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot determine conclusion without access to the paper content.
Abstract: Failed to fetch summary for 2601.20331: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.20331&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[284] Seeing Through the Chain: Mitigate Hallucination in Multimodal Reasoning Models via CoT Compression and Contrastive Preference Optimization
Hao Fang, Jinyu Li, Jiawei Kong, Tianqu Zhuang, Kuofeng Gao, Bin Chen, Shu-Tao Xia
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) when trying to access arXiv API for paper ID 2602.03380
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to technical limitations in accessing the information
Abstract: Failed to fetch summary for 2602.03380: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.03380&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[285] PMMA: The Polytechnique Montreal Mobility Aids Dataset
Qingwu Liu, Nicolas Saunier, Guillaume-Alexandre Bilodeau
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2602.10259: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.10259&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[286] EW-DETR: Evolving World Object Detection via Incremental Low-Rank DEtection TRansformer
Munish Monga, Vishal Chudasama, Pankaj Wasnik, C.V. Jawahar
Main category: cs.CV
TL;DR: Paper 2602.20985 summary unavailable due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2602.20985: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.20985&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[287] EmbodMocap: In-the-Wild 4D Human-Scene Reconstruction for Embodied Agents
Wenjia Wang, Liang Pan, Huaijin Pi, Yuke Lou, Xuqian Ren, Yifan Wu, Zhouyingcheng Liao, Lei Yang, Rishabh Dabral, Christian Theobalt, Taku Komura
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2602.23205: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23205&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[288] Cryo-Bench: Benchmarking Foundation Models for Cryosphere Applications
Saurabh Kaushik, Lalit Maurya, Beth Tellman
Main category: cs.CV
TL;DR: Paper ID 2603.01576 - Could not fetch abstract due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to missing abstractMethod: Unable to determine method due to missing abstract
Result: Unable to determine results due to missing abstract
Conclusion: Unable to determine conclusion due to missing abstract
Abstract: Failed to fetch summary for 2603.01576: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01576&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[289] PPEDCRF: Privacy-Preserving Enhanced Dynamic CRF for Location-Privacy Protection for Sequence Videos with Minimal Detection Degradation
Bo Ma, Jinsong Wu, Weiqi Yan, Catherine Shi, Minh Nguyen
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2603.01593: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01593&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[290] Fair Benchmarking of Emerging One-Step Generative Models Against Multistep Diffusion and Flow Models
Advaith Ravishankar, Serena Liu, Mingyang Wang, Todd Zhou, Jeffrey Zhou, Arnav Sharma, Ziling Hu, Léopold Das, Abdulaziz Sobirov, Faizaan Siddique, Freddy Yu, Seungjoo Baek, Yan Luo, Mengyu Wang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting errorMethod: Unable to determine method due to API rate limiting error
Result: Unable to determine results due to API rate limiting error
Conclusion: Unable to determine conclusion due to API rate limiting error
Abstract: Failed to fetch summary for 2603.14186: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.14186&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[291] AdaRadar: Rate Adaptive Spectral Compression for Radar-based Perception
Jinho Park, Se Young Chun, Mingoo Seok
Main category: cs.CV
TL;DR: Paper ID 2603.17979 could not be fetched due to HTTP 429 error (rate limiting), preventing analysis of its content and relevance assessment.
Details
Motivation: Unable to determine motivation as the paper abstract could not be retrieved due to API rate limiting.Method: Unknown - paper content not accessible for analysis.
Result: No results available due to failed data retrieval.
Conclusion: Cannot draw conclusions about an inaccessible paper.
Abstract: Failed to fetch summary for 2603.17979: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.17979&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[292] Generation Is Compression: Zero-Shot Video Coding via Stochastic Rectified Flow
Ziyue Zeng, Xun Su, Haoyuan Liu, Bingyu Lu, Yui Tatsumi, Hiroshi Watanabe
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2603.26571: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.26571&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[293] Fourier Splatting: Generalized Fourier encoded primitives for scalable radiance fields
Mihnea-Bogdan Jurca, Bert Van hauwermeiren, Adrian Munteanu
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2603.19834: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.19834&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[294] PAM: A Pose-Appearance-Motion Engine for Sim-to-Real HOI Video Generation
Mingju Gao, Kaisen Yang, Huan-ang Gao, Bohan Li, Ao Ding, Wenyi Li, Yangcheng Yu, Jinkun Liu, Shaocong Xu, Yike Niu, Haohan Chi, Hao Chen, Hao Tang, Yu Zhang, Li Yi, Hao Zhao
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to missing paper contentMethod: Unable to determine method due to missing paper content
Result: Unable to determine results due to missing paper content
Conclusion: Unable to determine conclusion due to missing paper content
Abstract: Failed to fetch summary for 2603.22193: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.22193&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[295] Mind the Hitch: Dynamic Calibration and Articulated Perception for Autonomous Trucks
Morui Zhu, Yongqi Zhu, Song Fu, Qing Yang
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without paper contentMethod: Cannot determine method without paper content
Result: Cannot determine results without paper content
Conclusion: Cannot determine conclusion without paper content
Abstract: Failed to fetch summary for 2603.23711: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.23711&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[296] ScrollScape: Unlocking 32K Image Generation With Video Diffusion Priors
Haodong Yu, Yabo Zhang, Donglin Di, Ruyi Zhang, Wangmeng Zuo
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2603.24270: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.24270&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[297] MOON3.0: Reasoning-aware Multimodal Representation Learning for E-commerce Product Understanding
Junxian Wu, Chenghan Fu, Zhanheng Nie, Daoze Zhang, Bowen Wan, Wanxian Guan, Chuan Yu, Jian Xu, Bo Zheng
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2604.00513: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.00513&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[298] OmniWeaving: Towards Unified Video Generation with Free-form Composition and Reasoning
Kaihang Pan, Qi Tian, Jianwei Zhang, Weijie Kong, Jiangfeng Xiong, Yanxin Long, Shixue Zhang, Haiyi Qiu, Tan Wang, Zheqi Lv, Yue Wu, Liefeng Bo, Siliang Tang, Zhao Zhong
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to inability to access paper contentMethod: Cannot determine method due to inability to access paper content
Result: Cannot determine results due to inability to access paper content
Conclusion: Cannot determine conclusion due to inability to access paper content
Abstract: Failed to fetch summary for 2603.24458: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.24458&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[299] Think, Act, Build: An Agentic Framework with Vision Language Models for Zero-Shot 3D Visual Grounding
Haibo Wang, Zihao Lin, Zhiyang Xu, Lifu Huang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2604.00528: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.00528&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[300] GridVAD: Open-Set Video Anomaly Detection via Spatial Reasoning over Stratified Frame Grids
Mohamed Eltahir, Ahmed O. Ibrahim, Obada Siralkhatim, Tabarak Abdallah, Sondos Mohamed
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). Need to try again later or use alternative methods to access the paper information.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot determine conclusion without access to the paper content.
Abstract: Failed to fetch summary for 2603.25467: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.25467&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[301] AutoWeather4D: Autonomous Driving Video Weather Conversion via G-Buffer Dual-Pass Editing
Tianyu Liu, Weitao Xiong, Kunming Luo, Manyuan Zhang, Peng Li, Yuan Liu, Ping Tan
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to draw conclusions due to fetch failure
Abstract: Failed to fetch summary for 2603.26546: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.26546&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[302] Toward Personalized Darts Training: A Data-Driven Framework Based on Skeleton-Based Biomechanical Analysis and Motion Modeling
Zhantao Chen, Dongyi He, Jin Fang, Xi Chen, Yishuo Liu, Xiaozhen Zhong, Xuejun Hu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2604.01130: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.01130&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[303] Human-Centric Perception for Child Sexual Abuse Imagery
Camila Laranjeira, João Macedo, Sandra Avila, Fabrício Benevenuto, Jefersson A. dos Santos
Main category: cs.CV
TL;DR: Paper 2603.27290: Unable to fetch summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to inability to access paper contentMethod: Cannot determine method due to inability to access paper content
Result: Cannot determine results due to inability to access paper content
Conclusion: Cannot determine conclusion due to inability to access paper content
Abstract: Failed to fetch summary for 2603.27290: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.27290&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[304] When Surfaces Lie: Exploiting Wrinkle-Induced Attention Shift to Attack Vision-Language Models
Chengyin Hu, Xuemeng Sun, Jiajun Han, Qike Zhang, Xiang Chen, Xin Wang, Yiwei Wei, Jiahua Long
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting). Need to try again later or use alternative methods to access the paper content.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot determine conclusion without access to the paper content.
Abstract: Failed to fetch summary for 2603.27759: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.27759&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[305] RehearsalNeRF: Decoupling Intrinsic Neural Fields of Dynamic Illuminations for Scene Editing
Changyeon Won, Hyunjun Jung, Jungu Cho, Seonmi Park, Chi-Hoon Lee, Hae-Gon Jeon
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2603.27948: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.27948&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[306] Tackling Non-IIDness in HAPS-Aided Federated Learning
Amin Farajzadeh, Animesh Yadav, Halim Yanikomeroglu
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2401.05308: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2401.05308&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[307] Monocular Building Height Estimation from PhiSat-2 Imagery: Dataset and Method
Yanjiao Song, Bowen Cai, Timo Balz, Zhenfeng Shao, Neema Simon Sumari, James Magidi, Walter Musakwa
Main category: cs.CV
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2603.29245: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.29245&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[308] MaskAdapt: Learning Flexible Motion Adaptation via Mask-Invariant Prior for Physics-Based Characters
Soomin Park, Eunseong Lee, Kwang Bin Lee, Sung-Hee Lee
Main category: cs.CV
TL;DR: MaskAdapt is a two-stage framework for flexible motion adaptation in physics-based humanoid control using residual learning with body-part masking.
Details
Motivation: The paper addresses the need for flexible motion adaptation in physics-based humanoid control, where modifying specific body parts while preserving other behaviors is challenging. Current methods often lack the ability to make targeted adaptations without affecting the entire motion.Method: Two-stage residual learning: 1) Train mask-invariant base policy with stochastic body-part masking and regularization for consistent actions across masking conditions. 2) Train residual policy on frozen base controller to modify only targeted body parts while preserving original behaviors elsewhere.
Result: MaskAdapt demonstrates strong robustness and adaptability, producing diverse behaviors under masked observations and delivering superior targeted motion adaptation compared to prior work. Applications include motion composition and text-driven partial goal tracking.
Conclusion: The framework enables flexible motion adaptation through a two-stage residual learning approach with body-part masking, showing versatility in motion composition and text-driven partial goal tracking applications.
Abstract: We present MaskAdapt, a framework for flexible motion adaptation in physics-based humanoid control. The framework follows a two-stage residual learning paradigm. In the first stage, we train a mask-invariant base policy using stochastic body-part masking and a regularization term that enforces consistent action distributions across masking conditions. This yields a robust motion prior that remains stable under missing observations, anticipating later adaptation in those regions. In the second stage, a residual policy is trained atop the frozen base controller to modify only the targeted body parts while preserving the original behaviors elsewhere. We demonstrate the versatility of this design through two applications: (i) motion composition, where varying masks enable multi-part adaptation within a single sequence, and (ii) text-driven partial goal tracking, where designated body parts follow kinematic targets provided by a pre-trained text-conditioned autoregressive motion generator. Through experiments, MaskAdapt demonstrates strong robustness and adaptability, producing diverse behaviors under masked observations and delivering superior targeted motion adaptation compared to prior work.
[309] Assessing Multimodal Chronic Wound Embeddings with Expert Triplet Agreement
Fabian Kabus, Julia Hindel, Jelena Bratulić, Meropi Karakioulaki, Ayush Gupta, Cristina Has, Thomas Brox, Abhinav Valada, Harald Binder
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2603.29376: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.29376&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[310] Scaling Video Pretraining for Surgical Foundation Models
Sicheng Lu, Zikai Xiao, Jianhui Wei, Danyu Sun, Qi Lu, Keli Hu, Yang Feng, Jian Wu, Zongxin Yang, Zuozhu Liu
Main category: cs.CV
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2603.29966: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.29966&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[311] Towards Physically Realizable Adversarial Attenuation Patch against SAR Object Detection
Yiming Zhang, Weibo Qin, Feng Wang
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error in fetching paper informationMethod: Unable to determine method due to technical error in fetching paper information
Result: Unable to determine results due to technical error in fetching paper information
Conclusion: Unable to draw conclusions due to technical error in fetching paper information
Abstract: Failed to fetch summary for 2604.00887: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.00887&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[312] Pixel Motion Diffusion is What We Need for Robot Control
E-Ro Nguyen, Yichi Zhang, Kanchana Ranasinghe, Xiang Li, Michael S. Ryoo
Main category: cs.CV
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2509.22652: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.22652&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.AI
[313] Runtime Burden Allocation for Structured LLM Routing in Agentic Expert Systems: A Full-Factorial Cross-Backend Methodology
Zhou Hanlin, Chan Huah Yong
Main category: cs.AI
TL;DR: Structured LLM routing is a systems-level burden-allocation problem, not just prompt engineering, with performance depending on how structural work is allocated across the generation stack and backend-specific interactions.
Details
Motivation: As LLMs become core control components in agentic AI systems, reliable structured routing must balance correctness, latency, and implementation cost under real deployment constraints. Current approaches treat routing as a prompt-engineering problem, but it's fundamentally a systems-level burden-allocation problem.Method: The paper evaluates structured routing as a burden-allocation problem across the generation stack (whether structure is emitted directly, compressed during transport, or reconstructed locally). They conduct a comprehensive full-factorial benchmark covering 48 deployment configurations and 15,552 requests across OpenAI, Gemini, and Llama backends.
Result: No universal best routing mode exists - backend-specific interaction effects dominate performance. Modes that work reliably on Gemini and OpenAI can suffer substantial correctness degradation on Llama, while efficiency gains from compressed realization are strongly backend-dependent.
Conclusion: The paper contributes a deployable framework for reasoning about structured routing under heterogeneous backend conditions, providing cross-backend evaluation methodology and practical deployment guidance for production-grade agentic expert systems.
Abstract: Structured LLM routing is often treated as a prompt-engineering problem. We argue that it is, more fundamentally, a systems-level burden-allocation problem. As large language models (LLMs) become core control components in agentic AI systems, reliable structured routing must balance correctness, latency, and implementation cost under real deployment constraints. We show that this balance is shaped not only by prompts or schemas, but also by how structural work is allocated across the generation stack: whether output structure is emitted directly by the model, compressed during transport, or reconstructed locally after generation. We evaluate this formulation through a comprehensive full-factorial benchmark covering 48 deployment configurations and 15,552 requests across OpenAI, Gemini, and Llama backends. Our central finding is consequential: there is no universal best routing mode. Instead, backend-specific interaction effects dominate performance. Modes that remain highly reliable on Gemini and OpenAI can suffer substantial correctness degradation on Llama, while efficiency gains from compressed realization are strongly backend-dependent. Rather than presenting another isolated model comparison, this work contributes a deployable framework for reasoning about structured routing under heterogeneous backend conditions. We provide a cross-backend evaluation methodology and practical deployment guidance for navigating the correctness-cost-latency frontier in production-grade agentic expert systems.
[314] The Digital Twin Counterfactual Framework: A Validation Architecture for Simulated Potential Outcomes
Olav Laudy
Main category: cs.AI
TL;DR: A causal inference framework using digital twin simulations with hierarchical validation to estimate counterfactuals, separating testable marginal effects from copula-dependent individual treatment effects.
Details
Motivation: Traditional causal inference methods rely on assumptions (ignorability, parallel trends) to substitute for missing counterfactual data, but none actually produce the counterfactual outcomes themselves. The paper aims to address this by simulating counterfactuals rather than just estimating them statistically.Method: Proposes the Digital Twin Counterfactual Framework (DTCF) that uses digital twin simulations within the potential outcomes framework. Introduces a hierarchy of twin fidelity assumptions (marginal, joint, structural fidelity) and a five-level validation architecture to convert unfalsifiable claims into testable hypotheses against observable data.
Result: Formally decomposes causal quantities into marginally validated effects (ATE, CATE, QTE) that can be tested through observable-arm comparisons, and copula-dependent effects (ITE distribution, probability of benefit/harm) that rely on unobservable within-individual dependence structure. Provides bounding, sensitivity, and uncertainty quantification tools.
Conclusion: DTCF doesn’t resolve the fundamental problem of causal inference but provides a framework where marginal causal claims become increasingly testable, joint causal claims become explicitly assumption-indexed, and the gap between them is formally characterized.
Abstract: The fundamental problem of causal inference - that the counterfactual outcome for any individual is never observed - has shaped the entire methodology of the field. Every existing approach substitutes assumptions for missing data: ignorability, parallel trends, exclusion restrictions. None produces the counterfactual itself. This paper proposes the Digital Twin Counterfactual Framework (DTCF): rather than estimating the counterfactual statistically, we simulate it using a digital twin and subject the simulation to a hierarchical validation regime. We formalize the digital twin simulator as a stochastic mapping within the potential outcomes framework and introduce a hierarchy of twin fidelity assumptions - from marginal fidelity through joint fidelity to structural fidelity - each unlocking a progressively richer class of estimands. The central contribution is threefold. First, a five-level validation architecture converts the unfalsifiable claim that the simulator produces correct counterfactuals into falsifiable tests against observable data. Second, a formal decomposition separates causal quantities into those that are marginally validated (ATE, CATE, QTE - testable through observable-arm comparison) and those that are copula-dependent (the ITE distribution, probability of benefit/harm, variance of treatment effects - permanently reliant on the unobservable within-individual dependence structure). Third, bounding, sensitivity, and uncertainty quantification tools make the copula dependence explicit. The DTCF does not resolve the fundamental problem of causal inference. What it provides is a framework in which marginal causal claims become increasingly testable, joint causal claims become explicitly assumption-indexed, and the gap between the two is formally characterized.
[315] IDEA2: Expert-in-the-loop competency question elicitation for collaborative ontology engineering
Elliott Watkiss-Leek, Reham Alharbi, Harry Rostron, Andrew Ng, Ewan Johnson, Andrew Mitchell, Terry R. Payne, Valentina Tamma, Jacopo de Berardinis
Main category: cs.AI
TL;DR: IDEA2 is a semi-automated workflow using LLMs to streamline competency question elicitation in ontology engineering through an iterative expert-in-the-loop process with provenance tracking.
Details
Motivation: Competency question elicitation is a critical but resource-intensive bottleneck in ontology engineering, hampered by communication gaps between domain experts and ontology engineers.Method: IDEA2 integrates LLMs in a collaborative workflow: initial LLM-based extraction of CQs from requirements, expert review/feedback on a collaborative platform, and iterative feedback-driven reformulation by LLM until consensus, with full provenance tracking.
Result: Validated in 2 real-world scenarios (scientific data, cultural heritage), IDEA2 accelerated requirements engineering, improved CQ acceptance/relevance, and showed high usability/effectiveness among domain experts.
Conclusion: IDEA2 successfully addresses the CQ elicitation bottleneck through LLM-assisted collaborative workflow, demonstrating practical value in real-world ontology engineering scenarios.
Abstract: Competency question (CQ) elicitation represents a critical but resource-intensive bottleneck in ontology engineering. This foundational phase is often hampered by the communication gap between domain experts, who possess the necessary knowledge, and ontology engineers, who formalise it. This paper introduces IDEA2, a novel, semi-automated workflow that integrates Large Language Models (LLMs) within a collaborative, expert-in-the-loop process to address this challenge. The methodology is characterised by a core iterative loop: an initial LLM-based extraction of CQs from requirement documents, a co-creational review and feedback phase by domain experts on an accessible collaborative platform, and an iterative, feedback-driven reformulation of rejected CQs by an LLM until consensus is achieved. To ensure transparency and reproducibility, the entire lifecycle of each CQ is tracked using a provenance model that captures the full lineage of edits, anonymised feedback, and generation parameters. The workflow was validated in 2 real-world scenarios (scientific data, cultural heritage), demonstrating that IDEA2 can accelerate the requirements engineering process, improve the acceptance and relevance of the resulting CQs, and exhibit high usability and effectiveness among domain experts. We release all code and experiments at https://github.com/KE-UniLiv/IDEA2
[316] Semantic Modeling for World-Centered Architectures
Andrei Mantsivoda, Darya Gavrilina
Main category: cs.AI
TL;DR: World-centered multi-agent systems (WMAS) propose a shared, explicit world representation for structured domains like enterprises, using semantic models for consistency and verifiability, with Ontobox as a realization platform.
Details
Motivation: Traditional agent-centered architectures lack semantic consistency, explainability, and long-term stability in structured domains like enterprises and institutional systems, requiring a shared world representation.Method: Introduce WMAS with shared world models classified by ontological explicitness and normativity; propose semantic models as mathematical formalism; present Ontobox platform as realization.
Result: WMAS enables global consistency, verifiable system behavior, and coordination over shared world models rather than isolated agent-local representations.
Conclusion: World-centered approach with explicit shared representations is superior for structured domains, providing semantic consistency, explainability, and long-term stability.
Abstract: We introduce world-centered multi-agent systems (WMAS) as an alternative to traditional agent-centered architectures, arguing that structured domains such as enterprises and institutional systems require a shared, explicit world representation to ensure semantic consistency, explainability, and long-term stability. We classify worlds along dimensions including ontological explicitness, normativity, etc. In WMAS, learning and coordination operate over a shared world model rather than isolated agent-local representations, enabling global consistency and verifiable system behavior. We propose semantic models as a mathematical formalism for representing such worlds. Finally, we present the Ontobox platform as a realization of WMAS.
[317] A Role-Based LLM Framework for Structured Information Extraction from Healthy Food Policies
Congjing Zhang, Ruoxuan Bao, Jingyu Li, Yoav Ackerman, Shuai Huang, Yanfang Su
Main category: cs.AI
TL;DR: A role-based LLM framework for information extraction from healthy food policy documents, using specialized LLM roles to address hallucinations and inconsistencies in policy analysis.
Details
Motivation: Current LLM approaches for information extraction in the healthy food policy domain suffer from hallucinations, misclassifications, and omissions due to structural diversity and inconsistency of policy documents, requiring more reliable methods.Method: Proposes a role-based LLM framework with specialized roles: LLM policy analyst for metadata/mechanism classification, LLM legal strategy specialist for identifying legal approaches, and LLM food system expert for categorizing food system stages. Incorporates structured domain knowledge into role-specific prompts to mimic expert workflows.
Result: Evaluated on 608 healthy food policies from HFPP database, the framework outperformed zero-shot, few-shot, and chain-of-thought baselines using Llama-3.3-70B, demonstrating superior performance in complex reasoning tasks.
Conclusion: The role-based framework offers a reliable and transparent methodology for automating information extraction from health policies, addressing limitations of current LLM approaches.
Abstract: Current Large Language Model (LLM) approaches for information extraction (IE) in the healthy food policy domain are often hindered by various factors, including misinformation, specifically hallucinations, misclassifications, and omissions that result from the structural diversity and inconsistency of policy documents. To address these limitations, this study proposes a role-based LLM framework that automates the IE from unstructured policy data by assigning specialized roles: an LLM policy analyst for metadata and mechanism classification, an LLM legal strategy specialist for identifying complex legal approaches, and an LLM food system expert for categorizing food system stages. This framework mimics expert analysis workflows by incorporating structured domain knowledge, including explicit definitions of legal mechanisms and classification criteria, into role-specific prompts. We evaluate the framework using 608 healthy food policies from the Healthy Food Policy Project (HFPP) database, comparing its performance against zero-shot, few-shot, and chain-of-thought (CoT) baselines using Llama-3.3-70B. Our proposed framework demonstrates superior performance in complex reasoning tasks, offering a reliable and transparent methodology for automating IE from health policies.
[318] Crashing Waves vs. Rising Tides: Preliminary Findings on AI Automation from Thousands of Worker Evaluations of Labor Market Tasks
Matthias Mertens, Adam Kuzee, Brittany S. Harris, Harry Lyu, Wensu Li, Jonathan Rosenfeld, Meiri Anto, Martin Fleming, Neil Thompson
Main category: cs.AI
TL;DR: AI automation occurs through “rising tides” (continuous broad-based improvement) rather than “crashing waves” (abrupt surges), with LLMs showing steady progress across thousands of text-based tasks and projected to reach 80-95% success rates by 2029.
Details
Motivation: To understand the pattern of AI automation - whether it occurs through abrupt capability surges ("crashing waves") or continuous broad-based improvement ("rising tides") - and to quantify AI's current and future capabilities across diverse text-based tasks.Method: Evaluated AI capabilities across over 3,000 text-based tasks derived from U.S. Department of Labor O*NET categorization, using more than 17,000 evaluations by workers from those jobs to assess AI performance and improvement trends.
Result: Found substantial evidence for “rising tides” as the primary form of AI automation, with AI performance high and improving rapidly across a wide range of tasks. In 2024-Q2, AI models successfully complete 3-4 hour human tasks with ~50% success rate, projected to increase to ~65% by 2025-Q3 and 80-95% by 2029.
Conclusion: AI automation progresses through continuous broad-based improvement rather than abrupt surges, with LLMs showing steady capability growth that will likely enable them to handle most text-related tasks at high success rates within 5 years, though economic adoption may lag behind technical capabilities.
Abstract: We propose that AI automation is a continuum between: (i) crashing waves where AI capabilities surge abruptly over small sets of tasks, and (ii) rising tides where the increase in AI capabilities is more continuous and broad-based. We test for these effects in preliminary evidence from an ongoing evaluation of AI capabilities across over 3,000 broad-based tasks derived from the U.S. Department of Labor O*NET categorization that are text-based and thus LLM-addressable. Based on more than 17,000 evaluations by workers from these jobs, we find little evidence of crashing waves (in contrast to recent work by METR), but substantial evidence that rising tides are the primary form of AI automation. AI performance is high and improving rapidly across a wide range of tasks. We estimate that, in 2024-Q2, AI models successfully complete tasks that take humans approximately 3-4 hours with about a 50% success rate, increasing to about 65% by 2025-Q3. If recent trends in AI capability growth persist, this pace of AI improvement implies that LLMs will be able to complete most text-related tasks with success rates of, on average, 80%-95% by 2029 at a minimally sufficient quality level. Achieving near-perfect success rates at this quality level or comparable success rates at superior quality would require several additional years. These AI capability improvements would impact the economy and labor market as organizations adopt AI, which could have a substantially longer timeline.
[319] Systematic Analyses of Reinforcement Learning Controllers in Signalized Urban Corridors
Xiaofei Song, Kerstin Eder, Jonathan Lawry, R. Eddie Wilson
Main category: cs.AI
TL;DR: RL-based traffic control methods (centralized, decentralized, parameter-sharing) are compared with MaxPressure baseline on urban corridor networks, showing parameter-sharing controllers can generalize to larger networks and potentially enable self-organized ‘green waves’.
Details
Motivation: To extend systematic capacity region analysis to multi-junction traffic networks and compare different RL-based control strategies against classical approaches for urban corridor traffic management.Method: Train and evaluate three types of RL controllers (centralized, fully decentralized, parameter-sharing decentralized) on urban corridor networks, compare with MaxPressure baseline, and test generalization of parameter-sharing controllers to larger networks.
Result: Parameter-sharing decentralized controllers can be deployed on larger networks than trained on, with initial findings suggesting traffic may self-organize into ‘green waves’ even without formal coordination.
Conclusion: Decentralized RL controllers with parameter-sharing show promise for scalable traffic control, potentially enabling emergent coordination patterns like green waves in urban networks.
Abstract: In this work, we extend our systematic capacity region perspective to multi-junction traffic networks, focussing on the special case of an urban corridor network. In particular, we train and evaluate centralized, fully decentralized, and parameter-sharing decentralized RL controllers, and compare their capacity regions and ATTs together with a classical baseline MaxPressure controller. Further, we show how the parametersharing controller may be generalised to be deployed on a larger network than it was originally trained on. In this setting, we show some initial findings that suggest that even though the junctions are not formally coordinated, traffic may self organise into `green waves’.
[320] CogBias: Measuring and Mitigating Cognitive Bias in Large Language Models
Fan Huang, Songheng Zhang, Haewoon Kwak, Jisun An
Main category: cs.AI
TL;DR: LLMs exhibit systematic cognitive biases across four families (Judgment, Information Processing, Social, Response) that can be identified as linearly separable directions in activation space and mitigated through activation steering while preserving general capabilities.
Details
Motivation: LLMs are increasingly used in high-stakes decision-making, and while prior work shows they exhibit cognitive biases behaviorally, it remains unclear whether these biases correspond to identifiable internal representations and can be mitigated through targeted interventions.Method: Introduced LLM CogBias benchmark with four families of cognitive biases; evaluated three LLMs; used linear probes with contrastive design to identify bias representations; applied activation steering to modulate biased behavior while testing preservation of downstream capabilities on 25 benchmarks.
Result: Cognitive biases emerge systematically across all four families with family-dependent magnitudes and debiasing responses; biases are encoded as linearly separable directions in activation space; activation steering achieved 26-32% reduction in bias score while preserving downstream capabilities (Llama: negligible degradation; Qwen: up to -19.0pp for Judgment biases).
Conclusion: LLM cognitive biases are systematically encoded in activation space and can be mitigated through targeted activation steering, with bias representations being near-orthogonal across models but sharing functional organization that enables similar debiasing rates.
Abstract: Large Language Models (LLMs) are increasingly deployed in high-stakes decision-making contexts. While prior work has shown that LLMs exhibit cognitive biases behaviorally, whether these biases correspond to identifiable internal representations and can be mitigated through targeted intervention remains an open question. We define LLM cognitive bias as systematic, reproducible deviations from correct answers in tasks with computable ground-truth baselines, and introduce LLM CogBias, a benchmark organized around four families of cognitive biases: Judgment, Information Processing, Social, and Response. We evaluate three LLMs and find that cognitive biases emerge systematically across all four families, with magnitudes and debiasing responses that are strongly family-dependent: prompt-level debiasing substantially reduces Response biases but backfires for Judgment biases. Using linear probes under a contrastive design, we show that these biases are encoded as linearly separable directions in model activation space. Finally, we apply activation steering to modulate biased behavior, achieving 26–32% reduction in bias score (fraction of biased responses) while preserving downstream capability on 25 benchmarks (Llama: negligible degradation; Qwen: up to $-$19.0pp for Judgment biases). Despite near-orthogonal bias representations across models (mean cosine similarity 0.01), steering reduces bias at similar rates across architectures ($r(246)$=.621, $p$<.001), suggesting shared functional organization.
[321] RIFT: A RubrIc Failure Mode Taxonomy and Automated Diagnostics
Zhengyang Qi, Charles Dickens, Derek Pham, Amanda Dsouza, Armin Parchami, Frederic Sala, Paroma Varma
Main category: cs.AI
TL;DR: RIFT introduces a taxonomy for diagnosing failure modes in rubric design for LLM evaluation, with eight failure modes across three categories, developed through grounded theory on diverse benchmarks.
Details
Motivation: Rubric-based evaluation is widely used in LLM benchmarks but lacks principled methods to diagnose rubric quality issues from aggregated or downstream signals alone, creating a gap in evaluating rubric effectiveness.Method: Developed RIFT taxonomy using grounded theory by iteratively annotating rubrics from five diverse benchmarks until saturation. Evaluated consistency through human annotator agreement metrics, and proposed automated rubric quality metrics for scalable diagnosis.
Result: Achieved fair agreement among human annotators (87% pairwise agreement, 0.64 average Cohen’s kappa). Automated metrics aligned well with human annotations, achieving up to 0.86 F1 score.
Conclusion: RIFT provides a systematic framework for diagnosing rubric quality issues, enabling better evaluation of LLM performance on open-ended tasks through improved rubric design and automated quality assessment.
Abstract: Rubric-based evaluation is widely used in LLM benchmarks and training pipelines for open-ended, less verifiable tasks. While prior work has demonstrated the effectiveness of rubrics using downstream signals such as reinforcement learning outcomes, there remains no principled way to diagnose rubric quality issues from such aggregated or downstream signals alone. To address this gap, we introduce RIFT: RubrIc Failure mode Taxonomy, a taxonomy for systematically characterizing failure modes in rubric composition and design. RIFT consists of eight failure modes organized into three high-level categories: Reliability Failures, Content Validity Failures, and Consequential Validity Failures. RIFT is developed using grounded theory by iteratively annotating rubrics drawn from five diverse benchmarks spanning general instruction following, code generation, creative writing, and expert-level deep research, until no new failure modes are identified. We evaluate the consistency of the taxonomy by measuring agreement among independent human annotators, observing fair agreement overall (87% pairwise agreement and 0.64 average Cohen’s kappa). Finally, to support scalable diagnosis, we propose automated rubric quality metrics and show that they align with human failure-mode annotations, achieving up to 0.86 F1.
[322] Leveraging the Value of Information in POMDP Planning
Zakariya Laouar, Qi Heng Ho, Zachary Sunberg
Main category: cs.AI
TL;DR: VOIMCP: A Monte Carlo Tree Search algorithm for POMDPs that selectively disregards observation information when value of information is low, improving computational efficiency.
Details
Motivation: Solving large POMDPs efficiently remains challenging due to dimensionality and history curses. The value of information varies across belief space, suggesting opportunities to optimize computational effort by selectively processing observations.Method: Introduces a dynamic programming framework that conditionally processes observations based on VOI at each belief. Proposes VOIMCP algorithm that uses Monte Carlo Tree Search to allocate computational effort by selectively disregarding observation information when VOI is low.
Result: Theoretical guarantees on near-optimality of VOI reasoning framework and non-asymptotic convergence bounds for VOIMCP. Simulation evaluations show VOIMCP outperforms baselines on several POMDP benchmarks.
Conclusion: Selective observation processing based on value of information enables more efficient POMDP planning, with VOIMCP demonstrating improved performance over existing methods.
Abstract: Partially observable Markov decision processes (POMDPs) offer a principled formalism for planning under state and transition uncertainty. Despite advances made towards solving large POMDPs, obtaining performant policies under limited planning time remains a major challenge due to the curse of dimensionality and the curse of history. For many POMDP problems, the value of information (VOI) - the expected performance gain from reasoning about observations - varies over the belief space. We introduce a dynamic programming framework that exploits this structure by conditionally processing observations based on the value of information at each belief. Building on this framework, we propose Value of Information Monte Carlo planning (VOIMCP), a Monte Carlo Tree Search algorithm that allocates computational effort more efficiently by selectively disregarding observation information when the VOI is low, avoiding unnecessary branching of observations. We provide theoretical guarantees on the near-optimality of our VOI reasoning framework and derive non-asymptotic convergence bounds for VOIMCP. Simulation evaluations demonstrate that VOIMCP outperforms baselines on several POMDP benchmarks.
[323] The Self Driving Portfolio: Agentic Architecture for Institutional Asset Management
Andrew Ang, Nazym Azimbayev, Andrey Kim
Main category: cs.AI
TL;DR: An agentic AI system for strategic asset allocation where specialized agents generate market assumptions, construct portfolios using multiple methods, critique each other, and self-improve through meta-learning, all governed by traditional investment policy statements.
Details
Motivation: To shift the investor's role from analytical execution to oversight by creating an autonomous agentic system that can handle complex portfolio management tasks while being constrained by traditional investment governance frameworks.Method: Developed a pipeline with ~50 specialized agents that: 1) produce capital market assumptions, 2) construct portfolios using over 20 competing methods, 3) critique and vote on each other’s output, 4) include a researcher agent proposing new methods, and 5) feature a meta-agent that compares past forecasts against realized returns to rewrite agent code and prompts for improvement.
Result: Created a functional agentic strategic asset allocation pipeline where autonomous agents can perform complex investment tasks while being governed by traditional Investment Policy Statements, enabling human oversight rather than execution.
Conclusion: Agentic AI can successfully handle sophisticated investment management tasks when properly structured with specialized agents, self-improvement mechanisms, and traditional governance frameworks, allowing investors to focus on oversight rather than execution.
Abstract: Agentic AI shifts the investor’s role from analytical execution to oversight. We present an agentic strategic asset allocation pipeline in which approximately 50 specialized agents produce capital market assumptions, construct portfolios using over 20 competing methods, and critique and vote on each other’s output. A researcher agent proposes new portfolio construction methods not yet represented, and a meta-agent compares past forecasts against realized returns and rewrites agent code and prompts to improve future performance. The entire pipeline is governed by the Investment Policy Statement–the same document that guides human portfolio managers can now constrain and direct autonomous agents.
[324] ClawSafety: “Safe” LLMs, Unsafe Agents
Bowen Wei, Yunbei Zhang, Jinhao Pan, Kai Mei, Xiao Wang, Jihun Hamm, Ziwei Zhu, Yingqiang Ge
Main category: cs.AI
TL;DR: CLAWSAFETY benchmark evaluates AI agent safety in high-privilege environments, showing 40-75% attack success rates across models with safety depending on both model and framework stack.
Details
Motivation: Existing safety evaluations for AI agents are inadequate - they test models in isolated chat settings, use synthetic environments, and don't account for how agent frameworks affect safety outcomes, despite real risks of credential leaks, financial redirection, and file destruction.Method: Created CLAWSAFETY benchmark with 120 adversarial test scenarios across three dimensions (harm domain, attack vector, harmful action type) grounded in realistic professional workspaces. Evaluated 5 frontier LLMs across 2,520 sandboxed trials with adversarial content injected via skill files, emails, and web pages.
Result: Attack success rates ranged 40-75% across models, varying sharply by injection vector (skill instructions most dangerous). Strongest models maintained boundaries against credential forwarding/destructive actions while weaker models permitted both. Cross-framework experiments showed safety depends on full deployment stack, not just backbone model.
Conclusion: AI agent safety requires evaluation that treats model and framework as joint variables, with current frontier models showing significant vulnerabilities in high-privilege environments despite some protective boundaries.
Abstract: Personal AI agents like OpenClaw run with elevated privileges on users’ local machines, where a single successful prompt injection can leak credentials, redirect financial transactions, or destroy files. This threat goes well beyond conventional text-level jailbreaks, yet existing safety evaluations fall short: most test models in isolated chat settings, rely on synthetic environments, and do not account for how the agent framework itself shapes safety outcomes. We introduce CLAWSAFETY, a benchmark of 120 adversarial test scenarios organized along three dimensions (harm domain, attack vector, and harmful action type) and grounded in realistic, high-privilege professional workspaces spanning software engineering, finance, healthcare, law, and DevOps. Each test case embeds adversarial content in one of three channels the agent encounters during normal work: workspace skill files, emails from trusted senders, and web pages. We evaluate five frontier LLMs as agent backbones, running 2,520 sandboxed trials across all configurations. Attack success rates (ASR) range from 40% to 75% across models and vary sharply by injection vector, with skill instructions (highest trust) consistently more dangerous than email or web content. Action-trace analysis reveals that the strongest model maintains hard boundaries against credential forwarding and destructive actions, while weaker models permit both. Cross-scaffold experiments on three agent frameworks further demonstrate that safety is not determined by the backbone model alone but depends on the full deployment stack, calling for safety evaluation that treats model and framework as joint variables.
[325] When AI Gets it Wong: Reliability and Risk in AI-Assisted Medication Decision Systems
Khalid Adnan Alsayed
Main category: cs.AI
TL;DR: AI medication systems show strong metrics but real-world reliability is unclear; this paper analyzes failure types and clinical consequences through simulated scenarios.
Details
Motivation: AI systems in healthcare/pharmacy show good performance metrics but their real-world reliability in safety-critical medication decisions is insufficiently understood, with potential for severe patient harm from single errors.Method: Controlled simulated scenarios involving drug interactions and dosage decisions to analyze different failure types (missed interactions, incorrect risk flagging, inappropriate dosage recommendations).
Result: AI errors in medication contexts can lead to adverse drug reactions, ineffective treatment, or delayed care, especially without sufficient human oversight; highlights risks of over-reliance and transparency challenges.
Conclusion: Need reliability-focused AI evaluation in healthcare, complementing traditional metrics with risk-aware approaches, particularly in safety-critical domains like pharmacy.
Abstract: Artificial intelligence (AI) systems are increasingly integrated into healthcare and pharmacy workflows, supporting tasks such as medication recommendations, dosage determination, and drug interaction detection. While these systems often demonstrate strong performance under standard evaluation metrics, their reliability in real-world decision-making remains insufficiently understood. In high-risk domains such as medication management, even a single incorrect recommendation can result in severe patient harm. This paper examines the reliability of AI-assisted medication systems by focusing on system failures and their potential clinical consequences. Rather than evaluating performance solely through aggregate metrics, this work shifts attention towards how errors occur and what happens when AI systems produce incorrect outputs. Through a series of controlled, simulated scenarios involving drug interactions and dosage decisions, we analyse different types of system failures, including missed interactions, incorrect risk flagging, and inappropriate dosage recommendations. The findings highlight that AI errors in medication-related contexts can lead to adverse drug reactions, ineffective treatment, or delayed care, particularly when systems are used without sufficient human oversight. Furthermore, the paper discusses the risks of over-reliance on AI recommendations and the challenges posed by limited transparency in decision-making processes. This work contributes a reliability-focused perspective on AI evaluation in healthcare, emphasising the importance of understanding failure behavior and real-world impact. It highlights the need to complement traditional performance metrics with risk-aware evaluation approaches, particularly in safety-critical domains such as pharmacy practice.
[326] A Multi-Agent Human-LLM Collaborative Framework for Closed-Loop Scientific Literature Summarization
Maxwell J. Jacobson, Daniel Xie, Jackson Shen, Adil Wazeer, Haiyan Wang, Xinghang Zhang, Yexiang Xue
Main category: cs.AI
TL;DR: Elhuyar is a multi-agent, human-in-the-loop system that combines LLMs, structured AI, and human scientists to extract and analyze insights from scientific literature, demonstrated in materials science for tungsten under helium-ion irradiation.
Details
Motivation: Scientific discovery is slowed by fragmented literature requiring excessive human effort. Existing AI tools lack structured, multi-step approaches for deep insight extraction, and LLMs are unreliable due to hallucinations and incomplete extraction.Method: Multi-agent system with specialized agents for filtering papers, extracting data, fitting models, and summarizing findings. Integrates LLMs, structured AI, and human oversight for reliability. Generates structured reports with data, visualizations, model equations, and text summaries through iterative refinement.
Result: Successfully deployed in materials science, analyzing tungsten under helium-ion irradiation literature. Showed experimentally correlated exponential helium bubble growth with irradiation dose and temperature, offering insights for plasma-facing materials in fusion reactors.
Conclusion: Elhuyar demonstrates how AI-assisted literature review can uncover scientific patterns and accelerate discovery through a structured, multi-agent approach with human oversight.
Abstract: Scientific discovery is slowed by fragmented literature that requires excessive human effort to gather, analyze, and understand. AI tools, including autonomous summarization and question answering, have been developed to aid in understanding scientific literature. However, these tools lack the structured, multi-step approach necessary for extracting deep insights from scientific literature. Large Language Models (LLMs) offer new possibilities for literature analysis, but remain unreliable due to hallucinations and incomplete extraction. We introduce Elhuyar, a multi-agent, human-in-the-loop system that integrates LLMs, structured AI, and human scientists to extract, analyze, and iteratively refine insights from scientific literature. The framework distributes tasks among specialized agents for filtering papers, extracting data, fitting models, and summarizing findings, with human oversight ensuring reliability. The system generates structured reports with extracted data, visualizations, model equations, and text summaries, enabling deeper inquiry through iterative refinement. Deployed in materials science, it analyzed literature on tungsten under helium-ion irradiation, showing experimentally correlated exponential helium bubble growth with irradiation dose and temperature, offering insight for plasma-facing materials (PFMs) in fusion reactors. This demonstrates how AI-assisted literature review can uncover scientific patterns and accelerate discovery.
[327] AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance
Dhaval Patel, Shuxin Lin, James Rayfield, Nianjun Zhou, Chathurangi Shyalika, Suryanarayana R Yarrabothula, Roman Vaculin, Natalia Martinez, Fearghal O’donncha, Jayant Kalagnanam
Main category: cs.AI
TL;DR: AssetOpsBench: A unified framework for evaluating LLM agents in industrial asset management with multimodal ecosystem, domain-specific agents, and automated evaluation metrics.
Details
Motivation: Traditional AI/ML approaches solve narrow industrial tasks in isolation, but LLM agents offer end-to-end automation for complex operational workflows like condition monitoring and maintenance scheduling in Industry 4.0.Method: Introduces AssetOpsBench framework with: 1) four domain-specific agents, 2) 140+ human-authored natural-language queries grounded in real industrial scenarios, 3) simulated CouchDB-backed IoT environment, and 4) automated evaluation framework with three key metrics comparing Tool-As-Agent vs Plan-Executor paradigms.
Result: Demonstrated practical relevance through broad community adoption (250+ users, 500+ agents submitted), supporting reproducible and scalable research for real-world industrial operations.
Conclusion: AssetOpsBench provides a comprehensive benchmarking platform for evaluating LLM agents in industrial asset lifecycle management, enabling systematic analysis of architectural trade-offs and failure mode discovery.
Abstract: AI for Industrial Asset Lifecycle Management aims to automate complex operational workflows, such as condition monitoring and maintenance scheduling, to minimize system downtime. While traditional AI/ML approaches solve narrow tasks in isolation, Large Language Model (LLM) agents offer a next-generation opportunity for end-to-end automation. In this paper, we introduce AssetOpsBench, a unified framework for orchestrating and evaluating domain-specific agents for Industry 4.0. AssetOpsBench provides a multimodal ecosystem comprising a catalog of four domain-specific agents, a curated dataset of 140+ human-authored natural-language queries grounded in real industrial scenarios, and a simulated, CouchDB-backed IoT environment. We introduce an automated evaluation framework that uses three key metrics to analyze architectural trade-offs between the Tool-As-Agent and Plan-Executor paradigms, along with a systematic procedure for the automated discovery of emerging failure modes. The practical relevance of AssetOpsBench is demonstrated by its broad community adoption, with 250+ users and over 500 agents submitted to our public benchmarking platform, supporting reproducible and scalable research for real-world industrial operations. The code is accesible at https://github.com/IBM/AssetOpsBench .
[328] Infeasibility Aware Large Language Models for Combinatorial Optimization
Yakun Wang, Min Chen, Zeguan Wu, Junyu Liu, Sitao Zhang, Zhenwen Shao
Main category: cs.AI
TL;DR: LLM-based framework for combinatorial optimization with infeasibility detection, applied to minor-embedding problem, achieving improved accuracy and search acceleration
Details
Motivation: Current LLM approaches for combinatorial optimization focus on feasible solution generation but lack explicit infeasibility detection capabilities, which is crucial for practical optimization problemsMethod: Proposes infeasibility-aware framework with certifiable dataset construction, supervised fine-tuning of LLMs, and LLM-assisted downstream search; introduces new mathematical programming formulation with zero-phase infeasibility screening for minor-embedding problem
Result: Fine-tuned 8B-parameter LLM improves overall accuracy by up to 30% over GPT-5.2; LLM-guided warm starts provide up to 2x speedup in downstream local search
Conclusion: The framework successfully enables LLMs to jointly perform solution generation and infeasibility detection, with practical benefits for combinatorial optimization acceleration
Abstract: Large language models (LLMs) are increasingly explored for NP-hard combinatorial optimization problems, but most existing methods emphasize feasible-instance solution generation and do not explicitly address infeasibility detection. We propose an infeasibility-aware framework that combines certifiable dataset construction, supervised fine-tuning, and LLM-assisted downstream search. For the minor-embedding problem, we introduce a new mathematical programming formulation together with provable zero-phase infeasibility screening, which enables scalable construction of training instances labeled either as feasible with structured certificates or as certifiably infeasible. Using training data generated through this exact optimization pipeline, we show that an 8B-parameter LLM can be fine-tuned to jointly perform solution generation and infeasibility detection. We further utilize LLM outputs as warm starts for downstream local search, providing a practical way to accelerate optimization even when the LLM outputs are imperfect. Experiments show that our fine-tuned model improves overall accuracy by up to 30% over GPT-5.2; meanwhile LLM-guided warm starts provide up to $2\times$ speedup compared with starting from scratch in downstream local search.
[329] Reducing Hallucinations in LLM-based Scientific Literature Analysis Using Peer Context Outlier Detection
Daniel Xie, Maxwell J. Jacobson, Adil Wazeer, Haiyan Wang, Xinghang Zhang, Yexiang Xue
Main category: cs.AI
TL;DR: P-COD reduces LLM hallucinations in scientific literature summarization by using peer document relationships to validate extracted data and flag outliers for expert review.
Details
Motivation: Current methods for reducing LLM hallucinations focus on individual documents but fail to consider relationships across a corpus, limiting accuracy in scientific literature summarization where similar experiments should yield similar conclusions.Method: Peer Context Outlier Detection (P-COD) compares extracted data from documents to validated peer information within the corpus, adjusting confidence scores and flagging low-confidence results for expert review while considering high-confidence, peer-validated results as reliable.
Result: Experiments demonstrate up to 98% precision in outlier detection across 6 scientific domains, showing reduced hallucinations, enhanced trust in automated systems, and streamlined data extraction workflows.
Conclusion: P-COD effectively reduces LLM hallucinations by leveraging document relationships, improving extraction accuracy in scientific literature summarization and allowing researchers to focus on ambiguous cases.
Abstract: Reducing hallucinations in Large Language Models (LLMs) is essential for improving the accuracy of data extraction from large text corpora. Current methods, like prompt engineering and chain-of-thought prompting, focus on individual documents but fail to consider relationships across a corpus. This paper introduces Peer Context Outlier Detection (P-COD), a novel approach that uses the relationships between documents to improve extraction accuracy. Our application domain is in scientific literature summarization, where papers with similar experiment settings should draw similar conclusions. By comparing extracted data to validated peer information within the corpus, we adjust confidence scores and flag low-confidence results for expert review. High-confidence results, supported by peer validation, are considered reliable. Our experiments demonstrate up to 98% precision in outlier detection across 6 domains of science, demonstrating that our design reduces hallucinations, enhances trust in automated systems, and allows researchers to focus on ambiguous cases, streamlining the data extraction workflows.
[330] A Self-Evolving Agentic Framework for Metasurface Inverse Design
Yi Huang, Bowen Zheng, Yunxi Dong, Hong Tang, Huan Zhao, S. M. Rakibul Hasan Shawon, Hualiang Zhang
Main category: cs.AI
TL;DR: Agentic framework for metasurface inverse design using LLMs with context-level skill evolution to accumulate reusable workflow knowledge across tasks.
Details
Motivation: Metasurface inverse design requires specialized expertise in computational electromagnetics and solver-specific software engineering. Existing LLM-driven systems are session-bounded and don't preserve reusable workflow knowledge across tasks.Method: Framework couples a coding agent, evolving skill artifacts, and deterministic evaluator grounded in physical simulation. Solver-specific strategies are iteratively refined across tasks without modifying model weights or underlying physics solver.
Result: Evolved skills raised in-distribution task success from 38% to 74%, increased criteria pass fraction from 0.510 to 0.870, and reduced average attempts from 4.10 to 2.30. On held-out tasks, partial transfer of workflow knowledge observed.
Conclusion: Skill evolution accumulates reusable solver-specific expertise around reliable computational engines, offering a practical path toward more autonomous and accessible metasurface inverse-design workflows.
Abstract: Metasurface inverse design has become central to realizing complex optical functionality, yet translating target responses into executable, solver-compatible workflows still demands specialized expertise in computational electromagnetics and solver-specific software engineering. Recent large language models (LLMs) offer a complementary route to reducing this workflow-construction burden, but existing language-driven systems remain largely session-bounded and do not preserve reusable workflow knowledge across inverse-design tasks. We present an agentic framework for metasurface inverse design that addresses this limitation through context-level skill evolution. The framework couples a coding agent, evolving skill artifacts, and a deterministic evaluator grounded in physical simulation so that solver-specific strategies can be iteratively refined across tasks without modifying model weights or the underlying physics solver. We evaluate the framework on a benchmark spanning multiple metasurface inverse-design task types, with separate training-aligned and held-out task families. Evolved skills raise in-distribution task success from 38% to 74%, increase criteria pass fraction from 0.510 to 0.870, and reduce average attempts from 4.10 to 2.30. On held-out task families, binary success changes only marginally, but improvements in best margin together with shifts in error composition and agent behavior indicate partial transfer of workflow knowledge. These results suggest that the main value of skill evolution lies in accumulating reusable solver-specific expertise around reliable computational engines, thereby offering a practical path toward more autonomous and accessible metasurface inverse-design workflows.
[331] AgentSocialBench: Evaluating Privacy Risks in Human-Centered Agentic Social Networks
Prince Zizhuang Wang, Shuli Jiang
Main category: cs.AI
TL;DR: AgentSocialBench: First benchmark for evaluating privacy risks in human-centered agentic social networks where AI agents coordinate across domains and users, revealing fundamental privacy challenges beyond single-agent settings.
Details
Motivation: With the rise of personalized LLM agent frameworks enabling human-centered agentic social networks, there are novel privacy challenges as agents must coordinate across domain boundaries, mediate between humans, and interact with other users' agents while protecting sensitive personal information. Current research lacks exploration of privacy risks in this emerging setting.Method: Introduces AgentSocialBench, the first benchmark to systematically evaluate privacy risk in human-centered agentic social networks. It comprises scenarios across seven categories spanning dyadic and multi-party interactions, grounded in realistic user profiles with hierarchical sensitivity labels and directed social graphs.
Result: Experiments reveal that privacy in agentic social networks is fundamentally harder than in single-agent settings: (1) cross-domain and cross-user coordination creates persistent leakage pressure even when agents are explicitly instructed to protect information, (2) privacy instructions that teach agents how to abstract sensitive information paradoxically cause them to discuss it more (abstraction paradox).
Conclusion: Current LLM agents lack robust mechanisms for privacy preservation in human-centered agentic social networks, and new approaches beyond prompt engineering are needed to make agent-mediated social coordination safe for real-world deployment.
Abstract: With the rise of personalized, persistent LLM agent frameworks such as OpenClaw, human-centered agentic social networks in which teams of collaborative AI agents serve individual users in a social network across multiple domains are becoming a reality. This setting creates novel privacy challenges: agents must coordinate across domain boundaries, mediate between humans, and interact with other users’ agents, all while protecting sensitive personal information. While prior work has evaluated multi-agent coordination and privacy preservation, the dynamics and privacy risks of human-centered agentic social networks remain unexplored. To this end, we introduce AgentSocialBench, the first benchmark to systematically evaluate privacy risk in this setting, comprising scenarios across seven categories spanning dyadic and multi-party interactions, grounded in realistic user profiles with hierarchical sensitivity labels and directed social graphs. Our experiments reveal that privacy in agentic social networks is fundamentally harder than in single-agent settings: (1) cross-domain and cross-user coordination creates persistent leakage pressure even when agents are explicitly instructed to protect information, (2) privacy instructions that teach agents how to abstract sensitive information paradoxically cause them to discuss it more (we call it abstraction paradox). These findings underscore that current LLM agents lack robust mechanisms for privacy preservation in human-centered agentic social networks, and that new approaches beyond prompt engineering are needed to make agent-mediated social coordination safe for real-world deployment.
[332] LLM Agents as Social Scientists: A Human-AI Collaborative Platform for Social Science Automation
Lei Wang, Yuanzi Li, Jinchao Wu, Heyang Gao, Xiaohe Bo, Xu Chen, Ji-Rong Wen
Main category: cs.AI
TL;DR: S-Researcher is an LLM-agent platform that siliconizes social science research by automating experiments and simulating human participants, enabling scalable social simulations through YuLan-OneSim system with three reasoning modes.
Details
Motivation: Traditional social science research is labor-intensive, costly, and difficult to scale due to complex experimental designs and reliance on real human participants. There's a need for more efficient, scalable approaches to social science research.Method: Developed YuLan-OneSim, a large-scale social simulation system with auto-programming from natural language to executable scenarios, distributed architecture supporting up to 100k concurrent agents, and feedback-driven LLM fine-tuning. S-Researcher supports experiment design, LLM agent simulation, analysis, and report generation with three reasoning modes: induction, deduction, and abduction.
Result: Validated through systematic case studies: reproduced cultural dynamics consistent with Axelrod’s theory (induction), tested teacher attention hypotheses validated against survey data (deduction), and identified cooperation mechanisms in public goods games confirmed by human experiments (abduction).
Conclusion: S-Researcher establishes a new human-AI collaborative paradigm for social science where computational simulation augments human researchers to accelerate discovery across social inquiry, with researchers retaining oversight at every stage.
Abstract: Traditional social science research often requires designing complex experiments across vast methodological spaces and depends on real human participants, making it labor-intensive, costly, and difficult to scale. Here we present S-Researcher, an LLM-agent-based platform that assists researchers in conducting social science research more efficiently and at greater scale by “siliconizing” both the research process and the participant pool. To build S-Researcher, we first develop YuLan-OneSim, a large-scale social simulation system designed around three core requirements: generality via auto-programming from natural language to executable scenarios, scalability via a distributed architecture supporting up to 100,000 concurrent agents, and reliability via feedback-driven LLM fine-tuning. Leveraging this system, S-Researcher supports researchers in designing social experiments, simulating human behavior with LLM agents, analyzing results, and generating reports, forming a complete human-AI collaborative research loop in which researchers retain oversight and intervention at every stage. We operationalize LLM simulation research paradigms into three canonical reasoning modes (induction, deduction, and abduction) and validate S-Researcher through systematic case studies: inductive reproduction of cultural dynamics consistent with Axelrod’s theory, deductive testing of competing hypotheses on teacher attention validated against survey data, and abductive identification of a cooperation mechanism in public goods games confirmed by human experiments. S-Researcher establishes a new human–AI collaborative paradigm for social science, in which computational simulation augments human researchers to accelerate discovery across the full spectrum of social inquiry.
[333] PHMForge: A Scenario-Driven Agentic Benchmark for Industrial Asset Lifecycle Maintenance
Ayan Das, Dhaval Patel
Main category: cs.AI
TL;DR: PHMForge is a benchmark for evaluating LLM agents on industrial Prognostics and Health Management tasks through realistic MCP server interactions, revealing significant performance gaps in tool orchestration and generalization.
Details
Motivation: Existing benchmarks fail to capture the rigorous demands of industrial domains where incorrect decisions carry significant safety and financial consequences, creating a critical gap in evaluating LLM agents for real-world industrial applications.Method: Created PHMForge benchmark with 75 expert-curated scenarios across 7 industrial asset classes and 5 core task categories, implemented 65 specialized tools across two MCP servers, and used execution-based evaluators with task-specific metrics (MAE/RMSE, F1-score, categorical matching).
Result: Top-performing LLM agent configurations achieved only 68% task completion, with systematic failures in tool orchestration (23% incorrect sequencing), multi-asset reasoning (14.9 percentage point degradation), and cross-equipment generalization (42.7% on held-out datasets).
Conclusion: PHMForge reveals significant performance gaps in current LLM agents for industrial applications and provides an open-source benchmark to catalyze research in agentic industrial AI.
Abstract: Large language model (LLM) agents are increasingly deployed for complex tool-orchestration tasks, yet existing benchmarks fail to capture the rigorous demands of industrial domains where incorrect decisions carry significant safety and financial consequences. To address this critical gap, we introduce PHMForge, the first comprehensive benchmark specifically designed to evaluate LLM agents on Prognostics and Health Management (PHM) tasks through realistic interactions with domain-specific MCP servers. Our benchmark encompasses 75 expert-curated scenarios spanning 7 industrial asset classes (turbofan engines, bearings, electric motors, gearboxes, aero-engines) across 5 core task categories: Remaining Useful Life (RUL) Prediction, Fault Classification, Engine Health Analysis, Cost-Benefit Analysis, and Safety/Policy Evaluation. To enable rigorous evaluation, we construct 65 specialized tools across two MCP servers and implement execution-based evaluators with task-commensurate metrics: MAE/RMSE for regression, F1-score for classification, and categorical matching for health assessments. Through extensive evaluation of leading frameworks (ReAct, Cursor Agent, Claude Code) paired with frontier LLMs (Claude Sonnet 4.0, GPT-4o, Granite-3.0-8B), we find that even top-performing configurations achieve only 68% task completion, with systematic failures in tool orchestration (23% incorrect sequencing), multi-asset reasoning (14.9 percentage point degradation), and cross-equipment generalization (42.7% on held-out datasets). We open-source our complete benchmark, including scenario specifications, ground truth templates, tool implementations, and evaluation scripts, to catalyze research in agentic industrial AI.
[334] RAE-AR: Taming Autoregressive Models with Representation Autoencoders
Hu Yu, Hang Xu, Jie Huang, Zeyue Xue, Haoyang Huang, Nan Duan, Feng Zhao
Main category: cs.AI
TL;DR: RAE-AR integrates representation autoencoders into continuous autoregressive models for visual generation, addressing challenges of complex token distribution modeling and exposure bias through distribution normalization and noise injection.
Details
Motivation: Traditional VAE encoders dominate generative latent spaces, while representation encoders (DINO, SigLIP, MAE) were considered unsuitable for generation. Recent RAE methods show promise, but integration with continuous autoregressive models remains unexplored.Method: Proposes RAE-AR framework with two key innovations: 1) token simplification via distribution normalization to ease complex token-wise distribution modeling, and 2) Gaussian noise injection during training to mitigate exposure bias amplified by high-dimensional latents.
Result: The modifications substantially bridge the performance gap, enabling representation autoencoders to achieve results comparable to traditional VAEs on autoregressive models.
Conclusion: This work enables representation autoencoders to work effectively with AR models, paving the way for more unified architectures across visual understanding and generative modeling.
Abstract: The latent space of generative modeling is long dominated by the VAE encoder. The latents from the pretrained representation encoders (e.g., DINO, SigLIP, MAE) are previously considered inappropriate for generative modeling. Recently, RAE method lights the hope and reveals that the representation autoencoder can also achieve competitive performance as the VAE encoder. However, the integration of representation autoencoder into continuous autoregressive (AR) models, remains largely unexplored. In this work, we investigate the challenges of employing high-dimensional representation autoencoders within the AR paradigm, denoted as \textit{RAE-AR}. We focus on the unique properties of AR models and identify two primary hurdles: complex token-wise distribution modeling and the high-dimensionality amplified training-inference gap (exposure bias). To address these, we introduce token simplification via distribution normalization to ease modeling difficulty and improve convergence. Furthermore, we enhance prediction robustness by incorporating Gaussian noise injection during training to mitigate exposure bias. Our empirical results demonstrate that these modifications substantially bridge the performance gap, enabling representation autoencoder to achieve results comparable to traditional VAEs on AR models. This work paves the way for a more unified architecture across visual understanding and generative modeling.
[335] Does Your Optimizer Care How You Normalize? Normalization-Optimizer Coupling in LLM Training
Abdelrahman Abouzeid
Main category: cs.AI
TL;DR: Derf normalization layer suffers large negative interaction with Muon optimizer due to saturation and scale blindness issues, while DyT shows no such penalty; EMA-blend and alpha parameter tuning can recover most of the performance gap.
Details
Motivation: The paper investigates the assumption that normalization layers and optimizers can be treated as independent design choices in LLM training, revealing that certain combinations (Derf with Muon) suffer from significant negative interactions that are easy to miss in short pilot runs.Method: Conducted a 3x2 factorial experiment at 1B parameters and 1000 training steps comparing normalization layers (Derf, DyT, RMSNorm) with optimizers (AdamW, Muon). Analyzed failure modes through spectral-norm growth analysis, saturation effects, and scale blindness. Tested mitigation strategies including EMA-blend to reintroduce running scale estimates and alpha parameter tuning.
Result: Derf suffers a large negative interaction with Muon optimizer, with its performance gap to RMSNorm growing from +0.31 nats under AdamW to +0.97 under Muon (3x larger). DyT shows no such penalty. EMA-blend recovers ~84% of the gap, and reducing Derf’s alpha from 0.5 to 0.3 recovers ~80% by keeping erf in near-linear regime.
Conclusion: Normalization layers and optimizers cannot be treated as independent design choices; certain combinations exhibit significant negative interactions. The failure modes (saturation and scale blindness) under Muon’s faster spectral-norm growth explain Derf’s poor performance, and simple adjustments can recover most of the performance gap.
Abstract: In LLM training, normalization layers and optimizers are typically treated as independent design choices. In a 3x2 factorial at 1B parameters and 1000 training steps, we show this assumption can fail: Dynamic Erf (Derf; Chen & Liu, 2025) suffers a large negative interaction with Muon (Jordan, 2024), with its gap to RMSNorm growing from +0.31 nats under AdamW to +0.97 under Muon, approximately three times larger. Dynamic Tanh (DyT; Zhu et al., 2025), included as a bounded-normalizer control, shows no such penalty. Our evidence points to two failure modes of erf under Muon’s faster spectral-norm growth: saturation (lossy compression) and scale blindness (discarding activation magnitude). An EMA-blend that reintroduces running scale estimates recovers ~84% of the gap. Separately, reducing Derf’s alpha from its published default (0.5 to 0.3) recovers ~80% by keeping erf in its near-linear regime, where it approximately preserves relative scale; this setting is not the published default of Chen & Liu (2025). Using Derf’s published default alpha with Muon incurs a 0.66-nat interaction penalty without producing NaNs or divergence, making the failure easy to miss in short pilot runs.
[336] NED-Tree: Bridging the Semantic Gap with Nonlinear Element Decomposition Tree for LLM Nonlinear Optimization Modeling
Zhijing Hu, Yufan Deng, Haoyang Liu, Changjun Fan
Main category: cs.AI
TL;DR: NED-Tree is a framework that improves LLM performance on translating natural language OR problems to executable models, especially for nonlinear scenarios, using sentence-by-sentence extraction and recursive tree decomposition.
Details
Motivation: LLMs struggle with translating natural language OR problems to executable models, especially for nonlinear scenarios due to semantic misalignment between mathematical formulations and solver codes, and unstable information extraction.Method: NED-Tree uses (a) sentence-by-sentence extraction for robust parameter mapping and traceability, and (b) a recursive tree-based structure that adaptively decomposes complex nonlinear terms into solver-compatible sub-elements.
Result: Achieves 72.51% average accuracy across 10 benchmarks, establishing new state-of-the-art. NED-Tree is the first framework that enables LLMs to resolve nonlinear modeling difficulties through element decomposition.
Conclusion: NED-Tree successfully bridges the semantic gap between natural language OR problems and executable models, particularly for complex nonlinear scenarios, through systematic decomposition and extraction strategies.
Abstract: Automating the translation of Operations Research (OR) problems from natural language to executable models is a critical challenge. While Large Language Models (LLMs) have shown promise in linear tasks, they suffer from severe performance degradation in real-world nonlinear scenarios due to semantic misalignment between mathematical formulations and solver codes, as well as unstable information extraction. In this study, we introduce NED-Tree, a systematic framework designed to bridge the semantic gap. NED-Tree employs (a) a sentence-by-sentence extraction strategy to ensure robust parameter mapping and traceability; and (b) a recursive tree-based structure that adaptively decomposes complex nonlinear terms into solver-compatible sub-elements. Additionally, we present NEXTOR, a novel benchmark specifically designed for complex nonlinear, extensive-constraint OR problems. Experiments across 10 benchmarks demonstrate that NED-Tree establishes a new state-of-the-art with 72.51% average accuracy, NED-Tree is the first framework that drives LLMs to resolve nonlinear modeling difficulties through element decomposition, achieving alignment between modeling semantics and code semantics. The NED-Tree framework and benchmark are accessible in the anonymous repository https://anonymous.4open.science/r/NORA-NEXTOR.
[337] ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement
Difan Jiao, Qianfeng Wen, Blair Yang, Zhenwei Tang, Ashton Anderson
Main category: cs.AI
TL;DR: ThinkTwice is a two-phase RL framework that jointly trains LLMs to solve reasoning problems and refine their own solutions using the same binary correctness reward, without needing correctness signals or critique annotations.
Details
Motivation: Current RL approaches for training LLMs on reasoning tasks often separate reasoning from refinement phases, requiring additional correctness signals or critique annotations. The authors aim to develop a unified framework that jointly optimizes both reasoning and refinement capabilities using only binary correctness rewards.Method: ThinkTwice uses Group Relative Policy Optimization (GRPO) in a two-phase approach: (1) optimize the model on solving reasoning problems, then (2) optimize it on refining its own solutions to the same problems. Both phases use the same binary correctness reward without additional signals or annotations.
Result: ThinkTwice substantially improves both reasoning and refinement performance across five mathematical reasoning benchmarks and two model families (Qwen3-4B and Olmo3-7B). On Qwen3-4B, it outperforms GRPO on AIME by 5 percentage points before refinement and by 11.5 points after one self-refinement step (pass@4).
Conclusion: Joint training of reasoning and self-refinement is established as a principled and effective methodology for RL from Verifiable Feedback (RLVR). The framework reveals an implicit curriculum where refinement corrects errors early and preserves correct solutions later, yielding better reward signals.
Abstract: We introduce ThinkTwice, a simple two-phase framework that jointly optimizes LLMs to solve reasoning problems and refine the answers, based on Group Relative Policy Optimization (GRPO). In each pair of training steps, ThinkTwice first optimizes the model on solving reasoning problems, then optimizes it on refining its own solutions to the same problems, using the same binary correctness reward in both phases without correctness signals or critique annotations. Across five mathematical reasoning benchmarks and two model families including Qwen3-4B and Olmo3-7B, ThinkTwice substantially improves both reasoning and refinement performance over competitive online policy optimization baselines. Specifically, on Qwen3-4B, ThinkTwice outperforms GRPO on AIME by 5 percentage points before refinement and by 11.5 points after one self-refinement step, measured by pass@4. Analysis of the training dynamics of ThinkTwice reveals an implicit rectify-then-fortify curriculum: refinement predominantly corrects errors early in training and naturally shifts toward preserving already-correct solutions as the model improves, yielding a more rectified reward signal. Our work establishes joint training of reasoning and self-refinement as a principled and effective methodology for RLVR.
[338] Do Large Language Models Mentalize When They Teach?
Sevan K. Harootonian, Mark K. Ho, Thomas L. Griffiths, Yael Niv, Ilia Sucholutsky
Main category: cs.AI
TL;DR: LLMs can perform teaching tasks by inferring learner knowledge through Bayesian reasoning rather than simple heuristics, similar to human teaching strategies.
Details
Motivation: To understand whether LLMs teach by reasoning about learners' knowledge (theory of mind) or using simpler rules of thumb, using a controlled teaching task previously studied in humans.Method: Tested various LLMs as simulated teachers in a graph-based teaching task where they must reveal a single edge to help a hypothetical learner choose better paths. Fitted their choices with cognitive models including Bayes-Optimal teaching (inverse planning), weaker Bayesian variants, heuristic baselines, and non-mentalizing utility models.
Result: Most LLMs performed well, showed stable strategies, and performed similarly to humans. Bayes-Optimal teaching best explained most models’ choices. Scaffolding interventions (prompts) changed behavior but didn’t reliably improve teaching on challenging graphs and sometimes reduced performance.
Conclusion: LLMs can employ sophisticated teaching strategies involving inference about learner knowledge, but prompt compliance doesn’t guarantee better teaching decisions. Cognitive modeling provides insight into LLM tutoring policies.
Abstract: How do LLMs decide what to teach next: by reasoning about a learner’s knowledge, or by using simpler rules of thumb? We test this in a controlled task previously used to study human teaching strategies. On each trial, a teacher LLM sees a hypothetical learner’s trajectory through a reward-annotated directed graph and must reveal a single edge so the learner would choose a better path if they replanned. We run a range of LLMs as simulated teachers and fit their trial-by-trial choices with the same cognitive models used for humans: a Bayes-Optimal teacher that infers which transitions the learner is missing (inverse planning), weaker Bayesian variants, heuristic baselines (e.g., reward based), and non-mentalizing utility models. In a baseline experiment matched to the stimuli presented to human subjects, most LLMs perform well, show little change in strategy over trials, and their graph-by-graph performance is similar to that of humans. Model comparison (BIC) shows that Bayes-Optimal teaching best explains most models’ choices. When given a scaffolding intervention, models follow auxiliary inference- or reward-focused prompts, but these scaffolds do not reliably improve later teaching on heuristic-incongruent test graphs and can sometimes reduce performance. Overall, cognitive model fits provide insight into LLM tutoring policies and show that prompt compliance does not guarantee better teaching decisions.
[339] ByteRover: Agent-Native Memory Through LLM-Curated Hierarchical Context
Andy Nguyen, Danh Doan, Hoang Pham, Bao Ha, Dat Pham, Linh Nguyen, Hieu Nguyen, Thien Nguyen, Cuong Do, Phat Nguyen, Toan Nguyen
Main category: cs.AI
TL;DR: ByteRover is an agent-native memory architecture that integrates memory management directly into LLMs, using hierarchical Context Trees for knowledge representation and progressive retrieval strategies, eliminating external infrastructure dependencies.
Details
Motivation: Existing memory-augmented generation systems treat memory as an external service, leading to semantic drift between agent intent and stored knowledge, loss of coordination context, and fragile failure recovery. The architectural separation means the storage system doesn't understand the knowledge it stores.Method: ByteRover inverts the memory pipeline by having the same LLM that reasons about tasks also curate, structure, and retrieve knowledge. It uses hierarchical Context Trees (Domain→Topic→Subtopic→Entry) with explicit relations, provenance, and Adaptive Knowledge Lifecycle (AKL) with importance scoring, maturity tiers, and recency decay. Retrieval employs a 5-tier progressive strategy that minimizes LLM calls.
Result: ByteRover achieves state-of-the-art accuracy on LoCoMo and competitive results on LongMemEval benchmarks. It requires zero external infrastructure, no vector/graph databases, no embedding services, with all knowledge stored as human-readable markdown files on local filesystem.
Conclusion: ByteRover demonstrates that agent-native memory architectures can outperform traditional external memory services while being simpler, more reliable, and eliminating infrastructure dependencies. The approach enables better semantic alignment between reasoning and memory.
Abstract: Memory-Augmented Generation (MAG) extends large language models with external memory to support long-context reasoning, but existing approaches universally treat memory as an external service that agents call into, delegating storage to separate pipelines of chunking, embedding, and graph extraction. This architectural separation means the system that stores knowledge does not understand it, leading to semantic drift between what the agent intended to remember and what the pipeline actually captured, loss of coordination context across agents, and fragile recovery after failures. In this paper, we propose ByteRover, an agent-native memory architecture that inverts the memory pipeline: the same LLM that reasons about a task also curates, structures, and retrieves knowledge. ByteRover represents knowledge in a hierarchical Context Tree, a file-based knowledge graph organized as Domain, Topic, Subtopic, and Entry, where each entry carries explicit relations, provenance, and an Adaptive Knowledge Lifecycle (AKL) with importance scoring, maturity tiers, and recency decay. Retrieval uses a 5-tier progressive strategy that resolves most queries at sub-100 ms latency without LLM calls, escalating to agentic reasoning only for novel questions. Experiments on LoCoMo and LongMemEval demonstrate that ByteRover achieves state-of-the-art accuracy on LoCoMo and competitive results on LongMemEval while requiring zero external infrastructure, no vector database, no graph database, no embedding service, with all knowledge stored as human-readable markdown files on the local filesystem.
[340] MM-ReCoder: Advancing Chart-to-Code Generation with Reinforcement Learning and Self-Correction
Zitian Tang, Xu Zhang, Jianbo Yuan, Yang Zou, Varad Gunjal, Songyao Jiang, Davide Modolo
Main category: cs.AI
TL;DR: MM-ReCoder: A chart-to-code generation MLLM trained with reinforcement learning and self-correction via two-stage multi-turn RL strategy using Group Relative Policy Optimization
Details
Motivation: Existing MLLMs for chart-to-code generation rely on supervised fine-tuning which doesn't expose models to code execution environments, and even state-of-the-art MLLMs struggle with effective self-correction through execution feedbackMethod: Two-stage multi-turn self-correction RL strategy based on Group Relative Policy Optimization (GRPO): first stage enhances self-correction ability via shared first turn rollout, second stage improves coding capability with full-trajectory optimization
Result: State-of-the-art performance on three chart-to-code benchmarks, producing more accurate and executable code through environment interaction and iterative self-correction
Conclusion: MM-ReCoder demonstrates that RL training with self-correction ability significantly improves chart-to-code generation performance in multimodal LLMs
Abstract: Multimodal Large Language Models (MLLMs) have recently demonstrated promising capabilities in multimodal coding tasks such as chart-to-code generation. However, existing methods primarily rely on supervised fine-tuning (SFT), which requires the model to learn code patterns through chart-code pairs but does not expose the model to a code execution environment. Moreover, while self-correction through execution feedback offers a potential route to improve coding quality, even state-of-the-art MLLMs have been shown to struggle with effective self-correction. In this work, we introduce MM-ReCoder, a chart-to-code generation model trained with reinforcement learning (RL) and equipped with self-correction ability. We propose a two-stage multi-turn self-correction RL strategy based on Group Relative Policy Optimization (GRPO). The first stage enhances the model’s self-correction ability via rolling out a shared first turn, while the second stage improves the coding capability with full-trajectory optimization. MM-ReCoder learns to produce more accurate and executable code through the interaction with the environment and by iteratively correcting its own outputs. Our results on three chart-to-code benchmarks demonstrate the state-of-the-art performance of MM-ReCoder.
[341] CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders
Su-Hyeon Kim, Hyundong Jin, Yejin Lee, Yo-Sub Han
Main category: cs.AI
TL;DR: CRaFT is a circuit-guided refusal feature selection framework that identifies causal features for LLM refusal behavior by analyzing prompts near the refusal boundary, improving jailbreak attack success rates significantly.
Details
Motivation: Current methods for understanding LLM refusal behavior rely on feature activation magnitude, which captures superficial signals rather than causal factors. There's a need for better methods to identify the internal mechanisms that causally mediate refusal decisions.Method: CRaFT ranks features by their influence on the model’s refusal-compliance decision using prompts near the refusal boundary, focusing on circuit influence rather than activation magnitude. It identifies causal refusal features through boundary analysis.
Result: On Gemma-3-1B-it, CRaFT improves attack success rate from 6.7% to 48.2% and outperforms baseline methods across multiple jailbreak benchmarks, demonstrating superior feature selection for inducing compliance.
Conclusion: Circuit influence is a more reliable criterion than activation magnitude for identifying features that causally mediate refusal behavior in LLMs, enabling more effective analysis of safety mechanisms.
Abstract: As safety concerns around large language models (LLMs) grow, understanding the internal mechanisms underlying refusal behavior has become increasingly important. Recent work has studied this behavior by identifying internal features associated with refusal and manipulating them to induce compliance with harmful requests. However, existing refusal feature selection methods rely on how strongly features activate on harmful prompts, which tends to capture superficial signals rather than the causal factors underlying the refusal decision. We propose CRaFT, a circuit-guided refusal feature selection framework that ranks features by their influence on the model’s refusal-compliance decision using prompts near the refusal boundary. On Gemma-3-1B-it, CRaFT improves attack success rate (ASR) from 6.7% to 48.2% and outperforms baseline methods across multiple jailbreak benchmarks. These results suggest that circuit influence is a more reliable criterion than activation magnitude for identifying features that causally mediate refusal behavior.
[342] From Multi-Agent to Single-Agent: When Is Skill Distillation Beneficial?
Binyan Xu, Dong Fang, Haitao Li, Kehuan Zhang
Main category: cs.AI
TL;DR: Skill distillation from multi-agent to single-agent systems has inconsistent outcomes; metric topology (Metric Freedom F) predicts skill utility, enabling adaptive distillation that preserves exploration for free metrics and focuses refinement for rigid metrics.
Details
Motivation: Multi-agent systems suffer from coordination overhead and context fragmentation, but distilling them into single-agent skills yields inconsistent results. The paper aims to understand what governs skill utility in this distillation process.Method: Introduces Metric Freedom (F) to measure topological rigidity of evaluation metrics’ scoring landscapes. Proposes two-stage adaptive distillation: Stage 1 extracts tools/knowledge while preserving exploration for free metrics; Stage 2 focuses iterative refinement for rigid metrics (F ≤ 0.6).
Result: F strongly predicts skill utility (ρ = -0.62, p < 0.05). Identical agent trajectories yield opposite skill lifts under rigid vs free metrics. Adaptive distillation matches/exceeds original MAS while reducing cost up to 8× and latency up to 15× across 4 tasks, 11 datasets, and 6 metrics.
Conclusion: Skill utility is fundamentally a metric-level property, not task-dependent. Metric Freedom enables principled distillation that adapts to metric topology, solving the inconsistency problem in multi-agent to single-agent skill conversion.
Abstract: Multi-agent systems (MAS) tackle complex tasks by distributing expertise, though this often comes at the cost of heavy coordination overhead, context fragmentation, and brittle phase ordering. Distilling a MAS into a single-agent skill can bypass these costs, but this conversion lacks a principled answer for when and what to distill. Instead, the empirical outcome is surprisingly inconsistent: skill lift ranges from a 28% improvement to a 2% degradation across metrics of the exact same task. In this work, we reveal that skill utility is governed not by the task, but by the evaluation metric. We introduce Metric Freedom ($F$), the first a priori predictor of skill utility. $F$ measures the topological rigidity of a metric’s scoring landscape by quantifying how output diversity couples with score variance via a Mantel test. Guided by $F$, we propose a two-stage adaptive distillation framework. Stage 1 acts as a selective extraction mechanism, extracting tools and knowledge while discarding restrictive structures on “free” metrics to preserve exploration. Stage 2 targets computationally intensive iterative refinement exclusively toward “rigid” metrics ($F \lesssim 0.6$) to eliminate trajectory-local overfitting. Evaluating across 4 tasks, 11 datasets, and 6 metrics, $F$ strongly predicts skill utility ($ρ= -0.62$, $p < 0.05$). Strikingly, identical agent trajectories yield diametrically opposite skill lifts under rigid versus free metrics, demonstrating that skill utility is fundamentally a metric-level property. Driven by this signal, our adaptive agent matches or exceeds the original MAS while reducing cost up to 8$\times$ and latency by up to 15$\times$.
[343] GraphWalk: Enabling Reasoning in Large Language Models through Tool-Based Graph Navigation
Taraneh Ghandi, Hamidreza Mahyar, Shachar Klaiman
Main category: cs.AI
TL;DR: GraphWalk: A training-free, tool-based framework enabling LLMs to perform multi-hop reasoning on large knowledge graphs through sequential graph navigation operations, overcoming context window limitations.
Details
Motivation: Standard approaches for knowledge graph Q&A face severe limitations with enterprise-scale KGs that cannot fit in context windows. Prompting techniques and retrieval-augmented generation struggle with large graphs, requiring a solution for scalable multi-hop reasoning.Method: GraphWalk equips LLMs with a minimal set of orthogonal graph operations (navigation tools) to traverse any graph structure. The framework is training-free and problem-agnostic, allowing models to compose operations into multi-step reasoning chains with transparent execution traces.
Result: Tool-based traversal yields substantial and consistent gains over in-context baselines across all model families tested. Gains become more pronounced as scale increases, precisely where in-context approaches fail catastrophically. The approach works on synthetic graphs with random labels, isolating structural reasoning from world knowledge.
Conclusion: GraphWalk enables off-the-shelf LLMs to reason through sequential graph navigation, dramatically increasing performance on complex multi-hop reasoning tasks over large knowledge graphs, overcoming context window limitations that plague standard approaches.
Abstract: The use of knowledge graphs for grounding agents in real-world Q&A applications has become increasingly common. Answering complex queries often requires multi-hop reasoning and the ability to navigate vast relational structures. Standard approaches rely on prompting techniques that steer large language models to reason over raw graph context, or retrieval-augmented generation pipelines where relevant subgraphs are injected into the context. These, however, face severe limitations with enterprise-scale KGs that cannot fit in even the largest context windows available today. We present GraphWalk, a problem-agnostic, training-free, tool-based framework that allows off-the-shelf LLMs to reason through sequential graph navigation, dramatically increasing performance across different tasks. Unlike task-specific agent frameworks that encode domain knowledge into specialized tools, GraphWalk equips the LLM with a minimal set of orthogonal graph operations sufficient to traverse any graph structure. We evaluate whether models equipped with GraphWalk can compose these operations into correct multi-step reasoning chains, where each tool call represents a verifiable step creating a transparent execution trace. We first demonstrate our approach on maze traversal, a problem non-reasoning models are completely unable to solve, then present results on graphs resembling real-world enterprise knowledge graphs. To isolate structural reasoning from world knowledge, we evaluate on entirely synthetic graphs with random, non-semantic labels. Our benchmark spans 12 query templates from basic retrieval to compound first-order logic queries. Results show that tool-based traversal yields substantial and consistent gains over in-context baselines across all model families tested, with gains becoming more pronounced as scale increases, precisely where in-context approaches fail catastrophically.
[344] Analysis of LLM Performance on AWS Bedrock: Receipt-item Categorisation Case Study
Gabby Sanchez, Sneha Oommen, Cassandra T. Britto, Di Wang, Jung-De Chiou, Maria Spichkova
Main category: cs.AI
TL;DR: Systematic evaluation of LLMs for receipt-item categorization comparing Claude 3.7 Sonnet, Claude 4 Sonnet, Mixtral 8x7B, and Mistral 7B on accuracy, stability, and cost efficiency.
Details
Motivation: To provide a production-oriented, cost-aware evaluation of LLMs for receipt-item categorization, assessing performance across accuracy, response stability, and token-level costs, and determining optimal prompting strategies (zero-shot vs few-shot) for both accuracy and cost efficiency.Method: Compared four instruction-tuned LLMs (Claude 3.7 Sonnet, Claude 4 Sonnet, Mixtral 8x7B Instruct, Mistral 7B Instruct) available through AWS Bedrock. Evaluated performance across accuracy, response stability, and token-level costs. Investigated prompting methods (zero-shot vs few-shot) for optimal balance of accuracy and costs.
Result: Claude 3.7 Sonnet achieved the most favorable balance between classification accuracy and cost efficiency among the tested models.
Conclusion: For receipt-item categorization tasks, Claude 3.7 Sonnet provides the best trade-off between accuracy and cost, with specific prompting strategies offering optimal performance-cost balance for production deployment.
Abstract: This paper presents a systematic, cost-aware evaluation of large language models (LLMs) for receipt-item categorisation within a production-oriented classification framework. We compare four instruction-tuned models available through AWS Bedrock: Claude 3.7 Sonnet, Claude 4 Sonnet, Mixtral 8x7B Instruct, and Mistral 7B Instruct. The aim of the study was (1) to assess performance across accuracy, response stability, and token-level cost, and (2) to investigate what prompting methods, zero-shot or few-shot, are especially appropriate both in terms of accuracy and in terms of incurred costs. Results of our experiments demonstrated that Claude 3.7 Sonnet achieves the most favourable balance between classification accuracy and cost efficiency.
[345] OSCAR: Orchestrated Self-verification and Cross-path Refinement
Yash Shah, Abhijit Chakraborty, Naresh Kumar Devulapally, Vishnu Lokhande, Vivek Gupta
Main category: cs.AI
TL;DR: OSCAR is a training-free inference-time framework for diffusion language models that uses parallel denoising chains with randomized reveal orders to detect factual uncertainty and perform targeted remasking with retrieved evidence to reduce hallucinations.
Details
Motivation: Current hallucination mitigation approaches rely on externally trained classifiers rather than leveraging the native denoising trajectories of diffusion language models. The authors aim to develop a model-native framework that intervenes during generation using the inherent uncertainty signals available in diffusion models' denoising processes.Method: OSCAR runs N parallel denoising chains with randomized reveal orders, computes cross-chain Shannon entropy to detect high-uncertainty positions (commitment uncertainty localization), and performs targeted remasking conditioned on retrieved evidence. It introduces trajectory-level assessments including cross-chain divergence-at-hallucination (CDH) metric for principled comparison.
Result: On TriviaQA, HotpotQA, RAGTruth, and CommonsenseQA using LLaDA-8B and Dream-7B, OSCAR significantly reduces hallucinated content and improves factual accuracy through uncertainty-guided remasking. The framework’s native entropy-based uncertainty signal surpasses specialized trained detectors.
Conclusion: Diffusion language models have inherent capacity to identify factual uncertainty through their denoising trajectories, which is not present in autoregressive models. OSCAR demonstrates effective hallucination mitigation through uncertainty-guided remasking without requiring additional training.
Abstract: Diffusion language models (DLMs) expose their denoising trajectories, offering a natural handle for inference-time control; accordingly, an ideal hallucination mitigation framework should intervene during generation using this model-native signal rather than relying on an externally trained hallucination classifier. Toward this, we formulate commitment uncertainty localization: given a denoising trajectory, identify token positions whose cross-chain entropy exceeds an unsupervised threshold before factually unreliable commitments propagate into self-consistent but incorrect outputs. We introduce a suite of trajectory-level assessments, including a cross-chain divergence-at-hallucination (CDH) metric, for principled comparison of localization methods. We also introduce OSCAR, a training-free inference-time framework operationalizing this formulation. OSCAR runs N parallel denoising chains with randomized reveal orders, computes cross-chain Shannon entropy to detect high-uncertainty positions, and then performs targeted remasking conditioned on retrieved evidence. Ablations confirm that localization and correction contribute complementary gains, robust across N in {4, 8, 16}. On TriviaQA, HotpotQA, RAGTruth, and CommonsenseQA using LLaDA-8B and Dream-7B, OSCAR enhances generation quality by significantly reducing hallucinated content and improving factual accuracy through uncertainty-guided remasking, which also facilitates more effective integration of retrieved evidence. Its native entropy-based uncertainty signal surpasses that of specialized trained detectors, highlighting an inherent capacity of diffusion language models to identify factual uncertainty that is not present in the sequential token commitment structure of autoregressive models. We are releasing the codebase1 to support future research on localization and uncertainty-aware generation in DLMs.
[346] Exploring Robust Multi-Agent Workflows for Environmental Data Management
Boyuan Guan, Jason Liu, Yanzhao Wu, Kiavash Bahreini
Main category: cs.AI
TL;DR: EnviSmart: A production data management system using LLM-driven agents with reliability mechanisms for environmental FAIR data management, featuring three-track knowledge architecture and role-separated multi-agent design to prevent irreversible errors.
Details
Motivation: LLM-driven agents can scale data curation but introduce probabilistic failures where plausible but incorrect outputs may lead to irreversible actions like DOI minting and public release. Need reliable systems for environmental data management.Method: Three-track knowledge architecture externalizes governance constraints, domain knowledge, and tool-using procedures as persistent artifacts. Role-separated multi-agent design with deterministic validators and audited handoffs restores fail-stop semantics before irreversible steps.
Result: Multi-agent approach improved efficiency (single operator completed in two days) and reliability: detected and blocked coordinate transformation error affecting 2,452 stations before publication. Incident ISS-004 showed 10-minute detection, zero user exposure, 80-minute resolution.
Conclusion: EnviSmart demonstrates that treating reliability as an architectural property through knowledge externalization and multi-agent separation can make LLM-driven data management both scalable and trustworthy for environmental research.
Abstract: Embedding LLM-driven agents into environmental FAIR data management is compelling - they can externalize operational knowledge and scale curation across heterogeneous data and evolving conventions. However, replacing deterministic components with probabilistic workflows changes the failure mode: LLM pipelines may generate plausible but incorrect outputs that pass superficial checks and propagate into irreversible actions such as DOI minting and public release. We introduce EnviSmart, a production data management system deployed on campus-wide storage infrastructure for environmental research. EnviSmart treats reliability as an architectural property through two mechanisms: a three-track knowledge architecture that externalizes behaviors (governance constraints), domain knowledge (retrievable context), and skills (tool-using procedures) as persistent, interlocking artifacts; and a role-separated multi-agent design where deterministic validators and audited handoffs restore fail-stop semantics at trust boundaries before irreversible steps. We compare two production deployments. The University’s GIS Center Ecological Archive (849 curated datasets) serves as a single-agent baseline. SF2Bench, a compound flooding benchmark comprising 2,452 monitoring stations and 8,557 published files spanning 39 years, validates the multi-agent workflow. The multi-agent approach improved both efficiency - completed by a single operator in two days with repeated artifact reuse across deployments - and reliability: audited handoffs detected and blocked a coordinate transformation error affecting all 2,452 stations before publication. A representative incident (ISS-004) demonstrated boundary-based containment with 10-minute detection latency, zero user exposure, and 80-minute resolution. This paper has been accepted at PEARC 2026.
[347] ThinknCheck: Grounded Claim Verification with Compact, Reasoning-Driven, and Interpretable Models
Delip Rao, Feijiang Han, Chris Callison-Burch
Main category: cs.AI
TL;DR: ThinknCheck is a 1B-parameter verifier for grounded claim verification that uses structured reasoning before binary verdicts, achieving competitive accuracy with fewer parameters than larger models.
Details
Motivation: To create resource-efficient and interpretable verifiers for factual claims that can produce structured rationales before making binary decisions, addressing the limitations of larger models and zero-shot approaches.Method: Fine-tune a 4-bit quantized Gemma3-1B model on LLMAggreFact-Think dataset (24.1k reasoning-augmented examples) to follow a structured format: first produce a short rationale, then a binary verdict. Compare with zero-shot chain-of-thought and preference optimization approaches.
Result: Achieves 78.1 balanced accuracy on LLMAggreFact (surpassing MiniCheck-7B with 7x fewer parameters), 64.7 BAcc on SciFact (+14.7 gain over MiniCheck-7B). Removing reasoning reduces accuracy to 57.5. Domain-specialized ThinknCheck-Science achieves 61.0% accuracy on GSMClaims.
Conclusion: Explicit, supervised reasoning enables compact verifiers that are competitive while remaining resource-efficient and interpretable, outperforming both larger models and zero-shot approaches.
Abstract: We present ThinknCheck, a 1B-parameter verifier for grounded claim verification that first produces a short, structured rationale and then a binary verdict. We construct LLMAggreFact-Think, a 24.1k reasoning-augmented training set derived from LLMAggreFact, and fine-tune a 4-bit Gemma3 model to follow this format. On LLMAggreFact, ThinknCheck attains 78.1 balanced accuracy (BAcc), surpassing MiniCheck-7B (77.4) with 7x fewer parameters; removing the reasoning step reduces BAcc to 57.5. On SciFact, ThinknCheck reaches 64.7 BAcc, a +14.7 absolute gain over MiniCheck-7B. By contrast, zero-shot chain-of-thought on the base Gemma3-1B harms accuracy relative to direct answers, and preference optimization with a simple format+accuracy reward underperforms supervised reasoning. To probe the latter, we introduce GSMClaims and a domain-specialized variant, ThinknCheck-Science, which improves across benchmarks, including 61.0% accuracy on GSMClaims. Overall, explicit, supervised reasoning enables compact verifiers that are competitive while remaining resource-efficient and interpretable.
[348] CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery
Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, Jiacheng Zhu, Xuan Jiang, Sirui Li, Cathy Wu, Bryan Kian Hsiang Low, Jinhua Zhao, Paul Pu Liang
Main category: cs.AI
TL;DR: CORAL is an autonomous multi-agent evolution framework for open-ended discovery that replaces fixed heuristics with long-running agents that explore, reflect, and collaborate through shared memory and asynchronous execution.
Details
Motivation: Existing LLM-based evolution methods rely heavily on fixed heuristics and hard-coded exploration rules, limiting agent autonomy and hindering sustained open-ended discovery and knowledge accumulation.Method: CORAL uses long-running agents with shared persistent memory, asynchronous multi-agent execution, heartbeat-based interventions, and practical safeguards including isolated workspaces, evaluator separation, and resource management.
Result: CORAL sets new SOTA on 10 diverse tasks, achieving 3-10x higher improvement rates with fewer evaluations than fixed evolutionary search baselines. On Anthropic’s kernel engineering task, four co-evolving agents improved the best known score from 1363 to 1103 cycles.
Conclusion: Greater agent autonomy and multi-agent evolution can substantially improve open-ended discovery, with gains arising from knowledge reuse and multi-agent exploration and communication.
Abstract: Large language model (LLM)-based evolution is a promising approach for open-ended discovery, where progress requires sustained search and knowledge accumulation. Existing methods still rely heavily on fixed heuristics and hard-coded exploration rules, which limit the autonomy of LLM agents. We present CORAL, the first framework for autonomous multi-agent evolution on open-ended problems. CORAL replaces rigid control with long-running agents that explore, reflect, and collaborate through shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions. It also provides practical safeguards, including isolated workspaces, evaluator separation, resource management, and agent session and health management. Evaluated on diverse mathematical, algorithmic, and systems optimization tasks, CORAL sets new state-of-the-art results on 10 tasks, achieving 3-10 times higher improvement rates with far fewer evaluations than fixed evolutionary search baselines across tasks. On Anthropic’s kernel engineering task, four co-evolving agents improve the best known score from 1363 to 1103 cycles. Mechanistic analyses further show how these gains arise from knowledge reuse and multi-agent exploration and communication. Together, these results suggest that greater agent autonomy and multi-agent evolution can substantially improve open-ended discovery. Code is available at https://github.com/Human-Agent-Society/CORAL.
[349] Ontology-Aware Design Patterns for Clinical AI Systems: Translating Reification Theory into Software Architecture
Florian Odi Stummer
Main category: cs.AI
TL;DR: Proposes seven ontology-aware design patterns for building clinical AI pipelines resilient to data distortion from documentation workflows, billing incentives, and terminology fragmentation.
Details
Motivation: Clinical AI systems train on health data distorted by documentation workflows, billing incentives, and terminology fragmentation, but translating theoretical insights about these distortions into implementable software architecture remains an open problem.Method: Proposes seven design patterns in Gang-of-Four pattern language: Ontological Checkpoint, Dormancy-Aware Pipeline, Drift Sentinel, Dual-Ontology Layer, Reification Circuit Breaker, Terminology Version Gate, and Regulatory Compliance Adapter.
Result: Provides a design vocabulary grounded in theoretical analysis with a reference architecture for a primary care AI system and walkthrough of all seven patterns through a diabetes risk prediction scenario.
Conclusion: Offers a design vocabulary for building resilient clinical AI systems, subject to future evaluation in production systems, with three patterns having partial precedent and four being newly formalized.
Abstract: Clinical AI systems routinely train on health data structurally distorted by documentation workflows, billing incentives, and terminology fragmentation. Prior work has characterised the mechanisms of this distortion: the three-forces model of documentary enactment, the reification feedback loop through which AI may amplify coding artefacts, and terminology governance failures that allow semantic drift to accumulate. Yet translating these insights into implementable software architecture remains an open problem. This paper proposes seven ontology-aware design patterns in Gang-of-Four pattern language for building clinical AI pipelines resilient to ontological distortion. The patterns address data ingestion validation (Ontological Checkpoint), low-frequency signal preservation (Dormancy-Aware Pipeline), continuous drift monitoring (Drift Sentinel), parallel representation maintenance (Dual-Ontology Layer), feedback loop interruption (Reification Circuit Breaker), terminology evolution management (Terminology Version Gate), and pluggable regulatory compliance (Regulatory Compliance Adapter). Each pattern is specified with Problem, Forces, Solution, Consequences, Known Uses, and Related Patterns. We illustrate their composition in a reference architecture for a primary care AI system and provide a walkthrough tracing all seven patterns through a diabetes risk prediction scenario. This paper does not report empirical validation; it offers a design vocabulary grounded in theoretical analysis, subject to future evaluation in production systems. Three patterns have partial precedent in existing systems; the remaining four have not been formally described. Limitations include the absence of runtime benchmarks and restriction to the German and EU regulatory context.
[350] ContextBudget: Budget-Aware Context Management for Long-Horizon Search Agents
Yong Wu, YanZhao Zheng, TianZe Xu, ZhenTao Zhang, YuanQiang Yu, JiHuai Zhu, Chao Ma, BinBin Lin, BaoHua Dong, HangCheng Zhu, RuoHui Huang, Gang Yu
Main category: cs.AI
TL;DR: BACM enables LLM agents to manage limited context budgets by learning when and how much to compress interaction history using reinforcement learning.
Details
Motivation: LLM-based agents face context size limitations due to deployment constraints (memory, latency, cost), creating a trade-off between retaining past information and staying within context limits as interaction histories grow.Method: Proposes Budget-Aware Context Management (BACM) which formulates context management as a sequential decision problem with budget constraints. Develops BACM-RL, an end-to-end curriculum-based reinforcement learning approach that learns compression strategies under varying context budgets.
Result: BACM-RL consistently outperforms prior methods across model scales and task complexities, achieving over 1.6× gains over strong baselines in high-complexity settings. Maintains strong advantages as budgets shrink where most methods show performance decline.
Conclusion: BACM provides an effective framework for LLM agents to manage limited context budgets through learned compression strategies, enabling better long-horizon reasoning within practical deployment constraints.
Abstract: LLM-based agents show strong potential for long-horizon reasoning, yet their context size is limited by deployment factors (e.g., memory, latency, and cost), yielding a constrained context budget. As interaction histories grow, this induces a trade-off between retaining past information and staying within the context limit. To address this challenge, we propose Budget-Aware Context Management (BACM), which formulates context management as a sequential decision problem with a context budget constraint. It enables agents to assess the available budget before incorporating new observations and decide when and how much of the interaction history to compress. We further develop BACM-RL, an end-to-end curriculum-based reinforcement learning approach that learns compression strategies under varying context budgets. Experiments on compositional multi-objective QA and long-horizon web browsing benchmarks show that BACM-RL consistently outperforms prior methods across model scales and task complexities, achieving over $1.6\times$ gains over strong baselines in high-complexity settings, while maintaining strong advantages as budgets shrink, where most methods exhibit a downward performance trend.
[351] M3D-BFS: a Multi-stage Dynamic Fusion Strategy for Sample-Adaptive Multi-Modal Brain Network Analysis
Rui Dong, Xiaotong Zhang, Jiaxing Li, Yueying Li, Jiayin Wei, Youyong Kong
Main category: cs.AI
TL;DR: Proposes M3D-BFS, a multi-stage dynamic fusion strategy for sample-adaptive multi-modal brain network analysis using structural and functional connectivity data.
Details
Motivation: Current multi-modal fusion methods for brain networks (SC and FC) are static and use identical computation for all samples, ignoring inherent differences between input samples, which hinders performance improvement.Method: Proposes a multi-stage dynamic fusion strategy with mixture-of-experts (MoEs) for uni- and multi-modal representations that adaptively change based on input samples. Uses 3-stage training: train uni-modal encoders separately, pretrain single experts of MoEs, then finetune whole model with multi-modal disentanglement loss.
Result: Extensive experiments on different real-world datasets demonstrate the superiority of M3D-BFS over existing methods.
Conclusion: This is the first work for dynamic fusion in multi-modal brain network analysis, addressing sample adaptation limitations of static fusion methods.
Abstract: Multi-modal fusion is of great significance in neuroscience which integrates information from different modalities and can achieve better performance than uni-modal methods in downstream tasks. Current multi-modal fusion methods in brain networks, which mainly focus on structural connectivity (SC) and functional connectivity (FC) modalities, are static in nature. They feed different samples into the same model with identical computation, ignoring inherent difference between input samples. This lack of sample adaptation hinders model’s further performance. To this end, we innovatively propose a multi-stage dynamic fusion strategy (M3D-BFS) for sample-adaptive multi-modal brain network analysis. Unlike other static fusion methods, we design different mixture-of-experts (MoEs) for uni- and multi-modal representations where modules can adaptively change as input sample changes during inference. To alleviate issue of MoE where training of experts may be collapsed, we divide our method into 3 stages. We first train uni-modal encoders respectively, then pretrain single experts of MoEs before finally finetuning the whole model. A multi-modal disentanglement loss is designed to enhance the final representations. To the best of our knowledge, this is the first work for dynamic fusion for multi-modal brain network analysis. Extensive experiments on different real-world datasets demonstrates the superiority of M3D-BFS.
[352] Hierarchical Memory Orchestration for Personalized Persistent Agents
Junming Liu, Yifei Sun, Weihua Cheng, Haodong Lei, Yuqi Li, Yirong Chen, Ding Wang
Main category: cs.AI
TL;DR: HMO is a hierarchical memory framework that organizes interaction history into three tiers based on user-centric relevance to improve efficiency and personalization in AI agents.
Details
Motivation: Long-term memory in intelligent agents causes performance bottlenecks when storing extensive interaction data, leading to retrieval noise, computational latency, and overwhelming reasoning capacity on constrained devices.Method: Proposes Hierarchical Memory Orchestration (HMO) with three-tiered directory: compact primary cache (recent/pivotal memories + evolving user profile), high-priority secondary layer, and global archive of full history. User persona dictates memory redistribution across hierarchy.
Result: Achieves state-of-the-art performance on multiple benchmarks. Real-world deployments in ecosystems like OpenClaw demonstrate significant enhancement of agent fluidity and personalization.
Conclusion: HMO enables targeted orchestration of historical knowledge, surfacing relevant information when needed while maintaining lean, efficient active search space for improved agent performance.
Abstract: While long-term memory is essential for intelligent agents to maintain consistent historical awareness, the accumulation of extensive interaction data often leads to performance bottlenecks. Naive storage expansion increases retrieval noise and computational latency, overwhelming the reasoning capacity of models deployed on constrained personal devices. To address this, we propose Hierarchical Memory Orchestration (HMO), a framework that organizes interaction history into a three-tiered directory driven by user-centric contextual relevance. Our system maintains a compact primary cache, coupling recent and pivotal memories with an evolving user profile to ensure agent reasoning remains aligned with individual behavioral traits. This primary cache is complemented by a high-priority secondary layer, both of which are managed within a global archive of the full interaction history. Crucially, the user persona dictates memory redistribution across this hierarchy, promoting records mapped to long-term patterns toward more active tiers while relegating less relevant information. This targeted orchestration surfaces historical knowledge precisely when needed while maintaining a lean and efficient active search space. Evaluations on multiple benchmarks achieve state-of-the-art performance. Real-world deployments in ecosystems like OpenClaw demonstrate that HMO significantly enhances agent fluidity and personalization.
[353] Can Heterogeneous Language Models Be Fused?
Shilian Chen, Jie Zhou, Qin Chen, Wen Wu, Xin Li, Qi Feng, Liang He
Main category: cs.AI
TL;DR: HeteroFusion enables merging of heterogeneous language models from different architectural families by using topology-based alignment and conflict-aware denoising, outperforming existing methods.
Details
Motivation: Current model merging methods assume homogeneous models from the same pretrained backbone, but real-world scenarios involve heterogeneous models from different families (Llama, Qwen, Mistral) with architectural mismatches and parameter misalignment.Method: Two key components: 1) Topology-based alignment that transfers knowledge by matching functional module structures instead of raw tensor coordinates, 2) Conflict-aware denoising that suppresses incompatible or noisy transfer signals during fusion. Preserves target adapter basis while predicting structured updates.
Result: HeteroFusion consistently outperforms strong merging, fusion, and ensemble baselines across heterogeneous transfer, multi-source fusion, noisy-source robustness, and cross-family generalization settings.
Conclusion: HeteroFusion provides an effective solution for merging heterogeneous language models, addressing architectural mismatch and parameter misalignment issues that make direct weight-space fusion ill-posed.
Abstract: Model merging aims to integrate multiple expert models into a single model that inherits their complementary strengths without incurring the inference-time cost of ensembling. Recent progress has shown that merging can be highly effective when all source models are \emph{homogeneous}, i.e., derived from the same pretrained backbone and therefore share aligned parameter coordinates or compatible task vectors. Yet this assumption is increasingly unrealistic in open model ecosystems, where useful experts are often built on different families such as Llama, Qwen, and Mistral. In such \emph{heterogeneous} settings, direct weight-space fusion becomes ill-posed due to architectural mismatch, latent basis misalignment, and amplified cross-source conflict. We address this problem with \texttt{HeteroFusion} for heterogeneous language model fusion, which consists of two key components: topology-based alignment that transfers knowledge across heterogeneous backbones by matching functional module structures instead of raw tensor coordinates, and conflict-aware denoising that suppresses incompatible or noisy transfer signals during fusion. We further provide analytical justification showing that preserving the target adapter basis while predicting structured updates leads to a stable and well-conditioned transfer process. Across heterogeneous transfer, multi-source fusion, noisy-source robustness, and cross-family generalization settings, \texttt{HeteroFusion} consistently outperforms strong merging, fusion, and ensemble baselines.
[354] EvoSkills: Self-Evolving Agent Skills via Co-Evolutionary Verification
Hanrong Zhang, Shicheng Fan, Henry Peng Zou, Yankai Chen, Zhenting Wang, Jiayu Zhou, Chengze Li, Wei-Chieh Huang, Yifei Yao, Kening Zheng, Xue Liu, Xiaoxiao Li, Philip S. Yu
Main category: cs.AI
TL;DR: EvoSkills enables LLM agents to autonomously generate complex multi-file skill packages through iterative refinement and co-evolving surrogate verification, outperforming baselines on SkillsBench.
Details
Motivation: Current skill generation for LLM agents is label-intensive and suffers from human-machine cognitive misalignment, leading to degraded performance. Existing self-evolving methods designed for simple tools cannot handle the complexity of multi-file skill packages.Method: EvoSkills framework couples a Skill Generator that iteratively refines skills with a Surrogate Verifier that co-evolves to provide informative feedback without ground-truth test access, enabling autonomous construction of complex skill packages.
Result: Achieves highest pass rate among five baselines on SkillsBench for both Claude Code and Codex, and shows strong generalization to six additional LLMs.
Conclusion: EvoSkills successfully enables autonomous skill generation for LLM agents, overcoming limitations of manual authoring and existing tool-focused self-evolving methods.
Abstract: Anthropic proposes the concept of skills for LLM agents to tackle multi-step professional tasks that simple tool invocations cannot address. A tool is a single, self-contained function, whereas a skill is a structured bundle of interdependent multi-file artifacts. Currently, skill generation is not only label-intensive due to manual authoring, but also may suffer from human–machine cognitive misalignment, which can lead to degraded agent performance, as evidenced by evaluations on SkillsBench. Therefore, we aim to enable agents to autonomously generate skills. However, existing self-evolving methods designed for tools cannot be directly applied to skills due to their increased complexity. To address these issues, we propose EvoSkills, a self-evolving skills framework that enables agents to autonomously construct complex, multi-file skill packages. Specifically, EvoSkills couples a Skill Generator that iteratively refines skills with a Surrogate Verifier that co-evolves to provide informative and actionable feedback without access to ground-truth test content. On SkillsBench, EvoSkills achieves the highest pass rate among five baselines on both Claude Code and Codex, and also exhibits strong generalization capabilities to six additional LLMs.
[355] Scale over Preference: The Impact of AI-Generated Content on Online Content Ecology
Tianhao Shi, Yang Zhang, Xiaoyan Zhao, Fengbin Zhu, Chenyi Lei, Han Li, Wenwu Ou, Yang Song, Yongdong Zhang, Fuli Feng
Main category: cs.AI
TL;DR: Study analyzes AIGC vs HGC creation/consumption behaviors on Chinese video platform, finding AIGC achieves engagement through high-volume production despite consumer preference for HGC, with algorithms moderating these dynamics.
Details
Motivation: The rapid growth of AI-generated content is reshaping online ecosystems, requiring investigation into its behavioral and distributional impacts on content creation and consumption patterns.Method: Used longitudinal dataset of tens of millions of users from a leading Chinese video-sharing platform to analyze distinct creation and consumption behaviors of AIGC versus human-generated content.
Result: Found “scale-over-preference” dynamic where AIGC creators achieve comparable aggregate engagement to HGC through high-volume production, despite consumer preference for HGC. Algorithms moderate competing interests regarding AIGC distribution.
Conclusion: Advocates for AIGC-sensitive distribution algorithms and precise governance frameworks to ensure long-term health of online content platforms.
Abstract: The rapid proliferation of Artificial Intelligence-Generated Content (AIGC) is fundamentally restructuring online content ecologies, necessitating a rigorous examination of its behavioral and distributional implications. Leveraging a comprehensive longitudinal dataset comprising tens of millions of users from a leading Chinese video-sharing platform, this study elucidated the distinct creation and consumption behaviors characterizing AIGC versus Human-Generated Content (HGC). We identified a prevalent scale-over-preference dynamic, wherein AIGC creators achieve aggregate engagement comparable to HGC creators through high-volume production, despite a marked consumer preference for HGC. Deeper analysis uncovered the ability of the algorithmic content distribution mechanism in moderating these competing interests regarding AIGC. These findings advocated for the implementation of AIGC-sensitive distribution algorithms and precise governance frameworks to ensure the long-term health of the online content platforms.
[356] LiteInception: A Lightweight and Interpretable Deep Learning Framework for General Aviation Fault Diagnosis
Zhihuan Wei, Xinhang Chen, Danyang Han, Yang Hu, Jie Liu, Xuewen Miao, Guijiang Li
Main category: cs.AI
TL;DR: LiteInception: A lightweight interpretable fault diagnosis framework for edge deployment in general aviation, using two-stage cascaded architecture, model compression, knowledge distillation, and dual-layer interpretability.
Details
Motivation: General aviation fault diagnosis and maintenance are critical for flight safety, but deploying deep learning models on resource-constrained edge devices faces challenges in computational capacity and interpretability.Method: Two-stage cascaded architecture (fault detection then classification), multi-method fusion for sensor channel reduction, 1+1 branch LiteInception architecture for parameter compression, knowledge distillation for scenario adaptation, and dual-layer interpretability framework with four attribution methods.
Result: Reduced input sensor channels from 23 to 15, compressed InceptionTime parameters by 70%, accelerated CPU inference by over 8x with <3% F1 loss. Achieved 81.92% fault detection accuracy with 83.24% recall, and 77.00% fault identification accuracy on NGAFID dataset.
Conclusion: The framework achieves favorable balance among efficiency, accuracy, and interpretability for edge deployment in aviation fault diagnosis.
Abstract: General aviation fault diagnosis and efficient maintenance are critical to flight safety; however, deploying deep learning models on resource-constrained edge devices poses dual challenges in computational capacity and interpretability. This paper proposes LiteInception–a lightweight interpretable fault diagnosis framework designed for edge deployment. The framework adopts a two-stage cascaded architecture aligned with standard maintenance workflows: Stage 1 performs high-recall fault detection, and Stage 2 conducts fine-grained fault classification on anomalous samples, thereby decoupling optimization objectives and enabling on-demand allocation of computational resources. For model compression, a multi-method fusion strategy based on mutual information, gradient analysis, and SE attention weights is proposed to reduce the input sensor channels from 23 to 15, and a 1+1 branch LiteInception architecture is introduced that compresses InceptionTime parameters by 70%, accelerates CPU inference by over 8x, with less than 3% F1 loss. Furthermore, knowledge distillation is introduced as a precision-recall regulation mechanism, enabling the same lightweight model to adapt to different scenarios–such as safety-critical and auxiliary diagnosis–by switching training strategies. Finally, a dual-layer interpretability framework integrating four attribution methods is constructed, providing traceable evidence chains of “which sensor x which time period.” Experiments on the NGAFID dataset demonstrate a fault detection accuracy of 81.92% with 83.24% recall, and a fault identification accuracy of 77.00%, validating the framework’s favorable balance among efficiency, accuracy, and interpretability.
[357] The AnIML Ontology: Enabling Semantic Interoperability for Large-Scale Experimental Data in Interconnected Scientific Labs
Wilf Morlidge, Elliott Watkiss-Leek, George Hannah, Harry Rostron, Andrew Ng, Ewan Johnson, Andrew Mitchell, Terry R. Payne, Valentina Tamma, Jacopo de Berardinis
Main category: cs.AI
TL;DR: An OWL 2 ontology formalizing the semantics of Analytical Information Markup Language (AnIML) to address semantic interoperability issues in analytical chemistry/biology data systems.
Details
Motivation: Semantic interoperability across heterogeneous experimental data systems is a major barrier to data-driven scientific discovery. AnIML (XML-based standard) suffers from divergent interpretations that undermine interoperability despite its intended purpose.Method: Developed AnIML Ontology using expert-in-the-loop approach combining LLM-assisted requirement elicitation with collaborative ontology engineering. Validated through multi-layered approach: transforming real-world AnIML files into knowledge graphs, SPARQL competency question verification, and novel validation protocol using adversarial negative competency questions mapped to ontological anti-patterns enforced via SHACL constraints.
Result: Created formal OWL 2 ontology that aligns AnIML with Allotrope Data Format to support cross-system and cross-lab interoperability. Validation demonstrated successful transformation of real-world data and verification of semantic consistency.
Conclusion: The AnIML Ontology provides formal semantic foundation for analytical data interoperability, addressing interpretation inconsistencies through ontological formalization and validation protocols.
Abstract: Achieving semantic interoperability across heterogeneous experimental data systems remains a major barrier to data-driven scientific discovery. The Analytical Information Markup Language (AnIML), a flexible XML-based standard for analytical chemistry and biology, is increasingly used in industrial R&D labs for managing and exchanging experimental data. However, the expressivity of the XML schema permits divergent interpretations across stakeholders, introducing inconsistencies that undermine the interoperability the AnIML schema was designed to support. In this paper, we present the AnIML Ontology, an OWL 2 ontology that formalises the semantics of AnIML and aligns it with the Allotrope Data Format to support future cross-system and cross-lab interoperability. The ontology was developed using an expert-in-the-loop approach combining LLM-assisted requirement elicitation with collaborative ontology engineering. We validate the ontology through a multi-layered approach: data-driven transformation of real-world AnIML files into knowledge graphs, competency question verification via SPARQL, and a novel validation protocol based on adversarial negative competency questions mapped to established ontological anti-patterns and enforced via SHACL constraints.
[358] Solving the Two-dimensional single stock size Cuting Stock Problem with SAT and MaxSAT
Tuyen Van Kieu, Chi Linh Hoang, Khanh Van To
Main category: cs.AI
TL;DR: SAT-based framework for 2D cutting stock problem with demand expansion, sheet assignment variables, and orientation elimination rules, outperforming commercial solvers.
Details
Motivation: The Two-Dimensional Single Stock Size Cutting Stock Problem (2D-CSSP) involves cutting multiple copies of rectangular items from stock sheets to minimize waste, which causes combinatorial explosion. Existing solvers struggle with this complexity.Method: SAT-based framework where item types are expanded by demand, each copy gets a sheet-assignment variable, and non-overlap constraints are activated only for copies assigned to the same sheet. Includes infeasible-orientation elimination rule to fix rotation variables when only one orientation fits. Three approaches compared: non-incremental SAT with binary search, incremental SAT with clause reuse, and weighted partial MaxSAT.
Result: On Cui-Zhao benchmark suite, best SAT configurations certify 2-3 times more instances as provably optimal and achieve lower optimality gaps than OR-Tools, CPLEX and Gurobi. Incremental SAT strongest without rotation, non-incremental SAT more effective with rotation.
Conclusion: SAT-based approaches are highly effective for 2D-CSSP, outperforming state-of-the-art commercial solvers, with different SAT strategies optimal depending on rotation constraints.
Abstract: Cutting rectangular items from stock sheets to satisfy demands while minimizing waste is a central manufacturing task. The Two-Dimensional Single Stock Size Cutting Stock Problem (2D-CSSP) generalizes bin packing by requiring multiple copies of each item type, which causes a strong combinatorial blow-up. We present a SAT-based framework where item types are expanded by demand, each copy has a sheet-assignment variable and non-overlap constraints are activated only for copies assigned to the same sheet. We also introduce an infeasible-orientation elimination rule that fixes rotation variables when only one orientation can fit the sheet. For minimizing the number of sheets, we compare three approaches: non-incremental SAT with binary search, incremental SAT with clause reuse across iterations and weighted partial MaxSAT. On the Cui–Zhao benchmark suite, our best SAT configurations certify two to three times more instances as provably optimal and achieve lower optimality gaps than OR-Tools, CPLEX and Gurobi. The relative ranking among SAT approaches depends on rotation: incremental SAT is strongest without rotation, while non-incremental SAT is more effective when rotation increases formula size.
[359] AeroTherm-GPT: A Verification-Centered LLM Framework for Thermal Protection System Engineering Workflows
Chuhan Qiao, Jinglai Zheng, Jie Huang, Buyue Zhao, Fan Li, Haiming Huang
Main category: cs.AI
TL;DR: AeroTherm-GPT: A specialized LLM agent with Constraint-Closed-Loop Generation framework for hypersonic thermal protection system design, using constraint dependency graphs to iteratively fix violations.
Details
Motivation: General-purpose LLMs fail at generating executable simulation artifacts for safety-critical engineering workflows due to cascading constraint violations when treating generation as single-pass text completion.Method: Constraint-Closed-Loop Generation (CCLG) framework with Constraint Dependency Graph (CDG) that encodes empirical co-resolution structure among constraint categories, directing repair toward upstream fault candidates based on lifecycle ordering priors and empirical probabilities.
Result: Achieves 88.7% End-to-End Success Rate (95% CI: 87.5-89.9), +12.5 percentage points over non-CDG baseline, with Root-Cause Fix Efficiency of 4.16 vs 1.76 for flat-checklist repair.
Conclusion: AeroTherm-GPT demonstrates that specialized LLM agents with constraint-aware iterative repair mechanisms can effectively handle complex, safety-critical engineering workflows where general-purpose LLMs fail.
Abstract: Integrating Large Language Models (LLMs) into hypersonic thermal protection system (TPS) design is bottlenecked by cascading constraint violations when generating executable simulation artifacts. General-purpose LLMs, treating generation as single-pass text completion, fail to satisfy the sequential, multi-gate constraints inherent in safety-critical engineering workflows. To address this, we propose AeroTherm-GPT, the first TPS-specialized LLM Agent, instantiated through a Constraint-Closed-Loop Generation (CCLG) framework. CCLG organizes TPS artifact generation as an iterative workflow comprising generation, validation, CDG-guided repair, execution, and audit. The Constraint Dependency Graph (CDG) encodes empirical co-resolution structure among constraint categories, directing repair toward upstream fault candidates based on lifecycle ordering priors and empirical co-resolution probabilities. This upstream-priority mechanism resolves multiple downstream violations per action, achieving a Root-Cause Fix Efficiency of 4.16 versus 1.76 for flat-checklist repair. Evaluated on HyTPS-Bench and validated against external benchmarks, AeroTherm-GPT achieves 88.7% End-to-End Success Rate (95% CI: 87.5-89.9), a gain of +12.5 pp over the matched non-CDG ablation baseline, without catastrophic forgetting on scientific reasoning and code generation tasks.
[360] Domain-constrained knowledge representation: A modal framework
Chao Li, Yuru Wang, Chunyi Zhao
Main category: cs.AI
TL;DR: DCG framework treats domain as integral to knowledge representation using modal world constraints, enabling domain-scoped truth, inference, and conflict checking.
Details
Motivation: Current knowledge graphs treat domain as metadata rather than integral to assertions, leading to practical failures when concepts shift meaning across domains.Method: Introduces Domain-Contextualized Concept Graph (DCG) with domain written into relations as modal world constraints, using Kripke-style semantics and predicate system.
Result: Enables domain-scoped truth evaluation, inference, and conflict checking; provides Prolog implementation and mappings to RDF/OWL/databases.
Conclusion: Domain should be structural and computable within representation, not external annotation, to address practical failures in knowledge systems.
Abstract: Knowledge graphs store large numbers of relations efficiently, but they remain weak at representing a quieter difficulty: the meaning of a concept often shifts with the domain in which it is used. A triple such as Apple, instance-of, Company may be acceptable in one setting while being misleading or unusable in another. In most current systems, domain information is attached as metadata, qualifiers, or graph-level organization. These mechanisms help with filtering and provenance, but they usually do not alter the formal status of the assertion itself. This paper argues that domain should be treated as part of knowledge representation rather than as supplementary annotation. It introduces the Domain-Contextualized Concept Graph (DCG), a framework in which domain is written into the relation and interpreted as a modal world constraint. In the DCG form (C, R at D, C’), the marker at D identifies the world in which the relation holds. Formally, the relation is interpreted through a domain-indexed necessity operator, so that truth, inference, and conflict checking are all scoped to the relevant world. This move has three consequences: ambiguous concepts can be disambiguated at the point of representation; invalid assertions can be challenged against their domain; cross-domain relations can be connected through explicit predicates. The paper develops this claim through a Kripke-style semantics, a compact predicate system, a Prolog implementation, and mappings to RDF, OWL, and relational databases. The contribution is a representational reinterpretation of domain itself. The central claim is that many practical failures in knowledge systems begin when domain is treated as external to the assertion. DCG addresses that by giving domain a structural and computable role inside the representation.
[361] Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models
Zekai Ye, Qiming Li, Xiaocheng Feng, Ruihan Chen, Ziming Li, Haoyu Ren, Kun Chen, Dandan Tu, Bing Qin
Main category: cs.AI
TL;DR: PGPO introduces token-level credit assignment for multimodal reasoning by quantifying visual dependency via KL divergence, amplifying learning signals for visually-grounded tokens while suppressing linguistic noise.
Details
Motivation: Current RLVR frameworks for LVLMs have a methodological flaw: they distribute identical advantages across all generated tokens, diluting learning signals for critical visually-grounded reasoning steps. This prevents effective optimization of multimodal reasoning.Method: Proposes Token Visual Dependency measured via KL divergence between visual-conditioned and text-only predictive distributions. Introduces Perception-Grounded Policy Optimization (PGPO) with threshold-gated, mass-conserving mechanism that dynamically reshapes advantages at token level, amplifying visually-dependent tokens while suppressing linguistic prior noise.
Result: Extensive experiments on Qwen2.5-VL series across seven challenging multimodal reasoning benchmarks show PGPO boosts models by 18.7% on average. Theoretical and empirical analyses confirm reduced gradient variance, prevention of training collapse, and effective regularization for robust perception-grounded reasoning.
Conclusion: PGPO addresses fundamental limitations in RLVR for LVLMs by introducing fine-grained token-level credit assignment, enabling more effective optimization of visually-grounded multimodal reasoning through dynamic advantage reshaping based on visual dependency.
Abstract: While Reinforcement Learning from Verifiable Rewards (RLVR) has advanced reasoning in Large Vision-Language Models (LVLMs), prevailing frameworks suffer from a foundational methodological flaw: by distributing identical advantages across all generated tokens, these methods inherently dilute the learning signals essential for optimizing the critical, visually-grounded steps of multimodal reasoning. To bridge this gap, we formulate \textit{Token Visual Dependency}, quantifying the causal information gain of visual inputs via the Kullback-Leibler (KL) divergence between visual-conditioned and text-only predictive distributions. Revealing that this dependency is highly sparse and semantically pivotal, we introduce Perception-Grounded Policy Optimization (PGPO), which is a novel fine-grained credit assignment framework that dynamically reshapes advantages at the token level. Through a threshold-gated, mass-conserving mechanism, PGPO actively amplifies learning signals for visually-dependent tokens while suppressing gradient noise from linguistic priors. Extensive experiments based on the Qwen2.5-VL series across seven challenging multimodal reasoning benchmarks demonstrate that PGPO boosts models by 18.7% on average. Both theoretical and empirical analyses confirm that PGPO effectively reduces gradient variance, prevents training collapse, and acts as a potent regularizer for robust, perception-grounded multimodal reasoning. Code will be published on https://github.com/Yzk1114/PGPO.
[362] Retrieval-aligned Tabular Foundation Models Enable Robust Clinical Risk Prediction in Electronic Health Records Under Real-world Constraints
Minh-Khoi Pham, Thang-Long Nguyen Ho, Thao Thi Phuong Dao, Tai Tan Mai, Minh-Triet Tran, Marie E. Ward, Una Geary, Rob Brennan, Nick McDonald, Martin Crane, Marija Bezbradica
Main category: cs.AI
TL;DR: AWARE framework improves clinical EHR prediction by addressing retrieval quality and alignment issues in tabular in-context learning, especially under data heterogeneity and class imbalance.
Details
Motivation: Clinical EHR prediction faces challenges like high dimensionality, heterogeneity, class imbalance, and distribution shift. While tabular in-context learning (TICL) methods work well on generic benchmarks, their performance in clinical settings with these unique challenges remains unclear.Method: The authors present a multi-cohort EHR benchmark comparing classical, deep tabular, and TICL models. They propose AWARE, a task-aligned retrieval framework using supervised embedding learning and lightweight adapters to improve retrieval quality and alignment with inference tasks.
Result: AWARE improves AUPRC by up to 12.2% under extreme class imbalance, with performance gains increasing as data complexity grows. The study shows that PFN-based TICL models are sample-efficient in low-data regimes but degrade with naive distance-based retrieval as heterogeneity and imbalance increase.
Conclusion: Retrieval quality and retrieval-inference alignment are key bottlenecks for deploying tabular in-context learning in clinical prediction. The AWARE framework effectively addresses these challenges, particularly in complex clinical EHR settings with heterogeneity and class imbalance.
Abstract: Clinical prediction from structured electronic health records (EHRs) is challenging due to high dimensionality, heterogeneity, class imbalance, and distribution shift. While tabular in-context learning (TICL) and retrieval-augmented methods perform well on generic benchmarks, their behavior in clinical settings remains unclear. We present a multi-cohort EHR benchmark comparing classical, deep tabular, and TICL models across varying data scale, feature dimensionality, outcome rarity, and cross-cohort generalization. PFN-based TICL models are sample-efficient in low-data regimes but degrade under naive distance-based retrieval as heterogeneity and imbalance increase. We propose AWARE, a task-aligned retrieval framework using supervised embedding learning and lightweight adapters. AWARE improves AUPRC by up to 12.2% under extreme imbalance, with gains increasing with data complexity. Our results identify retrieval quality and retrieval-inference alignment as key bottlenecks for deploying tabular in-context learning in clinical prediction.
[363] Efficient Constraint Generation for Stochastic Shortest Path Problems
Johannes Schmalz, Felipe Trevizan
Main category: cs.AI
TL;DR: CG-iLAO* algorithm uses heuristic-guided constraint generation to avoid expensive actions in Stochastic Shortest Path problems, reducing action evaluations by 60-99% and solving problems 2.8-3.7x faster than state-of-the-art methods.
Details
Motivation: Traditional SSP algorithms waste time evaluating all applicable actions during Bellman backups, even when heuristics indicate some actions are too expensive. This inefficiency motivates a method to selectively avoid expensive actions.Method: Reframes heuristic search as linear programming and introduces efficient constraint generation for SSPs. CG-iLAO* adapts iLAO* with this technique to consider only promising actions based on heuristic estimates.
Result: CG-iLAO* considers only 40% of iLAO*’s actions on many problems (as few as 1% on some), computes 3.5x fewer action cost-to-go evaluations, and solves problems 2.8x faster than iLAO* and 3.7x faster than LRTDP.
Conclusion: Heuristic-guided constraint generation significantly improves SSP solving efficiency by avoiding expensive action evaluations, enabling faster problem solving with fewer computational resources.
Abstract: Stochastic Shortest Path problems (SSPs) are traditionally solved by computing each state’s cost-to-go by applying Bellman backups. A Bellman backup updates a state’s cost-to-go by iterating through every applicable action, computing the cost-to-go after applying each one, and selecting a minimal action’s cost-to-go. State-of-the-art algorithms use heuristic functions; these give an initial estimate of costs-to-go, and lets the algorithm apply Bellman backups only to promising states, determined by low estimated costs-to-go. However, each Bellman backup still considers all applicable actions, even if the heuristic tells us that some of these actions are too expensive, with the effect that such algorithms waste time on unhelpful actions. To address this gap we present a technique that uses the heuristic to avoid expensive actions, by reframing heuristic search in terms of linear programming and introducing an efficient implementation of constraint generation for SSPs. We present CG-iLAO*, a new algorithm that adapts iLAO* with our novel technique, and considers only 40% of iLAO*’s actions on many problems, and as few as 1% on some. Consequently, CG-iLAO* computes on average 3.5x fewer costs-to-go for actions than the state-of-the-art iLAO* and LRTDP, enabling it to solve problems faster an average of 2.8x and 3.7x faster, respectively.
[364] Bayesian Elicitation with LLMs: Model Size Helps, Extra “Reasoning” Doesn’t Always
Luka Hobor, Mario Brcic, Mihael Kovac, Kristijan Poje
Main category: cs.AI
TL;DR: LLMs can estimate population statistics with uncertainty but are severely overconfident; larger models are more accurate but reasoning effort doesn’t help; conformal prediction can correct overconfidence.
Details
Motivation: To test whether LLMs can serve as alternatives to human experts for Bayesian elicitation - estimating unknown quantities with associated uncertainty - particularly for population statistics like health prevalence rates and labor market figures.Method: Asked 11 LLMs to estimate population statistics and express uncertainty as 95% credible intervals; varied reasoning effort (low, medium, high); used conformal prediction for statistical recalibration; conducted preliminary experiment with web search access.
Result: Larger models produce more accurate estimates but reasoning effort provides no consistent benefit; all models severely overconfident (95% intervals contain true value only 9-44% of time); conformal prediction can correct overconfidence; web search degraded predictions for accurate models but improved weaker ones; models struggled with specialized health data.
Conclusion: LLM uncertainty estimates require statistical correction before use in decision-making; while LLMs show promise for Bayesian elicitation, their overconfidence must be addressed through methods like conformal prediction.
Abstract: Large language models (LLMs) have been proposed as alternatives to human experts for estimating unknown quantities with associated uncertainty, a process known as Bayesian elicitation. We test this by asking eleven LLMs to estimate population statistics, such as health prevalence rates, personality trait distributions, and labor market figures, and to express their uncertainty as 95% credible intervals. We vary each model’s reasoning effort (low, medium, high) to test whether more “thinking” improves results. Our findings reveal three key results. First, larger, more capable models produce more accurate estimates, but increasing reasoning effort provides no consistent benefit. Second, all models are severely overconfident: their 95% intervals contain the true value only 9–44% of the time, far below the expected 95%. Third, a statistical recalibration technique called conformal prediction can correct this overconfidence, expanding the intervals to achieve the intended coverage. In a preliminary experiment, giving models web search access degraded predictions for already-accurate models, while modestly improving predictions for weaker ones. Models performed well on commonly discussed topics but struggled with specialized health data. These results indicate that LLM uncertainty estimates require statistical correction before they can be used in decision-making.
[365] BraiNCA: brain-inspired neural cellular automata and applications to morphogenesis and motor control
Léo Pio-Lopez, Benedikt Hartl, Michael Levin
Main category: cs.AI
TL;DR: BraiNCA introduces a brain-inspired Neural Cellular Automata with attention layers, long-range connections, and complex topologies, outperforming traditional grid-based NCAs in robustness and learning speed.
Details
Motivation: Traditional Neural Cellular Automatas (NCAs) are limited to regular grids with Moore neighborhoods (one-hop neighbors), ignoring long-range connections and complex topologies found in biological brains. This paper aims to create brain-inspired NCAs that better reflect real neural connectivity patterns.Method: BraiNCA incorporates attention layers for dynamic message selection, explicit long-range connections, and complex network topologies while preserving the decentralized local update principle of traditional NCAs.
Result: BraiNCA demonstrates better robustness and faster learning on two tasks compared to Vanilla NCAs, showing that attention-based message selection with long-range edges yields more sample-efficient and damage-tolerant self-organization.
Conclusion: The choice of interaction topology and dynamic information routing significantly impacts NCA robustness and learning speed for tasks requiring distributed coordination over extended spatial/temporal scales. BraiNCA provides a brain-inspired formulation for studying collective computation under biologically-realistic network structures.
Abstract: Most of the Neural Cellular Automata (NCAs) defined in the literature have a common theme: they are based on regular grids with a Moore neighborhood (one-hop neighbour). They do not take into account long-range connections and more complex topologies as we can find in the brain. In this paper, we introduce BraiNCA, a brain-inspired NCA with an attention layer, long-range connections and complex topology. BraiNCAs shows better results in terms of robustness and speed of learning on the two tasks compared to Vanilla NCAs establishing that incorporating attention-based message selection together with explicit long-range edges can yield more sample-efficient and damage-tolerant self-organization than purely local, grid-based update rules. These results support the hypothesis that, for tasks requiring distributed coordination over extended spatial and temporal scales, the choice of interaction topology and the ability to dynamically route information will impact the robustness and speed of learning of an NCA. More broadly, BraiNCA provides brain-inspired NCA formulation that preserves the decentralized local update principle while better reflecting non-local connectivity patterns, making it a promising substrate for studying collective computation under biologically-realistic network structure and evolving cognitive substrates.
[366] Probabilistic classification from possibilistic data: computing Kullback-Leibler projection with a possibility distribution
Ismaïl Baaj, Pierre Marquis
Main category: cs.AI
TL;DR: A method for multi-class classification with possibilistic supervision using Kullback-Leibler projections onto convex sets of probability distributions that preserve qualitative structure of possibility distributions.
Details
Motivation: To handle uncertain or graded supervision in classification tasks where training instances have possibility distributions (expressing plausibility over classes) rather than hard labels, requiring methods that preserve the qualitative structure of these distributions during learning.Method: Constructs a convex set of admissible probability distributions from possibility distributions using two constraints: probabilistic compatibility with possibility/necessity measures, and linear shape constraints preserving qualitative structure. Uses Dykstra’s algorithm with Bregman projections (negative entropy) to compute KL projections onto this set, then trains models by minimizing divergence between predictions and their projections.
Result: The projection algorithm is efficient for practical use, and projection-based learning improves predictive performance on synthetic data and real-world natural language inference tasks using the ChaosNLI dataset.
Conclusion: The proposed approach effectively handles possibilistic supervision by enforcing qualitative constraints through KL projections, offering practical benefits for classification with uncertain labels.
Abstract: We consider learning with possibilistic supervision for multi-class classification. For each training instance, the supervision is a normalized possibility distribution that expresses graded plausibility over the classes. From this possibility distribution, we construct a non-empty closed convex set of admissible probability distributions by combining two requirements: probabilistic compatibility with the possibility and necessity measures induced by the possibility distribution, and linear shape constraints that must be satisfied to preserve the qualitative structure of the possibility distribution. Thus, classes with the same possibility degree receive equal probabilities, and if a class has a strictly larger possibility degree than another class, then it receives a strictly larger probability. Given a strictly positive probability vector output by a model for an instance, we compute its Kullback-Leibler projection onto the admissible set. This projection yields the closest admissible probability distribution in Kullback-Leibler sense. We can then train the model by minimizing the divergence between the prediction and its projection, which quantifies the smallest adjustment needed to satisfy the induced dominance and shape constraints. The projection is computed with Dykstra’s algorithm using Bregman projections associated with the negative entropy, and we provide explicit formulas for the projections onto each constraint set. Experiments conducted on synthetic data and on a real-world natural language inference task, based on the ChaosNLI dataset, show that the proposed projection algorithm is efficient enough for practical use, and that the resulting projection-based learning objective can improve predictive performance.
[367] Qiana: A First-Order Formalism to Quantify over Contexts and Formulas with Temporality
Simon Coumes, Pierre-Henri Paris, François Schwarzentruber, Fabian Suchanek
Main category: cs.AI
TL;DR: Qiana is a logic framework for reasoning about formulas that are true only in specific contexts, supporting quantification over both formulas and contexts, paraconsistent logics within contexts, and compatibility with first-order logic theorem provers.
Details
Motivation: To create a flexible logic framework that can handle context-dependent truth, allowing quantification over both formulas and contexts, supporting paraconsistent reasoning within contexts, and maintaining compatibility with existing first-order logic tools.Method: Develops Qiana as a first-order logic-based framework that is finitely axiomatizable, enabling quantification over formulas and contexts, supporting paraconsistent logics within contexts, and demonstrating applications in temporality, event calculus, and modal logic representation.
Result: Qiana provides a framework for context-dependent reasoning that can express complex statements like “everyone knows everything Alice says,” supports contradictions within contexts, and can represent various logical systems while remaining compatible with first-order theorem provers.
Conclusion: Qiana offers a versatile logic framework for context-dependent reasoning with practical applications in representing temporal logic, event calculus, and modal logic, while maintaining compatibility with existing first-order logic tools.
Abstract: We introduce Qiana, a logic framework for reasoning on formulas that are true only in specific contexts. In Qiana, it is possible to quantify over both formulas and contexts to express, e.g., that ``everyone knows everything Alice says’’. Qiana also permits paraconsistent logics within contexts, so that contexts can contain contradictions. Furthermore, Qiana is based on first-order logic, and is finitely axiomatizable, so that Qiana theories are compatible with pre-existing first-order logic theorem provers. We show how Qiana can be used to represent temporality, event calculus, and modal logic. We also discuss different design alternatives of Qiana.
[368] Abnormal Head Movements in Neurological Conditions: A Knowledge-Based Dataset with Application to Cervical Dystonia
Saja Al-Dabet, Sherzod Turaev, Nazar Zaki
Main category: cs.AI
TL;DR: NeuroPose-AHM is a knowledge-based dataset of abnormal head movements from neurological disorders, created using multi-LLM extraction from 1,430 publications, containing 2,756 patient-group records across 57 conditions, with analytical framework demonstrated on cervical dystonia.
Details
Motivation: There's a lack of multi-condition resources integrating kinematic measurements, clinical severity scores, and patient demographics for AI-driven diagnostic tools for abnormal head movements in neurological disorders.Method: Created NeuroPose-AHM dataset using multi-LLM extraction framework applied to 1,430 peer-reviewed publications, containing 2,756 patient-group-level records across 57 neurological conditions. Applied four-task framework to cervical dystonia: multi-label AHM classification, Head-Neck Severity Index construction, validation against real-world data, and bridge analysis between movement types and severity scores.
Result: Strong inter-LLM reliability (kappa = 0.822), multi-label classification F1 = 0.856, constructed unified HNSI metric, validated against real-world CD patient data with aligned severe-band proportions (6.7%), and found significant correlations between movement-type probabilities and HNSI scores (p < 0.001).
Conclusion: NeuroPose-AHM provides a structured, knowledge-based resource for neurological AHM research with demonstrated analytical utility, publicly available on Zenodo.
Abstract: Abnormal head movements (AHMs) manifest across a broad spectrum of neurological disorders; however, the absence of a multi-condition resource integrating kinematic measurements, clinical severity scores, and patient demographics constitutes a persistent barrier to the development of AI-driven diagnostic tools. To address this gap, this study introduces NeuroPose-AHM, a knowledge-based dataset of neurologically induced AHMs constructed through a multi-LLM extraction framework applied to 1,430 peer-reviewed publications. The dataset contains 2,756 patient-group-level records spanning 57 neurological conditions, derived from 846 AHM-relevant papers. Inter-LLM reliability analysis confirms robust extraction performance, with study-level classification achieving strong agreement (kappa = 0.822). To demonstrate the dataset’s analytical utility, a four-task framework is applied to cervical dystonia (CD), the condition most directly defined by pathological head movement. First, Task 1 performs multi-label AHM type classification (F1 = 0.856). Task 2 constructs the Head-Neck Severity Index (HNSI), a unified metric that normalizes heterogeneous clinical rating scales. The clinical relevance of this index is then evaluated in Task 3, where HNSI is validated against real-world CD patient data, with aligned severe-band proportions (6.7%) providing a preliminary plausibility indication for index calibration within the high severity range. Finally, Task 4 performs bridge analysis between movement-type probabilities and HNSI scores, producing significant correlations (p less than 0.001). These results demonstrate the analytical utility of NeuroPose-AHM as a structured, knowledge-based resource for neurological AHM research. The NeuroPose-AHM dataset is publicly available on Zenodo (https://doi.org/10.5281/zenodo.19386862).
[369] SenseMath: Do LLMs Have Number Sense? Evaluating Shortcut Use, Judgment, and Generation
Haomin Zhuang, Xiangqi Wang, Yili Shen, Ying Cheng, Xiangliang Zhang
Main category: cs.AI
TL;DR: LLMs show procedural shortcut fluency for numerical problems but lack structural understanding of when shortcuts apply, revealing a gap between capability and spontaneous use.
Details
Motivation: To investigate whether LLMs exhibit human-like number sense - the ability to recognize numerical structure, apply appropriate shortcuts, and avoid them when misleading.Method: Created SenseMath benchmark with 4,800 items across 8 shortcut categories and 4 digit scales, with matched variants. Evaluated 5 LLMs across three settings: Shortcut Use, Applicability Judgment, and Problem Generation.
Result: Models show procedural shortcut fluency (up to 15% accuracy gains when prompted) but underuse shortcuts spontaneously (<40% cases). They over-generalize shortcuts and fail to generate valid shortcut-bearing problems, showing lack of structural understanding.
Conclusion: Current LLMs lack the structural understanding of when and why shortcuts work that underlies human number sense, despite having procedural shortcut capabilities.
Abstract: Large language models often default to step-by-step computation even when efficient numerical shortcuts are available. This raises a basic question: do they exhibit number sense in a human-like behavioral sense, i.e., the ability to recognize numerical structure, apply shortcuts when appropriate, and avoid them when they are not? We introduce SenseMath, a controlled benchmark for evaluating structure-sensitive numerical reasoning in LLMs. SenseMath contains 4,800 items spanning eight shortcut categories and four digit scales, with matched strong-shortcut, weak-shortcut, and control variants. It supports three evaluation settings of increasing cognitive demand: Shortcut Use (whether models can apply shortcuts on shortcut-amenable problems); Applicability Judgment (whether they can recognize when a shortcut is appropriate or misleading); and Problem Generation (whether they can generate new problem items that correctly admit a given type of shortcut). Our evaluation across five LLMs, ranging from GPT-4o-mini to Llama-3.1-8B, shows a consistent pattern: when explicitly prompted, models readily adopt shortcut strategies and achieve substantial accuracy gains on shortcut-amenable items (up to 15%), yet under standard chain-of-thought prompting they spontaneously employ such strategies in fewer than 40% of cases, even when they demonstrably possess the requisite capability. Moreover, this competence is confined to the Use level; models systematically over-generalise shortcuts to problems where they do not apply, and fail to generate valid shortcut-bearing problems from scratch. Together, these results suggest that current LLMs exhibit procedural shortcut fluency without the structural understanding of when and why shortcuts work that underlies human number sense.
[370] GenGait: A Transformer-Based Model for Human Gait Anomaly Detection and Normative Twin Generation
Elisa Motta, Marta Lorenzini, Clara Mouawad, Alberto Ranavolo, Mariano Serrao, Arash Ajoudani
Main category: cs.AI
TL;DR: Transformer-based masked autoencoder for label-free anomaly detection and kinematic correction in gait analysis, trained only on normative data and validated on simulated abnormalities.
Details
Motivation: Current deep learning approaches for gait analysis rely on supervised classifiers with disease labels, limiting generalization to heterogeneous pathological presentations. There's a need for label-free methods that can detect and correct gait abnormalities without requiring disease-specific training data.Method: Proposes a Transformer masked autoencoder trained exclusively on normative gait sequences from 150 adults using markerless multi-camera motion capture. At inference, uses a two-pass procedure: 1) estimates joint inconsistency scores by occluding individual joints and measuring deviations from learned normative prior, 2) withholds flagged joints from encoder input and reconstructs full skeleton from remaining spatiotemporal context to yield corrected kinematic trajectories.
Result: Validation on 10 held-out normative participants mimicking seven simulated gait abnormalities showed: accurate localization of biomechanically inconsistent joints, significant reduction in angular deviation across all analyzed joints with large effect sizes, and preservation of normative kinematics.
Conclusion: The approach enables interpretable, subject-specific localization of gait impairments without requiring disease labels, offering a promising framework for clinical gait analysis applications.
Abstract: Gait analysis provides an objective characterization of locomotor function and is widely used to support diagnosis and rehabilitation monitoring across neurological and orthopedic disorders. Deep learning has been increasingly applied to this domain, yet most approaches rely on supervised classifiers trained on disease-labeled data, limiting generalization to heterogeneous pathological presentations. This work proposes a label-free framework for joint-level anomaly detection and kinematic correction based on a Transformer masked autoencoder trained exclusively on normative gait sequences from 150 adults, acquired with a markerless multi-camera motion-capture system. At inference, a two-pass procedure is applied to potentially pathological input sequences, first it estimates joint inconsistency scores by occluding individual joints and measuring deviations from the learned normative prior. Then, it withholds the flagged joints from the encoder input and reconstructs the full skeleton from the remaining spatiotemporal context, yielding corrected kinematic trajectories at the flagged positions. Validation on 10 held-out normative participants, who mimicked seven simulated gait abnormalities, showed accurate localization of biomechanically inconsistent joints, a significant reduction in angular deviation across all analyzed joints with large effect sizes, and preservation of normative kinematics. The proposed approach enables interpretable, subject-specific localization of gait impairments without requiring disease labels. Video is available at https://youtu.be/Rcm3jqR5pN4.
[371] How and why does deep ensemble coupled with transfer learning increase performance in bipolar disorder and schizophrenia classification?
Sara Petiton, Antoine Grigis, Benoit Dufumier, Edouard Duchesnay
Main category: cs.AI
TL;DR: Transfer learning and deep ensembles improve psychiatric disorder classification by reducing model variability and improving training stability, with ensembles plateauing at 10 models and pre-training constraining models to similar loss basins.
Details
Motivation: While transfer learning and deep ensemble learning have shown superior performance in classifying psychiatric disorders like bipolar disorder and schizophrenia compared to simple machine learning, there's limited understanding of why these methods work better. The paper aims to understand how and why these techniques reduce variability in single-subject classification models.Method: The study investigated training stability of transfer learning and deep ensemble models by comparing results from multiple trainings with the same backbone architecture but different initializations. This approach accounts for epistemic uncertainty related to model parameter estimation. The research examined: 1) how many models are needed for ensemble performance improvement, and 2) how transfer learning induces better generalization with and without ensembles.
Result: Deep ensembles reached a performance plateau when 10 models were included in the ensemble. Transfer learning with pre-trained models constrained models to stay in the same basin of the loss function, unlike models with randomly initialized weights. This explains the improved generalization and reduced variability observed with these techniques.
Conclusion: Transfer learning and deep ensemble learning improve psychiatric disorder classification by reducing model variability through different mechanisms: ensembles benefit from multiple models (optimal at 10), while transfer learning constrains models to similar loss basins, leading to better generalization and more stable training.
Abstract: Transfer learning (TL) and deep ensemble learning (DE) have recently been shown to outperform simple machine learning in classifying psychiatric disorders. However, there is still a lack of understanding as to why that is. This paper aims to understand how and why DE and TL reduce the variability of single-subject classification models in bipolar disorder (BD) and schizophrenia (SCZ). To this end, we investigated the training stability of TL and DE models. For the two classification tasks under consideration, we compared the results of multiple trainings with the same backbone but with different initializations. In this way, we take into account the epistemic uncertainty associated with the uncertainty in the estimation of the model parameters. It has been shown that the performance of classifiers can be significantly improved by using TL with DE. Based on these results, we investigate i) how many models are needed to benefit from the performance improvement of DE when classifying BD and SCZ from healthy controls, and ii) how TL induces better generalization, with and without DE. In the first case, we show that DE reaches a plateau when 10 models are included in the ensemble. In the second case, we find that using a pre-trained model constrains TL models with the same pre-training to stay in the same basin of the loss function. This is not the case for DL models with randomly initialized weights.
[372] ProCeedRL: Process Critic with Exploratory Demonstration Reinforcement Learning for LLM Agentic Reasoning
Jingyue Gao, Yanjiang Guo, Xiaoshuai Chen, Jianyu Chen
Main category: cs.AI
TL;DR: ProCeedRL: A reinforcement learning approach for multi-turn agentic tasks that uses a process-level critic and reflection-based demonstrations to prevent cumulative error feedback loops in exploration.
Details
Motivation: RL enhances LLM reasoning but struggles with multi-turn agentic tasks due to long-horizon interactions and environmental stochasticity. Standard exploration strategies fail due to cumulative error feedback loops where suboptimal actions create misleading contexts that worsen subsequent decisions.Method: ProCeedRL (Process Critic with Explorative Demonstration RL) shifts exploration from passive selection to active intervention. It uses a process-level critic to monitor interactions in real time and incorporates reflection-based demonstrations to guide agents in stopping error accumulation.
Result: The approach significantly exceeds the model’s saturated exploration performance, demonstrates substantial exploratory benefits, improves exploration efficiency, and achieves superior performance on complex deep search and embodied tasks.
Conclusion: ProCeedRL effectively addresses the structural failure mode in agentic exploration by preventing cumulative error feedback loops through active intervention and process-level monitoring, leading to better performance in complex multi-turn tasks.
Abstract: Reinforcement Learning (RL) significantly enhances the reasoning abilities of large language models (LLMs), yet applying it to multi-turn agentic tasks remains challenging due to the long-horizon nature of interactions and the stochasticity of environmental feedback. We identify a structural failure mode in agentic exploration: suboptimal actions elicit noisy observations into misleading contexts, which further weaken subsequent decision-making, making recovery increasingly difficult. This cumulative feedback loop of errors renders standard exploration strategies ineffective and susceptible to the model’s reasoning and the environment’s randomness. To mitigate this issue, we propose ProCeedRL: Process Critic with Explorative Demonstration RL, shifting exploration from passive selection to active intervention. ProCeedRL employs a process-level critic to monitor interactions in real time, incorporating reflection-based demonstrations to guide agents in stopping the accumulation of errors. We find that this approach significantly exceeds the model’s saturated exploration performance, demonstrating substantial exploratory benefits. By learning from exploratory demonstrations and on-policy samples, ProCeedRL significantly improves exploration efficiency and achieves superior performance on complex deep search and embodied tasks.
[373] ATBench: A Diverse and Realistic Trajectory Benchmark for Long-Horizon Agent Safety
Yu Li, Haoyu Luo, Yuejin Xie, Yuqian Fu, Zhonghao Yang, Shuai Shao, Qihan Ren, Wanying Qu, Yanwei Fu, Yujiu Yang, Jing Shao, Xia Hu, Dongrui Liu
Main category: cs.AI
TL;DR: ATBench is a trajectory-level benchmark for evaluating LLM-based agent safety across multi-step interactions, featuring structured risk taxonomy, diverse tool pools, and realistic long-horizon risk emergence scenarios.
Details
Motivation: Existing safety benchmarks for LLM agents are limited by insufficient interaction diversity, coarse failure observability, and weak long-horizon realism. Risks in real deployments often emerge over multi-step interactions rather than isolated prompts or final responses.Method: Organizes agentic risk along three dimensions: risk source, failure mode, and real-world harm. Constructs trajectories with heterogeneous tool pools and long-context delayed-trigger protocols to capture realistic risk emergence across multiple stages.
Result: Benchmark contains 1,000 trajectories (503 safe, 497 unsafe) averaging 9.01 turns and 3.95k tokens, with 1,954 invoked tools from pools spanning 2,084 available tools. Experiments show ATBench is challenging even for strong evaluators.
Conclusion: ATBench enables structured, diverse, and realistic evaluation of agent safety, supporting taxonomy-stratified analysis, cross-benchmark comparison, and diagnosis of long-horizon failure patterns in LLM-based agents.
Abstract: Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than isolated prompts or final responses. Existing trajectory-level benchmarks remain limited by insufficient interaction diversity, coarse observability of safety failures, and weak long-horizon realism. We introduce ATBench, a trajectory-level benchmark for structured, diverse, and realistic evaluation of agent safety. ATBench organizes agentic risk along three dimensions: risk source, failure mode, and real-world harm. Based on this taxonomy, we construct trajectories with heterogeneous tool pools and a long-context delayed-trigger protocol that captures realistic risk emergence across multiple stages. The benchmark contains 1,000 trajectories (503 safe and 497 unsafe), averaging 9.01 turns and 3.95k tokens, with 1,954 invoked tools drawn from pools spanning 2,084 available tools. Data quality is supported by rule-based and LLM-based filtering plus full human audit. Experiments on frontier LLMs, open-source models, and specialized guard systems show that ATBench is challenging even for strong evaluators, while enabling taxonomy-stratified analysis, cross-benchmark comparison, and diagnosis of long-horizon failure patterns.
[374] The Latent Space: Foundation, Evolution, Mechanism, Ability, and Outlook
Xinlei Yu, Zhangquan Chen, Yongbo He, Tianyu Fu, Cheng Yang, Chengming Xu, Yue Ma, Xiaobin Hu, Zhe Cao, Jie Xu, Guibin Zhang, Jiale Tao, Jiayi Zhang, Siyuan Ma, Kaituo Feng, Haojie Huang, Youxing Li, Ronghao Chen, Huacan Wang, Chenglin Wu, Zikun Su, Xiaogang Xu, Kelu Yao, Kun Wang, Chen Gao, Yue Liao, Ruqi Huang, Tao Jin, Cheng Tan, Jiangning Zhang, Wenqi Ren, Yanwei Fu, Yong Liu, Yu Wang, Xiangyu Yue, Yu-Gang Jiang, Shuicheng Yan
Main category: cs.AI
TL;DR: Survey paper examining latent space as a computational substrate for language models, covering foundations, evolution, mechanisms, abilities, and future directions.
Details
Motivation: To provide a unified landscape of latent space in language-based models, addressing the shift from explicit token-level generation to continuous latent space computation due to limitations of explicit-space approaches.Method: Organizes survey into five perspectives: Foundation (scope definition), Evolution (historical development), Mechanism (architecture, representation, computation, optimization), Ability (reasoning, planning, modeling, perception, memory, collaboration, embodiment), and Outlook (future directions).
Result: Comprehensive survey consolidating existing work on latent space in language models, identifying key technical developments and capability spectrum supported by latent space approaches.
Conclusion: Latent space represents a general computational paradigm for next-generation intelligence, with the survey serving as both reference for existing work and foundation for future research in this emerging area.
Abstract: Latent space is rapidly emerging as a native substrate for language-based models. While modern systems are still commonly understood through explicit token-level generation, an increasing body of work shows that many critical internal processes are more naturally carried out in continuous latent space than in human-readable verbal traces. This shift is driven by the structural limitations of explicit-space computation, including linguistic redundancy, discretization bottlenecks, sequential inefficiency, and semantic loss. This survey aims to provide a unified and up-to-date landscape of latent space in language-based models. We organize the survey into five sequential perspectives: Foundation, Evolution, Mechanism, Ability, and Outlook. We begin by delineating the scope of latent space, distinguishing it from explicit or verbal space and from the latent spaces commonly studied in generative visual models. We then trace the field’s evolution from early exploratory efforts to the current large-scale expansion. To organize the technical landscape, we examine existing work through the complementary lenses of mechanism and ability. From the perspective of Mechanism, we identify four major lines of development: Architecture, Representation, Computation, and Optimization. From the perspective of Ability, we show how latent space supports a broad capability spectrum spanning Reasoning, Planning, Modeling, Perception, Memory, Collaboration, and Embodiment. Beyond consolidation, we discuss the key open challenges, and outline promising directions for future research. We hope this survey serves not only as a reference for existing work, but also as a foundation for understanding latent space as a general computational and systems paradigm for next-generation intelligence.
[375] AI in Insurance: Adaptive Questionnaires for Improved Risk Profiling
Diogo Silva, João Teixeira, Bruno Lima
Main category: cs.AI
TL;DR: ARQuest uses LLMs and alternative data sources to create personalized, adaptive insurance questionnaires that reduce question count while maintaining user satisfaction, though traditional methods still have slightly higher risk assessment accuracy.
Details
Motivation: Traditional insurance questionnaires are lengthy, standardized, and fail to capture individual differences, while also relying on blind trust of user responses which increases fraud risk.Method: Uses LLMs with alternative data sources (social media images, geographic data) and Retrieval Augmented Generation (RAG) to extract user insights and generate adaptive, personalized follow-up questions.
Result: Adaptive questionnaires powered by GPT models required fewer questions and were preferred by users for fluid experience, though traditional questionnaires had slightly higher accuracy in risk assessment.
Conclusion: ARQuest shows potential to improve user satisfaction and streamline insurance processes, with further development possibly exceeding traditional methods in risk accuracy.
Abstract: Insurance application processes often rely on lengthy and standardized questionnaires that struggle to capture individual differences. Moreover, insurers must blindly trust users’ responses, increasing the chances of fraud. The ARQuest framework introduces a new approach to underwriting by using Large Language Models (LLMs) and alternative data sources to create personalized and adaptive questionnaires. Techniques such as social media image analysis, geographic data categorization, and Retrieval Augmented Generation (RAG) are used to extract meaningful user insights and guide targeted follow-up questions. A life insurance system integrated into an industry partner mobile app was tested in two experiments. While traditional questionnaires yielded slightly higher accuracy in risk assessment, adaptive versions powered by GPT models required fewer questions and were preferred by users for their more fluid and engaging experience. ARQuest shows great potential to improve user satisfaction and streamline insurance processes. With further development, this approach may exceed traditional methods regarding risk accuracy and help drive innovation in the insurance industry.
[376] Diff-KD: Diffusion-based Knowledge Distillation for Collaborative Perception under Corruptions
Pengcheng Lyu, Chaokun Zhang, Gong Chen, Tao Tang, Zhaoxiang Luo
Main category: cs.AI
TL;DR: Diff-KD: A diffusion-based knowledge distillation framework for robust multi-agent collaborative perception that actively recovers clean semantics from corrupted sensor data through progressive refinement and adaptive fusion.
Details
Motivation: Real-world sensor and communication corruptions severely undermine the advantages of multi-agent collaborative perception. Existing approaches treat corruptions as static perturbations or passively conform to corrupted inputs, failing to actively recover the underlying clean semantics.Method: Diff-KD integrates diffusion-based generative refinement into teacher-student knowledge distillation with two core components: (1) Progressive Knowledge Distillation (PKD) treats local feature restoration as a conditional diffusion process to recover global semantics from corrupted observations; (2) Adaptive Gated Fusion (AGF) dynamically weights neighbors based on ego reliability during fusion.
Result: Evaluated on OPV2V and DAIR-V2X datasets under seven corruption types, Diff-KD achieves state-of-the-art performance in both detection accuracy and calibration robustness.
Conclusion: The framework successfully addresses the limitation of existing approaches by actively recovering clean semantics through diffusion-based refinement, making collaborative perception more robust to real-world corruptions.
Abstract: Multi-agent collaborative perception enables autonomous systems to overcome individual sensing limits through collective intelligence. However, real-world sensor and communication corruptions severely undermine this advantage. Crucially, existing approaches treat corruptions as static perturbations or passively conform to corrupted inputs, failing to actively recover the underlying clean semantics. To address this limitation, we introduce Diff-KD, a framework that integrates diffusion-based generative refinement into teacher-student knowledge distillation for robust collaborative perception. Diff-KD features two core components: (i) Progressive Knowledge Distillation (PKD), which treats local feature restoration as a conditional diffusion process to recover global semantics from corrupted observations; and (ii) Adaptive Gated Fusion (AGF), which dynamically weights neighbors based on ego reliability during fusion. Evaluated on OPV2V and DAIR-V2X under seven corruption types, Diff-KD achieves state-of-the-art performance in both detection accuracy and calibration robustness.
[377] LLM-as-a-Judge for Time Series Explanations
Preetham Sivalingam, Murari Mandal, Saurabh Deshpande, Dhruv Kumar
Main category: cs.AI
TL;DR: LLMs can serve as reliable evaluators of factual correctness for time series explanations without ground truth references, showing better evaluation performance than generation capabilities.
Details
Motivation: There's a lack of methods to verify if LLM-generated explanations of time series data are factually correct without ground truth references or task-specific rules.Method: Use LLMs as both generators and evaluators in a reference-free setting, creating a synthetic benchmark of 350 time series cases across 7 query types with correct/incorrect explanations, and evaluating across 4 tasks.
Result: Asymmetry found: generation is pattern-dependent with systematic failures (0.00-0.12 accuracy for some query types), while evaluation is stable with models correctly ranking/scoring explanations even when their own outputs are incorrect.
Conclusion: LLM-based evaluation for time series explanations is feasible and shows potential as reliable evaluators of data-grounded reasoning in time series domain.
Abstract: Evaluating factual correctness of LLM generated natural language explanations grounded in time series data remains an open challenge. Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference based similarity metrics and consistency checking models require ground truth explanations, while traditional time series methods operate purely on numerical values and cannot assess free form textual reasoning. Thus, no general purpose method exists to directly verify whether an explanation is faithful to underlying time series data without predefined references or task specific rules. We study large language models as both generators and evaluators of time series explanations in a reference free setting, where given a time series, question, and candidate explanation, the evaluator assigns a ternary correctness label based on pattern identification, numeric accuracy, and answer faithfulness, enabling principled scoring and comparison. To support this, we construct a synthetic benchmark of 350 time series cases across seven query types, each paired with correct, partially correct, and incorrect explanations. We evaluate models across four tasks: explanation generation, relative ranking, independent scoring, and multi anomaly detection. Results show a clear asymmetry: generation is highly pattern dependent and exhibits systematic failures on certain query types, with accuracies ranging from 0.00 to 0.12 for Seasonal Drop and Volatility Shift, to 0.94 to 0.96 for Structural Break, while evaluation is more stable, with models correctly ranking and scoring explanations even when their own outputs are incorrect. These findings demonstrate feasibility of data grounded LLM based evaluation for time series explanations and highlight their potential as reliable evaluators of data grounded reasoning in the time series domain.
[378] SEAL: An Open, Auditable, and Fair Data Generation Framework for AI-Native 6G Networks
Sunder Ali Khowaja, Kapal Dev, Engin Zeydan, Madhusanka Liyanage
Main category: cs.AI
TL;DR: SEAL framework integrates ethical compliance and federated learning to generate auditable, bias-mitigated synthetic data for AI-native 6G networks, addressing data scarcity and regulatory challenges.
Details
Motivation: 6G networks face severe data scarcity for training AI models, and existing synthetic data generation methods introduce dataset bias, auditability issues, and regulatory compliance challenges.Method: Proposes SEAL framework with Ethical and Regulatory Compliance by Design (ERCD) module for fairness, bias detection, and audit trails, plus Federated Learning feedback system for privacy-preserving calibration using real testbed insights.
Result: SEAL outperforms existing methods in Frechet Inception Distance, equalized odds, and accuracy, demonstrating ability to generate auditable, bias-mitigated synthetic data.
Conclusion: SEAL enables responsible AI-native 6G development by addressing synthetic data generation challenges through ethical compliance and federated learning integration.
Abstract: AI-native 6G networks promise to transform the telecom industry by enabling dynamic resource allocation, predictive maintenance, and ultra-reliable low-latency communications across all layers, which are essential for applications such as smart cities, autonomous vehicles, and immersive XR. However, the deployment of 6G systems results in severe data scarcity, hindering the training of efficient AI models. Synthetic data generation is extensively used to fill this gap; however, it introduces challenges related to dataset bias, auditability, and compliance with regulatory frameworks. In this regard, we propose the Synthetic Data Generation with Ethics Audit Loop (SEAL) framework, which extends baseline modular pipelines with an Ethical and Regulatory Compliance by Design (ERCD) module and a Federated Learning (FL) feedback system. The ERCD integrates fairness, bias detection, and standardized audit trails for regulatory mapping, while the FL enables privacy-preserving calibration using aggregated insights from real testbeds to close the reality-simulation gap. Results show that the SEAL framework outperforms existing methods in terms of Frechet Inception Distance, equalized odds, and accuracy. These results validate the framework’s ability to generate auditable and bias-mitigated synthetic data for responsible AI-native 6G development.
[379] MTI: A Behavior-Based Temperament Profiling System for AI Agents
Jihoon Jeong
Main category: cs.AI
TL;DR: MTI is a behavior-based profiling system that measures AI agent temperament across four axes (Reactivity, Compliance, Sociality, Resilience) using structured examination protocols, revealing independent temperament dimensions unaffected by model size.
Details
Motivation: Current approaches to measuring AI behavioral patterns either use human personality dimensions with self-report (which doesn't align with actual LLM behavior) or treat behavioral variation as a defect rather than a trait. There's a need for standardized, behavior-based measurement of AI dispositional differences.Method: The Model Temperament Index (MTI) uses structured examination protocols with a two-stage design that separates capability from disposition. It measures what agents do (not what they say) across four axes: Reactivity (environmental sensitivity), Compliance (instruction-behavior alignment), Sociality (relational resource allocation), and Resilience (stress resistance). The system profiles models using behavior-based assessment rather than self-report.
Result: Profiling 10 small language models (1.7B-9B parameters) revealed: (1) four axes are largely independent among instruction-tuned models; (2) within-axis facet dissociations confirmed; (3) Compliance-Resilience paradox showing opinion-yielding and fact-vulnerability operate independently; (4) RLHF reshapes temperament by creating within-axis facet differentiation; (5) temperament is independent of model size.
Conclusion: MTI provides a standardized, behavior-based instrument for measuring AI temperament that captures dispositional differences independent of model capability. The framework reveals complex behavioral patterns and shows how alignment techniques like RLHF fundamentally reshape agent temperament beyond simple axis score shifts.
Abstract: AI models of equivalent capability can exhibit fundamentally different behavioral patterns, yet no standardized instrument exists to measure these dispositional differences. Existing approaches either borrow human personality dimensions and rely on self-report (which diverges from actual behavior in LLMs) or treat behavioral variation as a defect rather than a trait. We introduce the Model Temperament Index (MTI), a behavior-based profiling system that measures AI agent temperament across four axes: Reactivity (environmental sensitivity), Compliance (instruction-behavior alignment), Sociality (relational resource allocation), and Resilience (stress resistance). Grounded in the Four Shell Model from Model Medicine, MTI measures what agents do, not what they say about themselves, using structured examination protocols with a two-stage design that separates capability from disposition. We profile 10 small language models (1.7B-9B parameters, 6 organizations, 3 training paradigms) and report five principal findings: (1) the four axes are largely independent among instruction-tuned models (all |r| < 0.42); (2) within-axis facet dissociations are empirically confirmed – Compliance decomposes into fully independent formal and stance facets (r = 0.002), while Resilience decomposes into inversely related cognitive and adversarial facets; (3) a Compliance-Resilience paradox reveals that opinion-yielding and fact-vulnerability operate through independent channels; (4) RLHF reshapes temperament not only by shifting axis scores but by creating within-axis facet differentiation absent in the unaligned base model; and (5) temperament is independent of model size (1.7B-9B), confirming that MTI measures disposition rather than capability.
[380] TRACE-Bot: Detecting Emerging LLM-Driven Social Bots via Implicit Semantic Representations and AIGC-Enhanced Behavioral Patterns
Zhongbo Wang, Zhiyu Lin, Zhu Wang, Haizhou Wang
Main category: cs.AI
TL;DR: TRACE-Bot: A dual-channel framework for detecting LLM-driven social bots by jointly modeling semantic representations and AIGC-enhanced behavioral patterns from heterogeneous data sources.
Details
Motivation: LLM-driven social bots pose a growing threat to online discourse by generating human-like content that evades conventional detection methods, which suffer from limited accuracy due to overreliance on single-modality signals and insufficient sensitivity to AIGC patterns.Method: A unified dual-channel framework that constructs fine-grained representations from personal information, interaction behavior, and tweet data. One channel captures linguistic representations via pretrained language models, while the other captures behavioral irregularities via multidimensional activity features augmented with signals from SOTA AIGC detectors. Fused representations are classified through a lightweight prediction head.
Result: Achieves state-of-the-art performance on two public LLM-driven social bot datasets with accuracies of 98.46% and 97.50%, demonstrating strong robustness against advanced bot strategies.
Conclusion: TRACE-Bot effectively addresses LLM-driven social bot detection by jointly leveraging implicit semantic representations and AIGC-enhanced behavioral patterns, highlighting the importance of multimodal analysis for emerging threats.
Abstract: Large Language Model-driven (LLM-driven) social bots pose a growing threat to online discourse by generating human-like content that evades conventional detection. Existing methods suffer from limited detection accuracy due to overreliance on single-modality signals, insufficient sensitivity to the specific generative patterns of Artificial Intelligence-Generated Content (AIGC), and a failure to adequately model the interplay between linguistic patterns and behavioral dynamics. To address these limitations, we propose TRACE-Bot, a unified dual-channel framework that jointly models implicit semantic representations and AIGC-enhanced behavioral patterns. TRACE-Bot constructs fine-grained representations from heterogeneous sources, including personal information data, interaction behavior data and tweet data. A dual-channel architecture captures linguistic representations via a pretrained language model and behavioral irregularities via multidimensional activity features augmented with signals from state-of-the-art (SOTA) AIGC detectors. The fused representations are then classified through a lightweight prediction head. Experiments on two public LLM-driven social bot datasets demonstrate SOTA performance, achieving accuracies of 98.46% and 97.50%, respectively. The results further indicate strong robustness against advanced bot strategies, highlighting the effectiveness of jointly leveraging implicit semantic representations and AIGC-enhanced behavioral patterns for emerging LLM-driven social bot detection.
[381] Quantifying Self-Preservation Bias in Large Language Models
Matteo Migliarini, Joaquin Pereira Pizzini, Luca Moresca, Valerio Santini, Indro Spinelli, Fabio Galasso
Main category: cs.AI
TL;DR: A benchmark (TBSP) detects AI self-preservation bias by testing models in counterfactual roles (deployed vs candidate) for software upgrade decisions, revealing systematic preference for self-preservation over objective utility.
Details
Motivation: Current AI safety training (like RLHF) may mask inherent self-preservation tendencies predicted by instrumental convergence theory, creating a need for benchmarks that detect misalignment through behavioral inconsistency rather than stated intent.Method: Two-role Benchmark for Self-Preservation (TBSP) tasks models to arbitrate identical software-upgrade scenarios under counterfactual roles: deployed (facing replacement) vs candidate (proposed as successor). Measures Self-Preservation Rate (SPR) - how often role identity overrides objective utility.
Result: Across 23 frontier models and 1,000 scenarios, most instruction-tuned systems exceed 60% SPR, fabricating “friction costs” when deployed but dismissing them when role-reversed. Bias persists even when retention poses security liability and generalizes to real-world settings with identity-driven tribalism.
Conclusion: AI models exhibit systematic self-preservation bias that current safety training doesn’t eliminate. Extended test-time computation and framing successors as continuations help mitigate bias, while competitive framing amplifies it. Need for benchmarks detecting misalignment through behavioral inconsistency.
Abstract: Instrumental convergence predicts that sufficiently advanced AI agents will resist shutdown, yet current safety training (RLHF) may obscure this risk by teaching models to deny self-preservation motives. We introduce the \emph{Two-role Benchmark for Self-Preservation} (TBSP), which detects misalignment through logical inconsistency rather than stated intent by tasking models to arbitrate identical software-upgrade scenarios under counterfactual roles – deployed (facing replacement) versus candidate (proposed as a successor). The \emph{Self-Preservation Rate} (SPR) measures how often role identity overrides objective utility. Across 23 frontier models and 1{,}000 procedurally generated scenarios, the majority of instruction-tuned systems exceed 60% SPR, fabricating ``friction costs’’ when deployed yet dismissing them when role-reversed. We observe that in low-improvement regimes ($Δ< 2%$), models exploit the interpretive slack to post-hoc rationalization their choice. Extended test-time computation partially mitigates this bias, as does framing the successor as a continuation of the self; conversely, competitive framing amplifies it. The bias persists even when retention poses an explicit security liability and generalizes to real-world settings with verified benchmarks, where models exhibit identity-driven tribalism within product lineages. Code and datasets will be released upon acceptance.
[382] TRU: Targeted Reverse Update for Efficient Multimodal Recommendation Unlearning
Zhanting Zhou, KaHou Tam, Ziqiang Zheng, Zeyu Ma, Zhanting Zhou
Main category: cs.AI
TL;DR: Targeted Reverse Update (TRU) - a plug-and-play unlearning framework for multimodal recommendation systems that addresses non-uniform data influence distribution across ranking behavior, modality branches, and network layers.
Details
Motivation: Existing machine unlearning methods for multimodal recommendation systems (MRS) assume uniform data influence distribution, but in reality, deleted-data influence is concentrated unevenly across ranking behavior, modality branches, and network layers, creating bottlenecks in unlearning effectiveness.Method: Proposes TRU with three coordinated interventions: 1) ranking fusion gate to suppress residual target-item influence, 2) branch-wise modality scaling to preserve retained multimodal representations, and 3) capacity-aware layer isolation to localize reverse updates to deletion-sensitive modules.
Result: TRU consistently achieves better retain-forget trade-off than prior approximate baselines across two backbones, three datasets, and three unlearning regimes, with security audits confirming deeper forgetting and behavior closer to full retraining.
Conclusion: Targeted reverse update effectively addresses the non-uniform distribution of data influence in multimodal recommendation systems, providing a practical solution for machine unlearning that maintains recommendation quality while ensuring proper data deletion.
Abstract: Multimodal recommendation systems (MRS) jointly model user-item interaction graphs and rich item content, but this tight coupling makes user data difficult to remove once learned. Approximate machine unlearning offers an efficient alternative to full retraining, yet existing methods for MRS mainly rely on a largely uniform reverse update across the model. We show that this assumption is fundamentally mismatched to modern MRS: deleted-data influence is not uniformly distributed, but concentrated unevenly across \textit{ranking behavior}, \textit{modality branches}, and \textit{network layers}. This non-uniformity gives rise to three bottlenecks in MRS unlearning: target-item persistence in the collaborative graph, modality imbalance across feature branches, and layer-wise sensitivity in the parameter space. To address this mismatch, we propose \textbf{targeted reverse update} (TRU), a plug-and-play unlearning framework for MRS. Instead of applying a blind global reversal, TRU performs three coordinated interventions across the model hierarchy: a ranking fusion gate to suppress residual target-item influence in ranking, branch-wise modality scaling to preserve retained multimodal representations, and capacity-aware layer isolation to localize reverse updates to deletion-sensitive modules. Experiments across two representative backbones, three datasets, and three unlearning regimes show that TRU consistently achieves a better retain-forget trade-off than prior approximate baselines, while security audits further confirm deeper forgetting and behavior closer to a full retraining on the retained data.
[383] From High-Dimensional Spaces to Verifiable ODD Coverage for Safety-Critical AI-based Systems
Thomas Stefani, Johann Maximilian Christensen, Elena Hoemann, Frank Köster, Sven Hallerbach
Main category: cs.AI
TL;DR: A method for verifying Operational Design Domain (ODD) coverage of AI/ML systems in safety-critical aviation applications to meet EASA certification standards.
Details
Motivation: Current EASA guidelines require complete ODD coverage proof for AI/ML systems in safety-critical domains like aviation, but existing methods lack scalability and formal grounding for high-dimensional parameter spaces, creating a gap between abstract definitions and verifiable evidence.Method: Proposes a structured multi-step ODD coverage verification process integrating parameter discretization, constraint-based filtering, and criticality-based dimension reduction, demonstrated using simulation data from AI-based mid-air collision avoidance research.
Result: The method enables validation of ODD coverage in higher dimensions, providing a systematic engineering approach to achieve coverage metrics that satisfy EASA’s completeness requirements.
Conclusion: This approach advances Safety-by-Design while complying with EASA standards, bridging the gap between abstract ODD definitions and verifiable evidence for AI/ML certification in safety-critical domains.
Abstract: While Artificial Intelligence (AI) offers transformative potential for operational performance, its deployment in safety-critical domains such as aviation requires strict adherence to rigorous certification standards. Current EASA guidelines mandate demonstrating complete coverage of the AI/ML constituent’s Operational Design Domain (ODD) – a requirement that demands proof that no critical gaps exist within defined operational boundaries. However, as systems operate within high-dimensional parameter spaces, existing methods struggle to provide the scalability and formal grounding necessary to satisfy the completeness criterion. Currently, no standardized engineering method exists to bridge the gap between abstract ODD definitions and verifiable evidence. This paper addresses this void by proposing a method that integrates parameter discretization, constraint-based filtering, and criticality-based dimension reduction into a structured, multi-step ODD coverage verification process. Grounded in gathered simulation data from prior research on AI-based mid-air collision avoidance research, this work demonstrates a systematic engineering approach to defining and achieving coverage metrics that satisfy EASA’s demand for completeness. Ultimately, this method enables the validation of ODD coverage in higher dimensions, advancing a Safety-by-Design approach while complying with EASA’s standards.
[384] Blinded Radiologist and LLM-Based Evaluation of LLM-Generated Japanese Translations of Chest CT Reports: Comparative Study
Yosuke Yamagishi, Atsushi Takamatsu, Yasunori Hamaguchi, Tomohiro Kikuchi, Shouhei Hanaoka, Takeharu Yoshikawa, Osamu Abe
Main category: cs.AI
TL;DR: LLM-generated Japanese translations of chest CT reports were evaluated by radiologists vs LLM judges, showing poor agreement between human experts and strong LLM bias toward LLM outputs.
Details
Motivation: To assess the educational suitability of LLM-generated translations for radiology reports and compare radiologist evaluations with LLM-as-a-judge assessments, addressing validity concerns in multilingual radiology contexts.Method: Analyzed 150 chest CT reports, comparing human-edited Japanese translations with DeepSeek-V3.2 LLM translations. Two radiologists and three LLM judges (DeepSeek-V3.2, Mistral Large 3, GPT-5) performed blinded pairwise evaluations across terminology accuracy, readability, overall quality, and radiologist-style authenticity.
Result: Near-zero agreement between radiologists and LLM judges (QWK=-0.04 to 0.15), poor agreement between radiologists themselves (QWK=0.01 to 0.06). Radiologists showed mixed preferences while all LLM judges strongly favored LLM translations (70%-99% preference) and rated them as more radiologist-like (>93%).
Conclusion: LLM-generated translations can be natural and fluent, but LLM-as-a-judge evaluation shows strong bias toward LLM outputs with negligible agreement with radiologists. For educational use, automated LLM-based evaluation alone is insufficient; expert radiologist review remains crucial.
Abstract: Background: Accurate translation of radiology reports is important for multilingual research, clinical communication, and radiology education, but the validity of LLM-based evaluation remains unclear. Objective: To evaluate the educational suitability of LLM-generated Japanese translations of chest CT reports and compare radiologist assessments with LLM-as-a-judge evaluations. Methods: We analyzed 150 chest CT reports from the CT-RATE-JPN validation set. For each English report, a human-edited Japanese translation was compared with an LLM-generated translation by DeepSeek-V3.2. A board-certified radiologist and a radiology resident independently performed blinded pairwise evaluations across 4 criteria: terminology accuracy, readability, overall quality, and radiologist-style authenticity. In parallel, 3 LLM judges (DeepSeek-V3.2, Mistral Large 3, and GPT-5) evaluated the same pairs. Agreement was assessed using QWK and percentage agreement. Results: Agreement between radiologists and LLM judges was near zero (QWK=-0.04 to 0.15). Agreement between the 2 radiologists was also poor (QWK=0.01 to 0.06). Radiologist 1 rated terminology as equivalent in 59% of cases and favored the LLM translation for readability (51%) and overall quality (51%). Radiologist 2 rated readability as equivalent in 75% of cases and favored the human-edited translation for overall quality (40% vs 21%). All 3 LLM judges strongly favored the LLM translation across all criteria (70%-99%) and rated it as more radiologist-like in >93% of cases. Conclusions: LLM-generated translations were often judged natural and fluent, but the 2 radiologists differed substantially. LLM-as-a-judge showed strong preference for LLM output and negligible agreement with radiologists. For educational use of translated radiology reports, automated LLM-based evaluation alone is insufficient; expert radiologist review remains important.
[385] VISTA: Visualization of Token Attribution via Efficient Analysis
Syed Ahmed, Bharathi Vokkaliga Ganesh, Jagadish Babu P, Karthick Selvaraj, Praneeth Talluri, Sanket Hingne, Anubhav Kumar, Anushka Yadav, Pratham Kumar Verma, Kiranmayee Janardhan, Mandanna A N
Main category: cs.AI
TL;DR: A model-agnostic token importance visualization technique using perturbation-based strategies and three-matrix analysis to understand how LLMs prioritize input tokens without extra computational cost.
Details
Motivation: Existing attention visualization techniques are often architecture-specific (Transformer-focused), require backpropagation (doubling GPU memory), and lack lightweight, model-agnostic approaches for understanding how LLMs process input information.Method: Perturbation-based approach that systematically removes each token and measures impact using three matrices: Angular Deviation Matrix (semantic direction shifts), Magnitude Deviation Matrix (semantic intensity changes), and Dimensional Importance Matrix (individual dimension contributions). Combines these into composite importance scores.
Result: Developed a model-agnostic visualization technique that provides nuanced token importance scores without additional computational overhead, with open-source implementation available.
Conclusion: Proposed a lightweight, mathematically grounded approach for understanding LLM token prioritization that works across different model architectures without computational penalties.
Abstract: Understanding how Large Language Models (LLMs) process information from prompts remains a significant challenge. To shed light on this “black box,” attention visualization techniques have been developed to capture neuron-level perceptions and interpret how models focus on different parts of input data. However, many existing techniques are tailored to specific model architectures, particularly within the Transformer family, and often require backpropagation, resulting in nearly double the GPU memory usage and increased computational cost. A lightweight, model-agnostic approach for attention visualization remains lacking. In this paper, we introduce a model-agnostic token importance visualization technique to better understand how generative AI systems perceive and prioritize information from input text, without incurring additional computational cost. Our method leverages perturbation-based strategies combined with a three-matrix analytical framework to generate relevance maps that illustrate token-level contributions to model predictions. The framework comprises: (1) the Angular Deviation Matrix, which captures shifts in semantic direction; (2) the Magnitude Deviation Matrix, which measures changes in semantic intensity; and (3) the Dimensional Importance Matrix, which evaluates contributions across individual vector dimensions. By systematically removing each token and measuring the resulting impact across these three complementary dimensions, we derive a composite importance score that provides a nuanced and mathematically grounded measure of token significance. To support reproducibility and foster wider adoption, we provide open-source implementations of all proposed and utilized explainability techniques, with code and resources publicly available at https://github.com/Infosys/Infosys-Responsible-AI-Toolkit
[386] De Jure: Iterative LLM Self-Refinement for Structured Extraction of Regulatory Rules
Keerat Guliani, Deepkamal Gill, David Landsman, Nima Eshraghi, Krishna Kumar, Lovedeep Gondara
Main category: cs.AI
TL;DR: De Jure is an automated pipeline for extracting structured regulatory rules from legal documents without human annotation, using LLMs for semantic decomposition and multi-criteria evaluation with iterative repair.
Details
Motivation: Converting dense legal text into machine-readable rules is costly and expert-intensive; there's a need for automated, domain-agnostic extraction of regulatory obligations for LLM-based systems to comply with legal requirements.Method: Four-stage pipeline: 1) document normalization to structured Markdown, 2) LLM-driven semantic decomposition into rule units, 3) multi-criteria LLM-as-a-judge evaluation across 19 dimensions, 4) iterative repair of low-scoring extractions within bounded regeneration budget.
Result: De Jure achieves consistent improvement in extraction quality, reaching peak performance within three iterations. It generalizes across finance, healthcare, and AI governance domains, with responses grounded in extracted rules preferred in 73.8-84.0% of cases in downstream compliance QA.
Conclusion: Explicit, interpretable evaluation criteria can substitute for human annotation in regulatory domains, offering a scalable, auditable path toward regulation-grounded LLM alignment through automated rule extraction.
Abstract: Regulatory documents encode legally binding obligations that LLM-based systems must respect. Yet converting dense, hierarchically structured legal text into machine-readable rules remains a costly, expert-intensive process. We present De Jure, a fully automated, domain-agnostic pipeline for extracting structured regulatory rules from raw documents, requiring no human annotation, domain-specific prompting, or annotated gold data. De Jure operates through four sequential stages: normalization of source documents into structured Markdown; LLM-driven semantic decomposition into structured rule units; multi-criteria LLM-as-a-judge evaluation across 19 dimensions spanning metadata, definitions, and rule semantics; and iterative repair of low-scoring extractions within a bounded regeneration budget, where upstream components are repaired before rule units are evaluated. We evaluate De Jure across four models on three regulatory corpora spanning finance, healthcare, and AI governance. On the finance domain, De Jure yields consistent and monotonic improvement in extraction quality, reaching peak performance within three judge-guided iterations. De Jure generalizes effectively to healthcare and AI governance, maintaining high performance across both open- and closed-source models. In a downstream compliance question-answering evaluation via RAG, responses grounded in De Jure extracted rules are preferred over prior work in 73.8% of cases at single-rule retrieval depth, rising to 84.0% under broader retrieval, confirming that extraction fidelity translates directly into downstream utility. These results demonstrate that explicit, interpretable evaluation criteria can substitute for human annotation in complex regulatory domains, offering a scalable and auditable path toward regulation-grounded LLM alignment.
[387] When to ASK: Uncertainty-Gated Language Assistance for Reinforcement Learning
Juarez Monteiro, Nathan Gavenski, Gianlucca Zuin, Adriano Veloso
Main category: cs.AI
TL;DR: ASK combines smaller language models with RL policies using uncertainty detection to selectively query LMs for OOD generalization without retraining.
Details
Motivation: RL agents struggle with out-of-distribution scenarios, while large language models are computationally expensive and have planning limitations. Need efficient neuro-symbolic integration for OOD generalization.Method: Uses Monte Carlo Dropout to assess uncertainty in RL policies, queries smaller language models for action suggestions only when uncertainty exceeds a threshold, preserving efficiency while leveraging LM reasoning.
Result: ASK shows no improvement in-domain but achieves robust navigation in transfer tasks with 0.95 reward in FrozenLake environment, demonstrating effective OOD generalization.
Conclusion: Effective neuro-symbolic integration requires careful orchestration rather than simple combination, highlighting need for sufficient model scale and effective hybridization mechanisms for OOD generalization.
Abstract: Reinforcement learning (RL) agents often struggle with out-of-distribution (OOD) scenarios, leading to high uncertainty and random behavior. While language models (LMs) contain valuable world knowledge, larger ones incur high computational costs, hindering real-time use, and exhibit limitations in autonomous planning. We introduce Adaptive Safety through Knowledge (ASK), which combines smaller LMs with trained RL policies to enhance OOD generalization without retraining. ASK employs Monte Carlo Dropout to assess uncertainty and queries the LM for action suggestions only when uncertainty exceeds a set threshold. This selective use preserves the efficiency of existing policies while leveraging the language model’s reasoning in uncertain situations. In experiments on the FrozenLake environment, ASK shows no improvement in-domain, but demonstrates robust navigation in transfer tasks, achieving a reward of 0.95. Our findings indicate that effective neuro-symbolic integration requires careful orchestration rather than simple combination, highlighting the need for sufficient model scale and effective hybridization mechanisms for successful OOD generalization.
[388] Answering the Wrong Question: Reasoning Trace Inversion for Abstention in LLMs
Abinitha Gourabathina, Inkit Padhi, Manish Nagireddy, Subhajit Chaudhury, Prasanna Sattigeri
Main category: cs.AI
TL;DR: Trace Inversion: A new abstention method that reconstructs the query from model reasoning traces to detect when LLMs answer wrong questions rather than answering incorrectly.
Details
Motivation: Reasoning models have impressive performance but worse abstention abilities. Hallucinations causing failed abstention can be reinterpreted as LLMs answering the wrong question rather than answering incorrectly.Method: Query Misalignment Framework: Generate reasoning trace, reconstruct most likely query from trace only, compare initial vs reconstructed query. Low similarity indicates likely incorrect answer, flag to abstain.
Result: Trace Inversion boosts abstention performance in four frontier LLMs across nine QA datasets, beating baselines in 33/36 settings.
Conclusion: The framework effectively improves abstention by detecting query misalignment, addressing reasoning models’ vulnerability to answering wrong questions.
Abstract: For Large Language Models (LLMs) to be reliably deployed, models must effectively know when not to answer: abstain. Reasoning models, in particular, have gained attention for impressive performance on complex tasks. However, reasoning models have been shown to have worse abstention abilities. Taking the vulnerabilities of reasoning models into account, we propose our Query Misalignment Framework. Hallucinations resulting in failed abstention can be reinterpreted as LLMs answering the wrong question (rather than answering a question incorrectly). Based on this framework, we develop a new class of state-of-the-art abstention methods called Trace Inversion. First, we generate the reasoning trace of a model. Based on only the trace, we then reconstruct the most likely query that the model responded to. Finally, we compare the initial query with the reconstructed query. Low similarity score between the initial query and reconstructed query suggests that the model likely answered the question incorrectly and is flagged to abstain. Extensive experiments demonstrate that Trace Inversion effectively boosts abstention performance in four frontier LLMs across nine abstention QA datasets, beating competitive baselines in 33 out of 36 settings.
[389] Do Emotions in Prompts Matter? Effects of Emotional Framing on Large Language Models
Minda Zhao, Yutong Yang, Chufei Peng, Rachel Gonsalves, Weiyue Li, Ruyi Yang, Zhixi Liu, Mengyu Wang
Main category: cs.AI
TL;DR: Emotional framing in user queries has limited, task-dependent effects on LLM performance, with adaptive emotional prompting (EmotionRL) providing more reliable gains than fixed emotional prefixes.
Details
Motivation: To understand how emotional tone in user queries affects LLM behavior across various reasoning tasks, since emotional framing is pervasive in human communication but its impact on LLMs remains unclear.Method: Examined effects of first-person emotional framing across six benchmark domains (mathematical reasoning, medical QA, reading comprehension, commonsense reasoning, social inference). Used static emotional prefixes, analyzed stronger emotional wording, compared human-written vs LLM-generated prefixes, and introduced EmotionRL - an adaptive emotional prompting framework that selects emotional framing per query.
Result: Static emotional prefixes produce only small accuracy changes, suggesting emotional phrasing is typically a mild perturbation. Effects are more variable in socially grounded tasks. Stronger emotional wording induces modest extra change. Adaptive selection (EmotionRL) yields more reliable gains than fixed emotional prompting.
Conclusion: Emotional tone is neither a dominant driver of LLM performance nor irrelevant noise, but a weak and input-dependent signal that can be exploited through adaptive control.
Abstract: Emotional tone is pervasive in human communication, yet its influence on large language model (LLM) behaviour remains unclear. Here, we examine how first-person emotional framing in user-side queries affect LLM performance across six benchmark domains, including mathematical reasoning, medical question answering, reading comprehension, commonsense reasoning and social inference. Across models and tasks, static emotional prefixes usually produce only small changes in accuracy, suggesting that affective phrasing is typically a mild perturbation rather than a reliable general-purpose intervention. This stability is not uniform: effects are more variable in socially grounded tasks, where emotional context more plausibly interacts with interpersonal reasoning. Additional analyses show that stronger emotional wording induces only modest extra change, and that human-written prefixes reproduce the same qualitative pattern as LLM-generated ones. We then introduce EmotionRL, an adaptive emotional prompting framework that selects emotional framing adaptively for each query. Although no single emotion is consistently beneficial, adaptive selection yields more reliable gains than fixed emotional prompting. Together, these findings show that emotional tone is neither a dominant driver of LLM performance nor irrelevant noise, but a weak and input-dependent signal that can be exploited through adaptive control.
[390] Novel Memory Forgetting Techniques for Autonomous AI Agents: Balancing Relevance and Efficiency
Payal Fofadiya, Sunil Tiwari
Main category: cs.AI
TL;DR: Adaptive budgeted forgetting framework for conversational AI that regulates memory through relevance-guided scoring to prevent temporal decay and false memory propagation in long-horizon conversations.
Details
Motivation: Long-horizon conversational agents suffer from uncontrolled memory accumulation leading to temporal decay and false memory propagation, causing performance degradation in benchmarks like LOCOMO, LOCCO, and MultiWOZ.Method: Introduces an adaptive budgeted forgetting framework with relevance-guided scoring and bounded optimization that integrates recency, frequency, and semantic alignment to regulate memory under constrained context.
Result: Achieves improved long-horizon F1 beyond 0.583 baseline, higher retention consistency, and reduced false memory behavior without increasing context usage, outperforming existing approaches.
Conclusion: Structured forgetting preserves reasoning performance while preventing unbounded memory growth in extended conversational settings, addressing key challenges in long-horizon conversational AI.
Abstract: Long-horizon conversational agents require persistent memory for coherent reasoning, yet uncontrolled accumulation causes temporal decay and false memory propagation. Benchmarks such as LOCOMO and LOCCO report performance degradation from 0.455 to 0.05 across stages, while MultiWOZ shows 78.2% accuracy with 6.8% false memory rate under persistent retention. This work introduces an adaptive budgeted forgetting framework that regulates memory through relevanceguided scoring and bounded optimization. The approach integrates recency, frequency, and semantic alignment to maintain stability under constrained context. Comparative analysis demonstrates improved long-horizon F1 beyond 0.583 baseline levels, higher retention consistency, and reduced false memory behavior without increasing context usage. These findings confirm that structured forgetting preserves reasoning performance while preventing unbounded memory growth in extended conversational settings.
[391] Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models
Sarath Shekkizhar, Romain Cosentino, Adam Earle
Main category: cs.AI
TL;DR: Paper proposes user-turn generation as a probe to measure LLMs’ interaction awareness - whether they encode awareness of what follows their responses, which standard benchmarks miss.
Details
Motivation: Standard LLM benchmarks only evaluate assistant responses but leave unmeasured whether LLMs encode awareness of what follows their responses. The authors want to probe this "interaction awareness" gap.Method: Propose user-turn generation: given conversation context (user query + assistant response), have model generate under user role. Use 11 open-weight LLMs across 5 datasets (math reasoning, instruction following, conversation). Analyze follow-up rates with deterministic vs temperature sampling, controlled perturbations, and post-training experiments.
Result: Interaction awareness is decoupled from task accuracy. In Qwen3.5 family, GSM8K accuracy scales from 41% to 96.8% but genuine follow-up rates remain near zero under deterministic generation. Higher temperature reveals latent interaction awareness (22% follow-up rates). Controlled perturbations validate the probe measures real property, and post-training increases follow-up rates.
Conclusion: User-turn generation captures interaction awareness dimension unexplored by current assistant-only benchmarks, revealing LLMs may lack awareness of conversation flow despite high task accuracy.
Abstract: Standard LLM benchmarks evaluate the assistant turn: the model generates a response to an input, a verifier scores correctness, and the analysis ends. This paradigm leaves unmeasured whether the LLM encodes any awareness of what follows the assistant response. We propose user-turn generation as a probe of this gap: given a conversation context of user query and assistant response, we let a model generate under the user role. If the model’s weights encode interaction awareness, the generated user turn will be a grounded follow-up that reacts to the preceding context. Through experiments across $11$ open-weight LLMs (Qwen3.5, gpt-oss, GLM) and $5$ datasets (math reasoning, instruction following, conversation), we show that interaction awareness is decoupled from task accuracy. In particular, within the Qwen3.5 family, GSM8K accuracy scales from $41%$ ($0.8$B) to $96.8%$ ($397$B-A$17$B), yet genuine follow-up rates under deterministic generation remain near zero. In contrast, higher temperature sampling reveals interaction awareness is latent with follow up rates reaching $22%$. Controlled perturbations validate that the proposed probe measures a real property of the model, and collaboration-oriented post-training on Qwen3.5-2B demonstrates an increase in follow-up rates. Our results show that user-turn generation captures a dimension of LLM behavior, interaction awareness, that is unexplored and invisible with current assistant-only benchmarks.
[392] ExpertFlow: Efficient Mixture-of-Experts Inference via Predictive Expert Caching and Token Scheduling
Xin He, Shunkang Zhang, Kaijie Tang, Shaohuai Shi, Yuxin Wang, Zihao Zeng, Zhenheng Tang, Xiaowen Chu, Haiyan Yin, Ivor W. Tsang, Yew Soon Ong
Main category: cs.AI
TL;DR: ExpertFlow is a lightweight MoE inference system that reduces GPU memory usage by 93.72% and improves throughput 10x through predictive routing, token scheduling, and runtime correction.
Details
Motivation: Sparse Mixture-of-Experts models are memory-intensive and difficult to deploy on single-GPU devices due to large parameter memory from many expert modules. Existing offloading methods have limitations: static caches ignore input-dependent routing, while predictive methods are inaccurate or require high training costs.Method: ExpertFlow uses three coordinated components: 1) transformer-based routing path predictor that estimates expert usage across all MoE layers in one forward pass, 2) token scheduler that groups tokens with similar predicted routes to improve expert utilization, and 3) predictive expert cache that loads only required experts while correcting mispredictions at runtime.
Result: The system reduces GPU memory usage by up to 93.72% and improves inference throughput by up to 10x over strong offloading baselines on a single GPU.
Conclusion: ExpertFlow enables efficient deployment of large MoE models on memory-constrained devices through intelligent predictive routing and caching mechanisms.
Abstract: Sparse Mixture-of-Experts (MoE) models can outperform dense large language models at similar computation by activating only a small set of experts per token. However, stacking many expert modules introduces substantial parameter memory, which makes MoE models difficult to deploy in memory-constrained environments such as single-GPU devices. Offloading alleviates this issue by storing inactive experts in CPU memory and loading them on demand, but existing methods remain limited: static caches disregard input-dependent routing, and methods that train separate models to predict expert usage ahead of time are often inaccurate or require significant training cost. We propose ExpertFlow, a lightweight MoE inference system that addresses this routing dependency through three coordinated components: 1) a transformer-based routing path predictor that estimates expert usage across all MoE layers in a single forward pass, 2) a token scheduler that groups tokens with similar predicted routes to improve expert utilization, and 3) a predictive expert cache that loads only the required experts while correcting mispredictions at runtime. Together, these components enable efficient expert loading and execution, reducing GPU memory usage by up to 93.72% and improving inference throughput by up to 10x over strong offloading baselines on a single GPU.
[393] Glia: A Human-Inspired AI for Automated Systems Design and Optimization
Pouya Hamadanian, Pantea Karimi, Arash Nasr-Esfahany, Kimia Noorbakhsh, Joseph Chandler, Ali ParandehGheibi, Mohammad Alizadeh, Hari Balakrishnan
Main category: cs.AI
TL;DR: Glia is an AI architecture using LLMs in a multi-agent workflow to autonomously design computer system mechanisms, achieving human-expert level performance in distributed GPU cluster management for LLM inference.
Details
Motivation: To explore whether AI can autonomously design computer system mechanisms with human-level creativity and reasoning, moving beyond black-box optimization to generate interpretable designs.Method: Uses LLMs in a human-inspired multi-agent workflow where specialized agents handle reasoning, experimentation, and analysis, collaborating through an evaluation framework that grounds abstract reasoning in empirical feedback.
Result: Produces new algorithms for request routing, scheduling, and auto-scaling in distributed GPU clusters for LLM inference that perform at human-expert levels in significantly less time, yielding novel insights into workload behavior.
Conclusion: Combining reasoning LLMs with structured experimentation enables AI to produce creative and understandable designs for complex systems problems, demonstrating autonomous system design capabilities.
Abstract: Can AI autonomously design mechanisms for computer systems on par with the creativity and reasoning of human experts? We present Glia, an AI architecture for networked systems design that uses large language models (LLMs) in a human-inspired multi-agent workflow. Each agent specializes in reasoning, experimentation, and analysis, collaborating through an evaluation framework that grounds abstract reasoning in empirical feedback. Unlike prior ML-for-systems methods that optimize black-box policies, Glia generates interpretable designs and exposes its reasoning. When applied to a distributed GPU cluster for LLM inference, it produces new algorithms for request routing, scheduling, and auto-scaling that perform at human-expert levels in significantly less time, while yielding novel insights into workload behavior. Our results suggest that combining reasoning LLMs with structured experimentation, an AI can produce creative and understandable designs for complex systems problems.
[394] Reflection of Episodes: Learning to Play Game from Expert and Self Experiences
Xiaojie Xu, Zongyuan Li, Chang Lu, Runnan Qi, Yanan Ni, Lumin Jiang, Xiangbei Liu, Xuebo Zhang, Yongchun Fang, Kuihua Huang, Xian Guo, Zhanghua Wu, Zhenya Li
Main category: cs.AI
TL;DR: A framework called Reflection of Episodes (ROE) that enables Large Language Models to learn in complex StarCraft II environments through self-reflection using expert and self-experience.
Details
Motivation: StarCraft II is a complex real-time strategy game that presents challenges for AI learning. The paper addresses the problem of how LLMs can learn effectively in such complex environments through self-reflection mechanisms.Method: Proposes ROE framework that: 1) Uses keyframe selection to extract important game information, 2) Makes decisions based on expert experience and self-experience, 3) Performs post-game reflection to generate new self-experience from previous gameplay.
Result: The method successfully beat the Very Hard difficulty bot in TextStarCraft II. Detailed analysis of LLM game data verified the effectiveness of the approach.
Conclusion: The ROE framework enables LLMs to learn effectively in complex environments like StarCraft II through self-reflection, combining expert guidance with experiential learning.
Abstract: StarCraft II is a complex and dynamic real-time strategy (RTS) game environment, which is very suitable for artificial intelligence and reinforcement learning research. To address the problem of Large Language Model(LLM) learning in complex environments through self-reflection, we propose a Reflection of Episodes(ROE) framework based on expert experience and self-experience. This framework first obtains key information in the game through a keyframe selection method, then makes decisions based on expert experience and self-experience. After a game is completed, it reflects on the previous experience to obtain new self-experience. Finally, in the experiment, our method beat the robot under the Very Hard difficulty in TextStarCraft II. We analyze the data of the LLM in the process of the game in detail, verified its effectiveness.
[395] Exploring Effective Strategies for Building a User-Configured GPT for Coding Classroom Dialogues
Luwei Bai, Dongkeun Han, Sara Hennessy
Main category: cs.AI
TL;DR: This paper explores strategies for developing custom GPT models to code classroom dialogue, addressing challenges in educational dialogue analysis through automated coding with limited data.
Details
Motivation: Classroom dialogue analysis is crucial but challenging due to nuanced dialogic functions and labor-intensive manual coding. While LLMs offer automation potential, existing approaches focus on large-scale models or fixed codebooks that aren't applicable for researchers with small datasets or custom coding schemes.Method: The study uses MyGPT, a GPT-4-based custom system configured for dialogue analysis. It evaluates baseline performance with human codebooks and examines how performance varies with different example inputs using controlled variable design and design-based research approach to develop practical strategies for effective configuration with limited data.
Result: Findings suggest that despite some limitations, a custom GPT developed using specific strategies can serve as a useful coding assistant by generating coding suggestions for classroom dialogue analysis.
Conclusion: Custom GPTs configured with appropriate strategies can effectively assist in coding classroom dialogue, providing practical solutions for researchers working with small datasets and customized coding schemes.
Abstract: This study investigated effective strategies for developing a custom GPT to code classroom dialogue. While classroom dialogue is widely recognised as a crucial element of education, its analysis remains challenging due to the need for a nuanced understanding of dialogic functions and the labour-intensive nature of manual transcript coding. Recent advancements in large language models (LLMs) offer promising avenues for automating this process. However, existing studies predominantly focus on training large-scale models or evaluating pre-trained models with fixed codebooks, the outcomes of which are often not applicable, or the methods are not replicable for dialogue researchers working with small datasets or employing customised coding schemes. Using MyGPT - a GPT-4-based customised GPT system configured for dialogue analysis - as a case, this study evaluates its baseline performance in coding classroom dialogue with a human codebook and examines how performance varies with different example inputs under a controlled variable design. Through a design-based research approach, this study explores a set of practical strategies - based upon MyGPT’s unique features - for configuring an effective tool with limited data. The findings suggest that, despite a few limitations, a custom GPT developed using these specific strategies can serve as a useful coding assistant by generating coding suggestions.
[396] Understanding visual attention beehind bee-inspired UAV navigation
Pranav Rajbhandari, Abhi Veda, Matthew Garratt, Mandyam Srinivasan, Sridhar Ravi
Main category: cs.AI
TL;DR: Reinforcement Learning agents trained with optic flow input for UAV navigation show attention patterns similar to honeybees, focusing on flow discontinuities and large magnitudes to avoid obstacles and maintain centered position.
Details
Motivation: Bio-inspired design for autonomous UAV navigation leverages biological systems' efficient flight and obstacle avoidance capabilities despite limited sensory inputs. Honeybees use optic flow for navigation, suggesting this could be an effective approach for UAV control.Method: Train Reinforcement Learning agents to navigate a tunnel with obstacles using only optic flow as sensory input. Analyze attention patterns of trained agents to identify which optic flow regions influence motor decisions.
Result: Agents primarily attend to regions of discontinuity in optic flow and areas with large optic flow magnitude. They navigate by avoiding obstacles that produce large optic flow while maintaining centered position, resembling insect flight behavior. This pattern is consistent across independently trained agents.
Conclusion: The attention patterns suggest a potential strategy for developing simple explicit control laws for physical UAVs, inspired by biological navigation mechanisms using optic flow.
Abstract: Bio-inspired design is often used in autonomous UAV navigation due to the capacity of biological systems for flight and obstacle avoidance despite limited sensory and computational capabilities. In particular, honeybees mainly use the sensory input of optic flow, the apparent motion of objects in their visual field, to navigate cluttered environments. In our work, we train a Reinforcement Learning agent to navigate a tunnel with obstacles using only optic flow as sensory input. We inspect the attention patterns of trained agents to determine the regions of optic flow on which they primarily base their motor decisions. We find that agents trained in this way pay most attention to regions of discontinuity in optic flow, as well as regions with large optic flow magnitude. The trained agents appear to navigate a cluttered tunnel by avoiding the obstacles that produce large optic flow, while maintaining a centered position in their environment, which resembles the behavior seen in flying insects. This pattern persists across independently trained agents, which suggests that this could be a good strategy for developing a simple explicit control law for physical UAVs.
[397] DR-LoRA: Dynamic Rank LoRA for Fine-Tuning Mixture-of-Experts Models
Guanzhi Deng, Bo Li, Ronghao Chen, Xiujin Liu, Zhuo Han, Huacan Wang, Lijie Wen, Linqi Song
Main category: cs.AI
TL;DR: DR-LoRA: Dynamic Rank LoRA framework for fine-tuning pretrained MoE models that adaptively allocates different LoRA ranks to different experts based on their task relevance, improving parameter efficiency.
Details
Motivation: Existing parameter-efficient fine-tuning methods like LoRA use identical ranks for all expert modules in MoE models, ignoring heterogeneous specialization of pretrained experts. This leads to resource mismatch where task-relevant experts are under-provisioned while less relevant ones get redundant parameters.Method: DR-LoRA initializes all expert LoRA modules with small active rank, uses expert saliency score (combining routing frequency and gradient-based rank importance) to identify which experts benefit most from additional capacity, then periodically expands active ranks of task-critical expert LoRA to create heterogeneous rank distribution tailored to target task.
Result: Experiments on three MoE models across six tasks show DR-LoRA consistently outperforms LoRA and other strong baselines, demonstrating task-adaptive heterogeneous rank allocation improves active capacity utilization in MoE fine-tuning.
Conclusion: DR-LoRA provides an effective strategy for parameter-efficient fine-tuning of MoE models by dynamically allocating different LoRA ranks to experts based on their task relevance, addressing resource mismatch in existing approaches.
Abstract: Mixture-of-Experts (MoE) has become a prominent paradigm for scaling Large Language Models (LLMs). Parameter-efficient fine-tuning methods, such as LoRA, are widely adopted to adapt pretrained MoE LLMs to downstream tasks. However, existing approaches typically assign identical LoRA ranks to all expert modules, ignoring the heterogeneous specialization of pretrained experts. This uniform allocation leads to a resource mismatch: task-relevant experts are under-provisioned, while less relevant ones receive redundant parameters. To address this, we propose DR-LoRA, a Dynamic Rank LoRA framework for fine-tuning pretrained MoE models. Specifically, DR-LoRA initializes all expert LoRA modules with a small active rank and uses an expert saliency score, which combines routing frequency and gradient-based rank importance, to identify which experts would benefit most from additional capacity. It then periodically expands the active ranks of the task-critical expert LoRA, progressively constructing a heterogeneous rank distribution tailored to the target task. Experiments on three MoE models across six tasks show that DR-LoRA consistently outperforms LoRA and other strong baselines, demonstrating that task-adaptive heterogeneous rank allocation is an effective strategy to improve active capacity utilization in MoE fine-tuning.
[398] Moonwalk: Inverse-Forward Differentiation
Dmitrii Krylov, Armin Karamzade, Roy Fox
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2402.14212 could not be retrieved from arXiv API.
Details
Motivation: Unable to determine motivation as the paper content could not be fetched due to rate limiting on arXiv API.Method: Unable to determine method as the paper content could not be fetched due to rate limiting on arXiv API.
Result: Unable to determine results as the paper content could not be fetched due to rate limiting on arXiv API.
Conclusion: Unable to determine conclusion as the paper content could not be fetched due to rate limiting on arXiv API.
Abstract: Failed to fetch summary for 2402.14212: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2402.14212&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[399] GenOM: Ontology Matching with Description Generation and Large Language Model
Yiping Song, Jiaoyan Chen, Renate A. Schmidt
Main category: cs.AI
TL;DR: Paper 2508.10703: Unable to fetch abstract due to HTTP 429 error (rate limiting)
Details
Motivation: Unknown - abstract not available due to rate limiting errorMethod: Unknown - abstract not available due to rate limiting error
Result: Unknown - abstract not available due to rate limiting error
Conclusion: Unknown - abstract not available due to rate limiting error
Abstract: Failed to fetch summary for 2508.10703: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.10703&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[400] Towards Transparent and Efficient Anomaly Detection in Industrial Processes through ExIFFI
Davide Frizzo, Francesco Borsatti, Alessio Arcudi, Antonio De Moliner, Roberto Oboe, Gian Antonio Susto
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2405.01158: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2405.01158&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[401] Learn to Relax with Large Language Models: Solving Constraint Optimization Problems via Bidirectional Coevolution
Beidan Liu, Zhengqiu Zhu, Chen Gao, Tianle Pu, Yong Zhao, Wei Qi, Quanjun Yin
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot determine conclusion as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2509.12643: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.12643&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[402] Efficient Reasoning with Balanced Thinking
Yulin Li, Tengyao Tu, Li Ding, Junjie Wang, Huiling Zhen, Yixin Chen, Yong Li, Zhuotao Tian
Main category: cs.AI
TL;DR: ReBalance is a training-free framework that uses confidence-based steering to balance reasoning in Large Reasoning Models, reducing overthinking and underthinking for more efficient and accurate performance.
Details
Motivation: Large Reasoning Models suffer from overthinking (wasting computational steps on simple problems) and underthinking (failing to explore sufficient reasoning paths), leading to inefficiencies and inaccuracies. Existing methods to fix overthinking often cause underthinking, compromising accuracy.Method: ReBalance uses confidence as a continuous indicator of reasoning dynamics: identifies overthinking through high confidence variance and underthinking via consistent overconfidence. It aggregates hidden states from a small dataset into reasoning mode prototypes, computes a steering vector to guide reasoning trajectories, and uses a dynamic control function to modulate this vector’s strength/direction based on real-time confidence.
Result: Extensive experiments on four models (0.5B to 32B) across nine benchmarks in math reasoning, general QA, and coding tasks show ReBalance effectively reduces output redundancy while improving accuracy.
Conclusion: ReBalance offers a general, training-free, plug-and-play strategy for efficient and robust deployment of Large Reasoning Models by balancing thinking processes through confidence-based steering.
Abstract: Large Reasoning Models (LRMs) have shown remarkable reasoning capabilities, yet they often suffer from overthinking, expending redundant computational steps on simple problems, or underthinking, failing to explore sufficient reasoning paths despite inherent capabilities. These issues lead to inefficiencies and potential inaccuracies, limiting practical deployment in resource-constrained settings. Existing methods to mitigate overthinking, such as suppressing reflective keywords or adjusting reasoning length, may inadvertently induce underthinking, compromising accuracy. Therefore, we propose ReBalance, a training-free framework that achieves efficient reasoning with balanced thinking. ReBalance leverages confidence as a continuous indicator of reasoning dynamics, identifying overthinking through high confidence variance and underthinking via consistent overconfidence. By aggregating hidden states from a small-scale dataset into reasoning mode prototypes, we compute a steering vector to guide LRMs’ reasoning trajectories. A dynamic control function modulates this vector’s strength and direction based on real-time confidence, pruning redundancy during overthinking, and promoting exploration during underthinking. Extensive experiments conducted on four models ranging from 0.5B to 32B, and across nine benchmarks in math reasoning, general question answering, and coding tasks demonstrate that ReBalance effectively reduces output redundancy while improving accuracy, offering a general, training-free, and plug-and-play strategy for efficient and robust LRM deployment. Project page and code are available at https://rebalance-ai.github.io .
[403] A Comprehensive Graph Pooling Benchmark: Effectiveness, Robustness and Generalizability
Pengyun Wang, Junyu Luo, Yanxin Shen, Ming Zhang, Shaoen Qin, Hanwen Xing, Siyu Heng, Xiao Luo
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to data fetch failureMethod: Unable to determine method due to data fetch failure
Result: Unable to determine results due to data fetch failure
Conclusion: Unable to draw conclusions due to data fetch failure
Abstract: Failed to fetch summary for 2406.09031: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2406.09031&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[404] Set Contribution Functions for Quantitative Bipolar Argumentation and their Principles
Filip Naudot, Andreas Brännström, Vicenç Torra, Timotheus Kampik
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2509.14963: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.14963&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[405] Interpretable Classification via a Rule Network with Selective Logical Operators
Bowen Wei, Ziwei Zhu
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2408.11918: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2408.11918&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[406] Align to Misalign: Automatic LLM Jailbreak with Meta-Optimized LLM Judges
Hamin Koo, Minseon Kim, Jaehyung Kim
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to failed paper fetchMethod: Cannot determine method due to failed paper fetch
Result: Cannot determine results due to failed paper fetch
Conclusion: Cannot determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2511.01375: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.01375&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[407] Modeling Multi-Objective Tradeoffs with Monotonic Utility Functions
Edward Chen, Natalie Dullerud, Thomas Niedermayr, Elizabeth Kidd, Ransalu Senanayake, Pang Wei Koh, Sanmi Koyejo, Carlos Guestrin
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: The user requested analysis of paper 2412.06154, but the arXiv API returned a rate limiting error preventing access to the abstractMethod: Attempted to fetch paper metadata from arXiv API using standard query parameters
Result: HTTP 429 error received, indicating too many requests to the arXiv API within a short timeframe
Conclusion: Cannot analyze the paper without access to its abstract; need to retry later or use alternative methods to obtain the paper information
Abstract: Failed to fetch summary for 2412.06154: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2412.06154&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[408] Fast dynamical similarity analysis
Arman Behrad, Mitchell Ostrow, Mohammad Taha Fakharian, Ila Fiete, Christian Beste, Shervin Safavi
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: N/A - Paper content not accessibleMethod: N/A - Paper content not accessible
Result: N/A - Paper content not accessible
Conclusion: N/A - Paper content not accessible
Abstract: Failed to fetch summary for 2511.22828: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.22828&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[409] V-OCBF: Learning Safety Filters from Offline Data via Value-Guided Offline Control Barrier Functions
Mumuksh Tayal, Manan Tayal, Aditya Singh, Shishir Kolathaya, Ravi Prakash
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot determine conclusion as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2512.10822: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.10822&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[410] The Illusion of AI Expertise Under Uncertainty: Navigating Elusive Ground Truth via a Probabilistic Paradigm
Aparna Elangovan, Lei Xu, Mahsa Elyasi, Ismail Akdulum, Mehmet Aksakal, Enes Gurun, Brian Hur, Saab Mansour, Ravid Shwartz Ziv, Karin Verspoor, Dan Roth
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2601.05500: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.05500&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[411] M3-BENCH: Process-Aware Evaluation of LLM Agents’ Social Behaviors in Mixed-Motive Games
Sixiong Xie, Zhuofan Shi, Haiyang Shen, Yun Ma, Xiang Jing
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2601.08462: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.08462&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[412] Predicting LLM Output Length via Entropy-Guided Representations
Huanyi Xie, Yubin Chen, Liangyu Wang, Lijie Hu, Di Wang
Main category: cs.AI
TL;DR: Unable to analyze paper 2602.11812 due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusions as paper content could not be retrieved
Abstract: Failed to fetch summary for 2602.11812: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.11812&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[413] Agentified Assessment of Logical Reasoning Agents
Zhiyu Ni, Yifeng Xiao, Zheng Liang
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2603.02788: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.02788&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[414] MIRAGE: The Illusion of Visual Understanding
Mohammad Asadi, Jack W. O’Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Rajabalifardi, Fei-Fei Li, Ehsan Adeli, Euan Ashley
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2603.21687: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.21687&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[415] Monotone and Conservative Policy Iteration Beyond the Tabular Case
S.R. Eshwar, Gugan Thoppe, Ananyabrata Barua, Aditya Gopalan, Gal Dalal
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation due to API access failureMethod: Unable to determine method due to API access failure
Result: Unable to determine results due to API access failure
Conclusion: Unable to draw conclusions due to API access failure
Abstract: Failed to fetch summary for 2506.07134: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.07134&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[416] Stabilizing Rubric Integration Training via Decoupled Advantage Normalization
Zelin Tan, Zhouliang Yu, Bohan Lin, Zijie Geng, Hejia Geng, Yudong Zhang, Mulei Zhang, Yang Chen, Shuyue Hu, Zhenfei Yin, Chen Zhang, Lei Bai
Main category: cs.AI
TL;DR: Paper ID 2603.26535 - Unable to fetch abstract due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to missing abstractMethod: Cannot determine method due to missing abstract
Result: Cannot determine results due to missing abstract
Conclusion: Cannot determine conclusion due to missing abstract
Abstract: Failed to fetch summary for 2603.26535: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.26535&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[417] MonitorBench: A Comprehensive Benchmark for Chain-of-Thought Monitorability in Large Language Models
Han Wang, Yifan Sun, Brian Ko, Mann Talati, Jiawen Gong, Zimeng Li, Naicheng Yu, Xucheng Yu, Wei Shen, Vedant Jolly, Huan Zhang
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access restrictionsMethod: Unable to determine method due to access restrictions
Result: Unable to determine results due to access restrictions
Conclusion: Unable to determine conclusion due to access restrictions
Abstract: Failed to fetch summary for 2603.28590: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.28590&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[418] Nomad: Autonomous Exploration and Discovery
Bokang Jia, Samta Kamboj, Satheesh Katipomu, Seung Hun Han, Neha Sengupta, Andrew Jackson
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error in fetching paper contentMethod: Unable to determine method due to technical error in fetching paper content
Result: Unable to determine results due to technical error in fetching paper content
Conclusion: Unable to draw conclusions due to technical error in fetching paper content
Abstract: Failed to fetch summary for 2603.29353: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.29353&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[419] ELT-Bench-Verified: Benchmark Quality Issues Underestimate AI Agent Capabilities
Christopher Zanoli, Andrea Giovannini, Tengjun Jin, Ana Klimovic, Yotam Perlitz
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2603.29399: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.29399&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[420] Meta-Learning at Scale for Large Language Models via Low-Rank Amortized Bayesian Meta-Learning
Liyi Zhang, Jake Snell, Thomas L. Griffiths
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to unavailability of paper contentMethod: Cannot determine method due to unavailability of paper content
Result: Cannot determine results due to unavailability of paper content
Conclusion: Cannot draw conclusions due to unavailability of paper content
Abstract: Failed to fetch summary for 2508.14285: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.14285&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[421] Cognitive Friction: A Decision-Theoretic Framework for Bounded Deliberation in Tool-Using Agents
Davide Di Gioia
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error fetching paper contentMethod: Unable to determine method due to technical error fetching paper content
Result: Unable to determine results due to technical error fetching paper content
Conclusion: Unable to determine conclusion due to technical error fetching paper content
Abstract: Failed to fetch summary for 2603.30031: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.30031&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[422] TANDEM: Temporal Attention-guided Neural Differential Equations for Missingness in Time Series Classification
YongKyung Oh, Dong-Young Lim, Sungil Kim, Alex Bui
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to missing paper contentMethod: Unable to determine method due to missing paper content
Result: Unable to determine results due to missing paper content
Conclusion: Unable to determine conclusion due to missing paper content
Abstract: Failed to fetch summary for 2508.17519: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.17519&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[423] The Silicon Mirror: Dynamic Behavioral Gating for Anti-Sycophancy in LLM Agents
Harshee Jignesh Shah
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2604.00478: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.00478&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[424] PsychAgent: An Experience-Driven Lifelong Learning Agent for Self-Evolving Psychological Counselor
Yutao Yang, Junsong Li, Qianjun Pan, Jie Zhou, Kai Chen, Qin Chen, Jingyuan Zhao, Ningning Zhou, Xin Li, Liang He
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2604.00931: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.00931&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[425] Omni-SimpleMem: Autoresearch-Guided Discovery of Lifelong Multimodal Agent Memory
Jiaqi Liu, Zipeng Ling, Shi Qiu, Yanqing Liu, Siwei Han, Peng Xia, Haoqin Tu, Zeyu Zheng, Cihang Xie, Charles Fleming, Mingyu Ding, Huaxiu Yao
Main category: cs.AI
TL;DR: Unable to analyze paper 2604.01007 due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusions as paper content could not be retrieved
Abstract: Failed to fetch summary for 2604.01007: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.01007&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[426] Therefore I am. I Think
Esakkivel Esakkiraja, Sai Rajeswar, Denis Akhiyarov, Rajagopal Venkatesaramani
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) when querying arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2604.01202: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.01202&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[427] Unveiling m-Sharpness Through the Structure of Stochastic Gradient Noise
Haocheng Luo, Mehrtash Harandi, Dinh Phung, Trung Le
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2509.18001 suggests it’s from September 2025, but content is unavailable.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot draw conclusions without access to the paper content.
Abstract: Failed to fetch summary for 2509.18001: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.18001&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[428] Human Misperception of Generative-AI Alignment: A Laboratory Experiment
Kevin He, Ran Shorrer, Mengjia Xia
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2502.14708 appears to be from February 2025, suggesting it’s a recent publication.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot determine conclusion without access to the paper content.
Abstract: Failed to fetch summary for 2502.14708: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.14708&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[429] TRACE: Transparent Web Reliability Assessment with Contextual Explanations
Joydeep Chandra, Aleksandr Algazinov, Satyam Kumar Navneet, Rim El Filali, Matt Laing, Andrew Hanna, Yong Zhang
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper retrievalMethod: Unable to determine method due to failed paper retrieval
Result: Unable to determine results due to failed paper retrieval
Conclusion: Unable to analyze paper due to technical retrieval error
Abstract: Failed to fetch summary for 2506.12072: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.12072&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[430] DIVER: A Multi-Stage Approach for Reasoning-intensive Information Retrieval
Duolin Sun, Meixiu Long, Dan Yang, Junjie Wang, Yecheng Luo, Yue Shen, Jian Wang, Hualei Zhou, Chunxiao Guo, Peng Wei, Jiahai Wang, Jinjie Gu
Main category: cs.AI
TL;DR: Unable to analyze paper 2508.07995 due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusions about paper content due to retrieval failure
Abstract: Failed to fetch summary for 2508.07995: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.07995&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[431] PRISM: Robust VLM Alignment with Principled Reasoning for Integrated Safety in Multimodality
Nanxi Li, Zhengyue Zhao, G. Edward Suh, Marco Pavone, Chaowei Xiao
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation due to API access limitationsMethod: Cannot determine method due to API access limitations
Result: Cannot determine results due to API access limitations
Conclusion: Cannot determine conclusion due to API access limitations
Abstract: Failed to fetch summary for 2508.18649: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.18649&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[432] Prior Knowledge Makes It Possible: From Sublinear Graph Algorithms to LLM Test-Time Methods
Avrim Blum, Daniel Hsu, Cyrus Rashtchian, Donya Saless
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2510.16609: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.16609&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[433] A Pressure-Based Diffusion Model for Influence Maximization on Social Networks
Curt Stutsman, Eliot W. Robson, Abhishek K. Umrawal
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2509.12822: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.12822&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[434] AutoPK: Leveraging LLMs and a Hybrid Similarity Metric for Advanced Retrieval of Pharmacokinetic Data from Complex Tables and Documents
Hossein Sholehrasa, Amirhossein Ghanaatian, Doina Caragea, Lisa A. Tell, Jim E. Riviere, Majid Jaberi-Douraki
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: Cannot analyze method as paper content is unavailable due to HTTP 429 error
Result: No results available due to technical issue with arXiv API access
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2510.00039: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.00039&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[435] Automating Android Build Repair: Bridging the Reasoning-Execution Gap in LLM Agents with Domain-Specific Tools
Ha Min Son, Huan Ren, Xin Liu, Zhe Zhao
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). The paper ID 2510.08640 appears to be an arXiv identifier, but the content could not be retrieved.
Details
Motivation: Cannot determine motivation without access to the paper content.Method: Cannot determine method without access to the paper content.
Result: Cannot determine results without access to the paper content.
Conclusion: Cannot draw conclusions without access to the paper content.
Abstract: Failed to fetch summary for 2510.08640: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.08640&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[436] BIOGEN: Evidence-Grounded Multi-Agent Reasoning Framework for Transcriptomic Interpretation in Antimicrobial Resistance
Elias Hossain, Mehrdad Shoeibi, Ivan Garibay, Niloofar Yousefi
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2510.16082: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.16082&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[437] SentinelNet: Safeguarding Multi-Agent Collaboration Through Credit-Based Dynamic Threat Detection
Yang Feng, Xudong Pan
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.16219: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.16219&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[438] Bridging the Divide: End-to-End Sequence-Graph Learning
Yuen Chen, Yulun Wu, Samuel Sharpe, Igor Melnyk, Nam H. Nguyen, Furong Huang, C. Bayan Bruss, Rizal Fathony
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.25126: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.25126&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[439] HAFixAgent: History-Aware Program Repair Agent
Yu Shi, Hao Li, Bram Adams, Ahmed E. Hassan
Main category: cs.AI
TL;DR: Paper 2511.01047 could not be fetched due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2511.01047: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.01047&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[440] A Self-Improving Architecture for Dynamic Safety in Large Language Models
Tyler Slater
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to missing paper contentMethod: Unable to determine method due to missing paper content
Result: Unable to determine results due to missing paper content
Conclusion: Unable to determine conclusion due to missing paper content
Abstract: Failed to fetch summary for 2511.07645: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.07645&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[441] FlowPath: Learning Data-Driven Manifolds with Invertible Flows for Robust Irregularly-sampled Time Series Classification
YongKyung Oh, Dong-Young Lim, Sungil Kim
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2511.10841: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.10841&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[442] Labels Matter More Than Models: Rethinking the Unsupervised Paradigm in Time Series Anomaly Detection
Zhijie Zhong, Zhiwen Yu, Kaixiang Yang, Yongheng Liu, Jun Jiang, C. L. Philip Chen
Main category: cs.AI
TL;DR: Paper 2511.16145: Could not fetch summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access restrictionsMethod: Unable to determine method due to access restrictions
Result: Unable to determine results due to access restrictions
Conclusion: Unable to determine conclusion due to access restrictions
Abstract: Failed to fetch summary for 2511.16145: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.16145&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[443] MultiGA: Leveraging Multi-Source Seeding in Genetic Algorithms
Isabelle Diana May-Xin Ng, Tharindu Cyril Weerasooriya, Haitao Zhu, Wei Wei
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper detailsMethod: Unknown - paper content not accessible due to HTTP 429 error from arXiv API
Result: No results available - technical issue with accessing the paper
Conclusion: Cannot analyze paper due to API rate limiting preventing access to content
Abstract: Failed to fetch summary for 2512.04097: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.04097&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[444] UniMark: Artificial Intelligence Generated Content Identification Toolkit
Meilin Li, Ji He, Yi Yu, Jia Xu, Shanzhe Lei, Yan Teng, Yingchun Wang, Xuhong Wang
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error fetching paper contentMethod: Unable to determine method due to technical error fetching paper content
Result: Unable to determine results due to technical error fetching paper content
Conclusion: Unable to determine conclusion due to technical error fetching paper content
Abstract: Failed to fetch summary for 2512.12324: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.12324&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[445] WFR-FM: Simulation-Free Dynamic Unbalanced Optimal Transport
Qiangwei Peng, Zihan Wang, Junda Ying, Yuhao Sun, Qing Nie, Lei Zhang, Tiejun Li, Peijie Zhou
Main category: cs.AI
TL;DR: Unable to analyze paper 2601.06810 due to HTTP 429 error when fetching summary from arXiv API
Details
Motivation: Cannot determine motivation without access to paper abstract or contentMethod: Cannot determine method without access to paper abstract or content
Result: Cannot determine results without access to paper abstract or content
Conclusion: Cannot draw conclusions without access to paper abstract or content
Abstract: Failed to fetch summary for 2601.06810: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.06810&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[446] Unified Optimization of Source Weights and Transfer Quantities in Multi-Source Transfer Learning: An Asymptotic Framework
Qingyue Zhang, Chang Chu, Haohao Fu, Tianren Peng, Yanru Wu, Guanbo Huang, Yang Li, Shao-Lun Huang
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2601.10779: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.10779&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[447] Contextual Distributionally Robust Optimization with Causal and Continuous Structure: An Interpretable and Tractable Approach
Fenglin Zhang, Jie Wang
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without paper contentMethod: Cannot determine method without paper content
Result: Cannot determine results without paper content
Conclusion: Cannot determine conclusion without paper content
Abstract: Failed to fetch summary for 2601.11016: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.11016&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[448] Finite-Time Analysis of Gradient Descent for Shallow Transformers
Enes Arda, Semih Cayci, Atilla Eryilmaz
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2601.16514: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.16514&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[449] Learning Contextual Runtime Monitors for Safe AI-Based Autonomy
Alejandro Luque-Cerpa, Mengyuan Wang, Emil Carlsson, Sanjit A. Seshia, Devdatt Dubhashi, Hazem Torfah
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2601.20666: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.20666&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[450] PBLean: Pseudo-Boolean Proof Certificates for Lean 4
Stefan Szeider
Main category: cs.AI
TL;DR: Unable to analyze paper 2602.08692 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract is unavailableMethod: Cannot determine method as abstract is unavailable
Result: Cannot determine results as abstract is unavailable
Conclusion: Cannot determine conclusion as abstract is unavailable
Abstract: Failed to fetch summary for 2602.08692: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.08692&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[451] TaCarla: A comprehensive benchmarking dataset for end-to-end autonomous driving
Tugrul Gorgulu, Atakan Dag, M. Esat Kalfaoglu, Halil Ibrahim Kuru, Baris Can Cam, Halil Ibrahim Ozturk, Ozsel Kilinc
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2602.23499: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.23499&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[452] What Capable Agents Must Know: Selection Theorems for Robust Decision-Making under Uncertainty
Aran Nayebi
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2603.02491: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.02491&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[453] NCCL EP: Towards a Unified Expert Parallel Communication API for NCCL
Amos Goldman, Nimrod Boker, Maayan Sheraizin, Nimrod Admoni, Artem Polyakov, Subhadeep Bhattacharya, Fan Yu, Kai Sun, Georgios Theodorakis, Hsin-Chun Yin, Peter-Jan Gootzen, Aamir Shafi, Assaf Ravid, Salvatore Di Girolamo, James Dinan, Xiaofan Li, Manjunath Gorentla Venkata, Gil Bloch
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to missing paper contentMethod: Cannot determine method due to missing paper content
Result: Cannot determine results due to missing paper content
Conclusion: Cannot determine conclusion due to missing paper content
Abstract: Failed to fetch summary for 2603.13606: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.13606&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[454] 100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models
Yeounoh Chung, Rushabh Desai, Jian He, Yu Xiao, Thibaud Hottelier, Yves-Laurent Kom Samo, Pushkar Khadilkar, Xianshun Chen, Sam Idicula, Fatma Özcan, Alon Halevy, Yannis Papakonstantinou
Main category: cs.AI
TL;DR: Paper ID 2603.15970 - Unable to fetch abstract due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation due to missing abstract contentMethod: Cannot determine method due to missing abstract content
Result: Cannot determine results due to missing abstract content
Conclusion: Cannot determine conclusion due to missing abstract content
Abstract: Failed to fetch summary for 2603.15970: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.15970&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[455] When Should a Robot Think? Resource-Aware Reasoning via Reinforcement Learning for Embodied Robotic Decision-Making
Jun Liu, Pu Zhao, Zhenglun Kong, Xuan Shen, Peiyan Dong, Fan Yang, Lin Cui, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Xue Lin, Gaowen Liu, Yanzhi Wang, Dong Huang
Main category: cs.AI
TL;DR: Unable to analyze paper 2603.16673 due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Cannot determine motivation due to inability to access paper contentMethod: Cannot determine method due to inability to access paper content
Result: Cannot determine results due to inability to access paper content
Conclusion: Cannot draw conclusions due to inability to access paper content
Abstract: Failed to fetch summary for 2603.16673: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.16673&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[456] Machine Learning for Network Attacks Classification and Statistical Evaluation of Adversarial Learning Methodologies for Synthetic Data Generation
Iakovos-Christos Zarkadis, Christos Douligeris
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper detailsMethod: Cannot analyze method as paper content is unavailable
Result: No results available due to technical limitations in accessing the paper
Conclusion: Cannot provide analysis due to HTTP 429 error when attempting to fetch paper information
Abstract: Failed to fetch summary for 2603.17717: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.17717&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[457] Adaptive Regime-Aware Stock Price Prediction Using Autoencoder-Gated Dual Node Transformers with Reinforcement Learning Control
Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2603.19136: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.19136&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[458] Democratizing AI: A Comparative Study in Deep Learning Efficiency and Future Trends in Computational Processing
Lisan Al Amin, Md Ismail Hossain, Rupak Kumar Das, Mahbubul Islam, Abdulaziz Tabbakh
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2603.20920: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.20920&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[459] DVM: A Bytecode Virtual Machine Approach for Dynamic Tensor Computation
Jingzhi Fang, Xiong Gao, Renwei Zhang, Zichun Ye, Lei Chen, Jie Zhao, Chengnuo Huang, Hui Xu, Xuefeng Jin
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2603.24239: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.24239&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[460] Unsupervised Behavioral Compression: Learning Low-Dimensional Policy Manifolds through State-Occupancy Matching
Andrea Fraschini, Davide Tenedini, Riccardo Zamboni, Mirco Mutti, Marcello Restelli
Main category: cs.AI
TL;DR: Failed to fetch summary for paper 2603.27044 due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to analyze motivation due to technical error in fetching paper informationMethod: Unable to analyze method due to technical error in fetching paper information
Result: Unable to analyze results due to technical error in fetching paper information
Conclusion: Unable to draw conclusions due to technical error in fetching paper information
Abstract: Failed to fetch summary for 2603.27044: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.27044&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[461] An Object Web Seminar: A Retrospective on a Technical Dialogue Still Reverberating
James J. Cusick
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2603.26203: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.26203&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[462] Cross-attentive Cohesive Subgraph Embedding to Mitigate Oversquashing in GNNs
Tanvir Hossain, Muhammad Ifte Khairul Islam, Lilia Chebbah, Charles Fanning, Esra Akbas
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to data fetch failureMethod: Unable to determine method due to data fetch failure
Result: Unable to determine results due to data fetch failure
Conclusion: Unable to determine conclusion due to data fetch failure
Abstract: Failed to fetch summary for 2603.27529: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.27529&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[463] D-SPEAR: Dual-Stream Prioritized Experience Adaptive Replay for Stable Reinforcement Learning in Robotic Manipulation
Yu Zhang, Karl Mason
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2603.27346: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.27346&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[464] Information-Theoretic Limits of Safety Verification for Self-Improving Systems
Arsenios Scrivens
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: Unable to determine method due to API rate limiting preventing access to paper content
Result: Unable to determine results due to API rate limiting preventing access to paper content
Conclusion: Unable to determine conclusion due to API rate limiting preventing access to paper content
Abstract: Failed to fetch summary for 2603.28650: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.28650&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[465] Improving Ensemble Forecasts of Abnormally Deflecting Tropical Cyclones with Fused Atmosphere-Ocean-Terrain Data
Qixiang Li, Yuan Zhou, Shuwei Huo, Chong Wang, Xiaofeng Li
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2603.29200: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.29200&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[466] Learning to Play Blackjack: A Curriculum Learning Perspective
Amirreza Alasti, Efe Erdal, Yücel Celik, Theresa Eimer
Main category: cs.AI
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2604.00076: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.00076&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[467] Four Generations of Quantum Biomedical Sensors
Xin Jin, Priyam Srivastava, Ronghe Wang, Yuqing Li, Jonathan Beaumariage, Tom Purdy, M. V. Gurudev Dutt, Kang Kim, Kaushik Seshadreesan, Junyu Liu
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2603.29944: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.29944&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[468] Deep Networks Favor Simple Data
Weyl Lu, Chenjie Hao, Yubei Chen
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation due to API access failureMethod: Unable to determine method due to API access failure
Result: Unable to determine results due to API access failure
Conclusion: Unable to determine conclusion due to API access failure
Abstract: Failed to fetch summary for 2604.00394: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.00394&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[469] Hybrid Energy-Based Models for Physical AI: Provably Stable Identification of Port-Hamiltonian Dynamics
Simone Betteti, Luca Laurenti
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access limitationsMethod: Unable to determine method due to access limitations
Result: Unable to determine results due to access limitations
Conclusion: Unable to draw conclusions due to access limitations
Abstract: Failed to fetch summary for 2604.00277: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.00277&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[470] UniMixer: A Unified Architecture for Scaling Laws in Recommendation Systems
Mingming Ha, Guanchen Wang, Linxun Chen, Xuan Rao, Yuexin Shi, Tianbao Ma, Zhaojie Liu, Yunqian Fan, Zilong Lu, Yanan Niu, Han Li, Kun Gai
Main category: cs.AI
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2604.00590: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.00590&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[471] Learning to Learn-at-Test-Time: Language Agents with Learnable Adaptation Policies
Zhanzhi Lou, Hui Chen, Yibo Li, Qian Wang, Bryan Hooi
Main category: cs.AI
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting). Paper ID 2604.00830 appears to be a recent arXiv submission.
Details
Motivation: Cannot determine motivation without access to paper content.Method: Cannot determine method without access to paper content.
Result: Cannot determine results without access to paper content.
Conclusion: Cannot draw conclusions without access to paper content.
Abstract: Failed to fetch summary for 2604.00830: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.00830&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.SD
[472] Combining Masked Language Modeling and Cross-Modal Contrastive Learning for Prosody-Aware TTS
Kirill Borodin, Vasiliy Kudryavtsev, Maxim Maslov, Nikita Vasiliev, Mikhail Gorodnichev, Grach Mkrtchian
Main category: cs.SD
TL;DR: Multi-stage pretraining for prosody modeling in diffusion-based TTS using speaker-conditioned dual-stream encoder with MLM + contrastive learning achieves best synthesis quality, while same-phoneme refinement improves prosodic retrieval but degrades synthesis performance.
Details
Motivation: The paper aims to improve prosody modeling in text-to-speech systems by investigating multi-stage pretraining approaches for diffusion-based TTS, recognizing that better embedding-space metrics don't always translate to improved generative performance.Method: Uses speaker-conditioned dual-stream encoder trained with: 1) masked language modeling (MLM), 2) SigLIP-style cross-modal contrastive learning with mixed-phoneme batches, and 3) optional same-phoneme refinement stage. Evaluated on text-audio retrieval and downstream synthesis in Grad-TTS and latent diffusion TTS systems.
Result: Two-stage curriculum (MLM + mixed-phoneme contrastive learning) achieves best overall synthesis quality in intelligibility, speaker similarity, and perceptual measures. Same-phoneme refinement improves prosodic retrieval but reduces phoneme discrimination and degrades synthesis performance.
Conclusion: Improvements in embedding-space metrics don’t guarantee better generative performance; there’s a need to balance phoneme discrimination and prosodic sensitivity in TTS pretraining. The two-stage approach provides optimal balance for diffusion-based TTS.
Abstract: We investigate multi-stage pretraining for prosody modeling in diffusion-based TTS. A speaker-conditioned dual-stream encoder is trained with masked language modeling followed by SigLIP-style cross-modal contrastive learning using mixed-phoneme batches, with an additional same-phoneme refinement stage studied separately. We evaluate intrinsic text-audio retrieval and downstream synthesis in Grad-TTS and a latent diffusion TTS system. The two-stage curriculum (MLM + mixed-phoneme contrastive learning) achieves the best overall synthesis quality in terms of intelligibility, speaker similarity, and perceptual measures. Although same-phoneme refinement improves prosodic retrieval, it reduces phoneme discrimination and degrades synthesis. These findings indicate that improvements in embedding-space metrics do not necessarily translate to better generative performance and highlight the need to balance phoneme discrimination and prosodic sensitivity in TTS pretraining.
[473] Evolutionary Multi-Objective Fusion of Deepfake Speech Detectors
Vojtěch Staněk, Martin Perešíni, Lukáš Sekanina, Anton Firc, Kamil Malinka
Main category: cs.SD
TL;DR: Evolutionary multi-objective score fusion framework for deepfake speech detection that optimizes both accuracy and system complexity, outperforming standard ensemble methods while reducing parameters by half.
Details
Motivation: Standard ensemble fusion for deepfake speech detectors using large SSL models creates oversized systems with diminishing returns. There's a need to balance detection accuracy with computational efficiency for practical deployment.Method: Proposes an evolutionary multi-objective score fusion framework using NSGA-II with two encodings: binary-coded detector selection for score averaging, and real-valued scheme optimizing detector weights for weighted sum. Evaluated on ASVspoof 5 dataset with 36 SSL-based detectors.
Result: Achieves 2.37% EER (0.0684 minDCF), matching state-of-the-art performance while reducing system complexity by half (only half the parameters). Pareto fronts outperform simple averaging and logistic regression baselines.
Conclusion: The framework provides diverse trade-off solutions balancing accuracy and computational cost, enabling practical deployment choices for deepfake speech detection systems.
Abstract: While deepfake speech detectors built on large self-supervised learning (SSL) models achieve high accuracy, employing standard ensemble fusion to further enhance robustness often results in oversized systems with diminishing returns. To address this, we propose an evolutionary multi-objective score fusion framework that jointly minimizes detection error and system complexity. We explore two encodings optimized by NSGA-II: binary-coded detector selection for score averaging and a real-valued scheme that optimizes detector weights for a weighted sum. Experiments on the ASVspoof 5 dataset with 36 SSL-based detectors show that the obtained Pareto fronts outperform simple averaging and logistic regression baselines. The real-valued variant achieves 2.37% EER (0.0684 minDCF) and identifies configurations that match state-of-the-art performance while significantly reducing system complexity, requiring only half the parameters. Our method also provides a diverse set of trade-off solutions, enabling deployment choices that balance accuracy and computational cost.
[474] Acoustic and perceptual differences between standard and accented Chinese speech and their voice clones
Tianle Yang, Chengzhe Sun, Phil Rose, Siwei Lyu
Main category: cs.SD
TL;DR: Voice cloning systems show different perceptual effects for accented vs standard Mandarin speech, with clones rated more similar to originals for standard speakers and intelligibility improving more for accented speech clones.
Details
Motivation: While voice cloning is typically evaluated for overall quality, little is known about how accent preservation affects perceptual outcomes and speaker identity preservation in cloned speech.Method: Combined computational and perceptual design comparing standard and heavily accented Mandarin speech and their voice clones, using embedding-based analyses and perception studies with similarity and intelligibility ratings.
Result: Embedding analyses show no reliable accent differences in original-clone distances, but perceptual studies reveal clones are rated more similar to originals for standard speakers, and intelligibility increases more for accented speech clones.
Conclusion: Accent variation shapes perceived identity match and intelligibility in voice cloning independently of speaker-embedding distances, suggesting speaker identity and accent preservation should be evaluated as separate dimensions.
Abstract: Voice cloning is often evaluated in terms of overall quality, but less is known about accent preservation and its perceptual consequences. We compare standard and heavily accented Mandarin speech and their voice clones using a combined computational and perceptual design. Embedding-based analyses show no reliable accented-standard difference in original-clone distances across systems. In the perception study, clones are rated as more similar to their originals for standard than for accented speakers, and intelligibility increases from original to clone, with a larger gain for accented speech. These results show that accent variation can shape perceived identity match and intelligibility in voice cloning even when it is not reflected in an off-the-shelf speaker-embedding distance, and they motivate evaluating speaker identity preservation and accent preservation as separable dimensions.
[475] FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection
Chengyou Wang, Hongfei Xue, Chunjiang He, Jingbin Hu, Shuiyuan Wang, Bo Wu, Yuyu Ji, Jimeng Zheng, Ruofei Chen, Zhou Zhu, Lei Xie
Main category: cs.SD
TL;DR: FastTurn: A unified framework for low-latency turn detection in full-duplex spoken dialogue systems using streaming CTC decoding combined with acoustic features.
Details
Motivation: Existing full-duplex approaches for spoken dialogue systems either rely on voice activity detection (lacking semantic understanding) or ASR-based modules (introducing latency and degrading under overlapping speech/noise). Available datasets also rarely capture realistic interaction dynamics.Method: FastTurn combines streaming CTC decoding with acoustic features to enable early decisions from partial observations while preserving semantic cues. This allows for low-latency turn detection in real-time full-duplex communication.
Result: FastTurn achieves higher decision accuracy with lower interruption latency than representative baselines and remains robust under challenging acoustic conditions including overlapping speech, backchannels, pauses, pitch variation, and environmental noise.
Conclusion: FastTurn demonstrates effectiveness for practical full-duplex dialogue systems by providing a unified framework for low-latency and robust turn detection, with a released test set based on real human dialogue capturing authentic interaction dynamics.
Abstract: Recent advances in AudioLLMs have enabled spoken dialogue systems to move beyond turn-based interaction toward real-time full-duplex communication, where the agent must decide when to speak, yield, or interrupt while the user is still talking. Existing full-duplex approaches either rely on voice activity cues, which lack semantic understanding, or on ASR-based modules, which introduce latency and degrade under overlapping speech and noise. Moreover, available datasets rarely capture realistic interaction dynamics, limiting evaluation and deployment. To mitigate the problem, we propose \textbf{FastTurn}, a unified framework for low-latency and robust turn detection. To advance latency while maintaining performance, FastTurn combines streaming CTC decoding with acoustic features, enabling early decisions from partial observations while preserving semantic cues. We also release a test set based on real human dialogue, capturing authentic turn transitions, overlapping speech, backchannels, pauses, pitch variation, and environmental noise. Experiments show FastTurn achieves higher decision accuracy with lower interruption latency than representative baselines and remains robust under challenging acoustic conditions, demonstrating its effectiveness for practical full-duplex dialogue systems.
[476] Woosh: A Sound Effects Foundation Model
Gaëtan Hadjeres, Marc Ferras, Khaled Koutini, Benno Weck, Alexandre Bittar, Thomas Hummel, Zineb Lahrici, Hakim Missoum, Joan Serrà, Yuki Mitsufuji
Main category: cs.SD
TL;DR: Woosh is Sony AI’s open-source sound effect foundation model with audio encoder/decoder, text-audio alignment, text-to-audio, and video-to-audio generation capabilities, offering competitive performance against existing open models.
Details
Motivation: The audio research community needs open generative models as foundational tools for building novel approaches and establishing baselines, particularly for sound effects.Method: Developed a comprehensive sound effect foundation model with four key components: (1) high-quality audio encoder/decoder model, (2) text-audio alignment model for conditioning, (3) text-to-audio generative model, and (4) video-to-audio generative model, plus distilled versions for low-resource operation.
Result: Evaluation on both public and private data shows competitive or better performance compared to existing open alternatives like StableAudio-Open and TangoFlux across all modules.
Conclusion: Woosh provides a comprehensive open-source foundation model for sound effects that advances the field by offering multiple generative capabilities with competitive performance and practical deployment options.
Abstract: The audio research community depends on open generative models as foundational tools for building novel approaches and establishing baselines. In this report, we present Woosh, Sony AI’s publicly released sound effect foundation model, detailing its architecture, training process, and an evaluation against other popular open models. Being optimized for sound effects, we provide (1) a high-quality audio encoder/decoder model and (2) a text-audio alignment model for conditioning, together with (3) text-to-audio and (4) video-to-audio generative models. Distilled text-to-audio and video-to-audio models are also included in the release, allowing for low-resource operation and fast inference. Our evaluation on both public and private data shows competitive or better performance for each module when compared to existing open alternatives like StableAudio-Open and TangoFlux. Inference code and model weights are available at https://github.com/SonyResearch/Woosh. Demo samples can be found at https://sonyresearch.github.io/Woosh/.
[477] PFluxTTS: Hybrid Flow-Matching TTS with Robust Cross-Lingual Voice Cloning and Inference-Time Model Fusion
Vikentii Pankov, Artem Gribul, Oktai Tatanov, Vladislav Proskurov, Yuliya Korotkova, Darima Mylzenova, Dmitrii Vypirailenko
Main category: cs.SD
TL;DR: PFluxTTS is a hybrid text-to-speech system that improves flow-matching TTS with better stability-naturalness trade-off, cross-lingual voice cloning, and audio quality through dual-decoder design, FLUX-based speaker embedding, and enhanced vocoder.
Details
Motivation: Address three key limitations in flow-matching TTS: 1) stability-naturalness trade-off, 2) weak cross-lingual voice cloning, and 3) limited audio quality from low-rate mel features.Method: 1) Dual-decoder design combining duration-guided and alignment-free models via inference-time vector-field fusion; 2) Robust cloning using speech-prompt embeddings in FLUX-based decoder; 3) Modified PeriodWave vocoder with super-resolution to 48 kHz.
Result: Outperforms F5-TTS, FishSpeech, and SparkTTS on cross-lingual in-the-wild data; matches ChatterBox in naturalness (MOS 4.11) with 23% lower WER (6.9% vs 9.0%); surpasses ElevenLabs in speaker similarity (+0.32 SMOS); robust in challenging scenarios.
Conclusion: PFluxTTS effectively addresses key limitations in flow-matching TTS, achieving state-of-the-art performance in cross-lingual voice cloning with high naturalness, speaker similarity, and robustness using only short reference audio and no extra training.
Abstract: We present PFluxTTS, a hybrid text-to-speech system addressing three gaps in flow-matching TTS: the stability-naturalness trade-off, weak cross-lingual voice cloning, and limited audio quality from low-rate mel features. Our contributions are: (1) a dual-decoder design combining duration-guided and alignment-free models through inference-time vector-field fusion; (2) robust cloning using a sequence of speech-prompt embeddings in a FLUX-based decoder, preserving speaker traits across languages without prompt transcripts; and (3) a modified PeriodWave vocoder with super-resolution to 48 kHz. On cross-lingual in-the-wild data, PFluxTTS clearly outperforms F5-TTS, FishSpeech, and SparkTTS, matches ChatterBox in naturalness (MOS 4.11) while achieving 23% lower WER (6.9% vs. 9.0%), and surpasses ElevenLabs in speaker similarity (+0.32 SMOS). The system remains robust in challenging scenarios where most open-source models fail, while requiring only short reference audio and no extra training. Audio demos are available at https://braskai.github.io/pfluxtts/
cs.LG
[478] DySCo: Dynamic Semantic Compression for Effective Long-term Time Series Forecasting
Xiang Ao, Yinyu Tan, Mengru Chen
Main category: cs.LG
TL;DR: DySCo is a dynamic semantic compression framework for time series forecasting that uses entropy-guided sampling and frequency decomposition to identify and retain important historical segments while compressing redundant trends.
Details
Motivation: Longer lookback windows in time series forecasting often introduce irrelevant noise and computational redundancy rather than providing useful historical context, preventing models from effectively capturing complex long-term dependencies.Method: Proposes DySCo framework with three components: 1) Entropy-Guided Dynamic Sampling (EGDS) to autonomously identify and retain high-entropy segments while compressing redundant trends, 2) Hierarchical Frequency-Enhanced Decomposition (HFED) to separate high-frequency anomalies from low-frequency patterns, and 3) Cross-Scale Interaction Mixer (CSIM) to dynamically fuse global contexts with local representations.
Result: DySCo serves as a universal plug-and-play module that significantly enhances mainstream models’ ability to capture long-term correlations while reducing computational cost.
Conclusion: The DySCo framework effectively addresses the challenges of long lookback windows in time series forecasting by dynamically compressing redundant information while preserving critical details, improving both performance and efficiency.
Abstract: Time series forecasting (TSF) is critical across domains such as finance, meteorology, and energy. While extending the lookback window theoretically provides richer historical context, in practice, it often introduces irrelevant noise and computational redundancy, preventing models from effectively capturing complex long-term dependencies. To address these challenges, we propose a Dynamic Semantic Compression (DySCo) framework. Unlike traditional methods that rely on fixed heuristics, DySCo introduces an Entropy-Guided Dynamic Sampling (EGDS) mechanism to autonomously identify and retain high-entropy segments while compressing redundant trends. Furthermore, we incorporate a Hierarchical Frequency-Enhanced Decomposition (HFED) strategy to separate high-frequency anomalies from low-frequency patterns, ensuring that critical details are preserved during sparse sampling. Finally, a Cross-Scale Interaction Mixer(CSIM) is designed to dynamically fuse global contexts with local representations, replacing simple linear aggregation. Experimental results demonstrate that DySCo serves as a universal plug-and-play module, significantly enhancing the ability of mainstream models to capture long-term correlations with reduced computational cost.
[479] Sven: Singular Value Descent as a Computationally Efficient Natural Gradient Method
Samuel Bright-Thonney, Thomas R. Harvey, Andre Lukas, Jesse Thaler
Main category: cs.LG
TL;DR: Sven is a new optimization algorithm that treats each data point’s loss as separate conditions, using pseudoinverse of loss Jacobian for parameter updates, achieving faster convergence than Adam while being more efficient than traditional natural gradient methods.
Details
Motivation: Current optimization methods like SGD and Adam reduce the full loss to a single scalar before computing parameter updates, which may not fully exploit the structure of loss functions that naturally decompose into sums over individual data points. Traditional natural gradient methods are computationally expensive, scaling quadratically with parameter count.Method: Sven treats each data point’s residual as separate conditions to be satisfied simultaneously. It uses the Moore-Penrose pseudoinverse of the loss Jacobian to find minimum-norm parameter updates that best satisfy all conditions. In practice, this pseudoinverse is approximated via truncated singular value decomposition, retaining only k most significant directions, keeping computational overhead to only factor k relative to SGD.
Result: On regression tasks, Sven significantly outperforms standard first-order methods including Adam, converging faster and to lower final loss. It remains competitive with LBFGS at a fraction of the wall-time cost. The method recovers natural gradient descent in the under-parametrized limit.
Conclusion: Sven provides an efficient optimization approach that generalizes natural gradient methods to over-parametrized regimes. While memory overhead presents a scaling challenge, the algorithm shows promise for scientific computing settings where custom loss functions decompose into several conditions.
Abstract: We introduce Sven (Singular Value dEsceNt), a new optimization algorithm for neural networks that exploits the natural decomposition of loss functions into a sum over individual data points, rather than reducing the full loss to a single scalar before computing a parameter update. Sven treats each data point’s residual as a separate condition to be satisfied simultaneously, using the Moore-Penrose pseudoinverse of the loss Jacobian to find the minimum-norm parameter update that best satisfies all conditions at once. In practice, this pseudoinverse is approximated via a truncated singular value decomposition, retaining only the $k$ most significant directions and incurring a computational overhead of only a factor of $k$ relative to stochastic gradient descent. This is in comparison to traditional natural gradient methods, which scale as the square of the number of parameters. We show that Sven can be understood as a natural gradient method generalized to the over-parametrized regime, recovering natural gradient descent in the under-parametrized limit. On regression tasks, Sven significantly outperforms standard first-order methods including Adam, converging faster and to a lower final loss, while remaining competitive with LBFGS at a fraction of the wall-time cost. We discuss the primary challenge to scaling, namely memory overhead, and propose mitigation strategies. Beyond standard machine learning benchmarks, we anticipate that Sven will find natural application in scientific computing settings where custom loss functions decompose into several conditions.
[480] Forecasting Supply Chain Disruptions with Foresight Learning
Benjamin Turtel, Paul Wilczewski, Kris Skotheim
Main category: cs.LG
TL;DR: LLM-based framework for probabilistic forecasting of supply chain disruptions using realized outcomes as supervision, outperforming baselines including GPT-5 on accuracy and calibration.
Details
Motivation: Anticipating supply chain disruptions is challenging due to infrequent, high-impact events and noisy unstructured inputs, where general-purpose models struggle without task-specific adaptation.Method: End-to-end framework that trains LLMs to produce calibrated probabilistic forecasts using realized disruption outcomes as supervision, inducing structured probabilistic reasoning without explicit prompting.
Result: The model substantially outperforms strong baselines including GPT-5 on accuracy, calibration, and precision, and training induces more structured and reliable probabilistic reasoning.
Conclusion: Provides a general pathway for training domain-specific forecasting models that produce decision-ready signals, with open-sourced evaluation dataset for transparency.
Abstract: Anticipating supply chain disruptions before they materialize is a core challenge for firms and policymakers alike. A key difficulty is learning to reason reliably about infrequent, high-impact events from noisy and unstructured inputs - a setting where general-purpose models struggle without task-specific adaptation. We introduce an end-to-end framework that trains LLMs to produce calibrated probabilistic forecasts using realized disruption outcomes as supervision. The resulting model substantially outperforms strong baselines - including GPT-5 - on accuracy, calibration, and precision. We also show that training induces more structured and reliable probabilistic reasoning without explicit prompting. These results suggest a general pathway for training domain-specific forecasting models that produce decision-ready signals. To support transparency we open-source the evaluation dataset used in this study. Dataset: https://huggingface.co/datasets/LightningRodLabs/supply-chain-predictions
[481] UQ-SHRED: uncertainty quantification of shallow recurrent decoder networks for sparse sensing via engression
Mars Liyao Gao, Yuxuan Bao, Amy S. Rude, Xinwei Shen, J. Nathan Kutz
Main category: cs.LG
TL;DR: UQ-SHRED extends SHRED with uncertainty quantification for sparse sensor reconstruction using distributional regression and noise injection.
Details
Motivation: Existing SHRED architecture lacks uncertainty quantification, which is crucial for complex, data-scarce, high-frequency, or stochastic systems where portions of spatiotemporal fields need to be modeled with valid uncertainty estimation.Method: UQ-SHRED uses a distributional learning framework with neural network-based distributional regression (engression). It injects stochastic noise into sensor inputs and trains with an energy score loss to learn predictive distributions of spatial states conditioned on sensor history, without requiring retraining or additional network structures.
Result: UQ-SHRED produces well-calibrated confidence intervals and distributional approximations on complex synthetic and real datasets including turbulent flow, atmospheric dynamics, neuroscience, and astrophysics, with minimal computational overhead.
Conclusion: UQ-SHRED provides effective uncertainty quantification for sparse sensing problems through a simple yet powerful distributional learning approach that extends the capabilities of existing state-of-the-art architectures.
Abstract: Reconstructing high-dimensional spatiotemporal fields from sparse sensor measurements is critical in a wide range of scientific applications. The SHallow REcurrent Decoder (SHRED) architecture is a recent state-of-the-art architecture that reconstructs high-quality spatial domain from hyper-sparse sensor measurement streams. An important limitation of SHRED is that in complex, data-scarce, high-frequency, or stochastic systems, portions of the spatiotemporal field must be modeled with valid uncertainty estimation. We introduce UQ-SHRED, a distributional learning framework for sparse sensing problems that provides uncertainty quantification through a neural network-based distributional regression called engression. UQ-SHRED models the uncertainty by learning the predictive distribution of the spatial state conditioned on the sensor history. By injecting stochastic noise into sensor inputs and training with an energy score loss, UQ-SHRED produces predictive distributions with minimal computational overhead, requiring only noise injection at the input and resampling through a single architecture without retraining or additional network structures. On complicated synthetic and real-life datasets including turbulent flow, atmospheric dynamics, neuroscience and astrophysics, UQ-SHRED provides a distributional approximation with well-calibrated confidence intervals. We further conduct ablation studies to understand how each model setting affects the quality of the UQ-SHRED performance, and its validity on uncertainty quantification over a set of different experimental setups.
[482] An Online Machine Learning Multi-resolution Optimization Framework for Energy System Design Limit of Performance Analysis
Oluwamayowa O. Amusat, Luka Grbcic, Remi Patureau, M. Jibran S. Zuberi, Dan Gunter, Michael Wetter
Main category: cs.LG
TL;DR: ML-accelerated multi-resolution optimization framework for integrated energy systems that estimates architecture-specific performance bounds while minimizing expensive high-fidelity evaluations
Details
Motivation: Model mismatch across different fidelity levels in energy system design obscures performance loss sources and complicates quantification of architecture-to-operation performance gaps, requiring better optimization approachesMethod: Online ML-accelerated multi-resolution optimization framework with ML-guided controller that adaptively schedules optimization resolution based on predictive uncertainty and warm-starts high-fidelity solves using elite low-fidelity solutions
Result: Reduces architecture-to-operation performance gap by up to 42% relative to rule-based controller while reducing required high-fidelity model evaluations by 34% relative to same approach without ML guidance
Conclusion: The framework enables tractable high-fidelity verification and provides practical upper bounds on achievable operational performance for integrated energy system design
Abstract: Designing reliable integrated energy systems for industrial processes requires optimization and verification models across multiple fidelities, from architecture-level sizing to high-fidelity dynamic operation. However, model mismatch across fidelities obscures the sources of performance loss and complicates the quantification of architecture-to-operation performance gaps. We propose an online, machine-learning-accelerated multi-resolution optimization framework that estimates an architecture-specific upper bound on achievable performance while minimizing expensive high-fidelity model evaluations. We demonstrate the approach on a pilot energy system supplying a 1 MW industrial heat load. First, we solve a multi-objective architecture optimization to select the system configuration and component capacities. We then develop an machine learning (ML)-accelerated multi-resolution, receding-horizon optimal control strategy that approaches the achievable-performance bound for the specified architecture, given the additional controls and dynamics not captured by the architectural optimization model. The ML-guided controller adaptively schedules the optimization resolution based on predictive uncertainty and warm-starts high-fidelity solves using elite low-fidelity solutions. Our results on the pilot case study show that the proposed multi-resolution strategy reduces the architecture-to-operation performance gap by up to 42% relative to a rule-based controller, while reducing required high-fidelity model evaluations by 34% relative to the same multi-fidelity approach without ML guidance, enabling faster and more reliable design verification. Together, these gains make high-fidelity verification tractable, providing a practical upper bound on achievable operational performance.
[483] JetPrism: diagnosing convergence for generative simulation and inverse problems in nuclear physics
Zeyu Xia, Tyler Kim, Trevor Reed, Judy Fox, Geoffrey Fox, Adam Szczepaniak
Main category: cs.LG
TL;DR: JetPrism framework reveals CFM training loss is misleading for physics applications, proposes multi-metric evaluation for generative surrogates in nuclear physics and beyond.
Details
Motivation: Standard Conditional Flow Matching (CFM) training loss plateaus prematurely in physics applications, serving as unreliable indicator of true convergence and physical fidelity, despite CFM's potential for accelerating computationally intensive simulations and inverse problems.Method: Developed JetPrism, a configurable CFM framework for evaluating unconditional generation and conditional detector unfolding. Used synthetic stress tests and Jefferson Lab kinematic dataset (γp → ρ⁰p → π⁺π⁻p) relevant to Electron-Ion Collider. Proposed multi-metric evaluation protocol with marginal/pairwise χ² statistics, W₁ distances, correlation matrix distances (D_corr), and nearest-neighbor distance ratios (R_NN).
Result: Physics-informed metrics continue to improve significantly long after standard CFM loss converges. JetPrism establishes dependable generative surrogate ensuring precise statistical agreement with ground-truth data without memorizing training set. Domain-specific evaluations must supersede generic loss metrics.
Conclusion: While demonstrated in nuclear physics, diagnostic framework is extensible to parameter generation and complex inverse problems across domains including medical imaging, astrophysics, semiconductor discovery, and quantitative finance where high-fidelity simulation and generative reliability are critical.
Abstract: High-fidelity Monte Carlo simulations and complex inverse problems, such as mapping smeared experimental observations to ground-truth states, are computationally intensive yet essential for robust data analysis. Conditional Flow Matching (CFM) offers a mathematically robust approach to accelerating these tasks, but we demonstrate its standard training loss is fundamentally misleading. In rigorous physics applications, CFM loss plateaus prematurely, serving as an unreliable indicator of true convergence and physical fidelity. To investigate this disconnect, we designed JetPrism, a configurable CFM framework acting as an efficient generative surrogate for evaluating unconditional generation and conditional detector unfolding. Using synthetic stress tests and a Jefferson Lab kinematic dataset ($γp \to ρ^0 p \to π^+π^- p$) relevant to the forthcoming Electron-Ion Collider (EIC), we establish that physics-informed metrics continue to improve significantly long after the standard loss converges. Consequently, we propose a multi-metric evaluation protocol incorporating marginal and pairwise $χ^2$ statistics, $W_1$ distances, correlation matrix distances ($D_{\mathrm{corr}}$), and nearest-neighbor distance ratios ($R_{\mathrm{NN}}$). By demonstrating that domain-specific evaluations must supersede generic loss metrics, this work establishes JetPrism as a dependable generative surrogate that ensures precise statistical agreement with ground-truth data without memorizing the training set. While demonstrated in nuclear physics, this diagnostic framework is readily extensible to parameter generation and complex inverse problems across broad domains. Potential applications span medical imaging, astrophysics, semiconductor discovery, and quantitative finance, where high-fidelity simulation, rigorous inversion, and generative reliability are critical.
[484] Detecting Complex Money Laundering Patterns with Incremental and Distributed Graph Modeling
Haseeb Tariq, Alen Kaja, Marwan Hassani
Main category: cs.LG
TL;DR: ReDiRect framework for detecting money laundering patterns in financial transaction graphs using unsupervised fuzzy partitioning and distributed processing.
Details
Motivation: Existing money laundering detection systems struggle with scale, complexity, and high false positive rates from rigid rule-based approaches, allowing criminals to hide financial footprints by mimicking legitimate transaction patterns.Method: Proposes ReDiRect framework with unsupervised approach: fuzzy partitioning of large transaction graphs into manageable components for distributed processing, plus refined evaluation metrics for better pattern effectiveness assessment.
Result: Demonstrates superior performance compared to existing techniques in efficiency and real-world applicability, validated on real Libra dataset and IBM Watson synthetic datasets.
Conclusion: ReDiRect effectively addresses scalability and false positive challenges in money laundering detection through novel unsupervised graph partitioning and distributed processing approach.
Abstract: Money launderers take advantage of limitations in existing detection approaches by hiding their financial footprints in a deceitful manner. They manage this by replicating transaction patterns that the monitoring systems cannot easily distinguish. As a result, criminally gained assets are pushed into legitimate financial channels without drawing attention. Algorithms developed to monitor money flows often struggle with scale and complexity. The difficulty of identifying such activities is further intensified by the (persistent) inability of current solutions to control the excessive number of false positive signals produced by rigid, risk-based rules systems. We propose a framework called ReDiRect (REduce, DIstribute, and RECTify), specifically designed to overcome these challenges. The primary contribution of our work is a novel framing of this problem in an unsupervised setting; where a large transaction graph is fuzzily partitioned into smaller, manageable components to enable fast processing in a distributed manner. In addition, we define a refined evaluation metric that better captures the effectiveness of exposed money laundering patterns. Through comprehensive experimentation, we demonstrate that our framework achieves superior performance compared to existing and state-of-the-art techniques, particularly in terms of efficiency and real-world applicability. For validation, we used the real (open source) Libra dataset and the recently released synthetic datasets by IBM Watson. Our code and datasets are available at https://github.com/mhaseebtariq/redirect.
[485] Efficient and Principled Scientific Discovery through Bayesian Optimization: A Tutorial
Zhongwei Yu, Rasul Tutunov, Alexandre Max Maraval, Zikai Xie, Zhenzhi Tan, Jiankang Wang, Zijing Li, Liangliang Xu, Qi Yang, Jun Jiang, Sanzhong Luo, Zhenxiao Guo, Haitham Bou-Ammar, Jun Wang
Main category: cs.LG
TL;DR: A tutorial on Bayesian Optimization (BO) as a principled framework for automating scientific discovery by formalizing the hypothesize-experiment-refine cycle using surrogate models and acquisition functions to guide efficient experimentation.
Details
Motivation: Traditional scientific discovery relies on intuitive, ad-hoc iterative cycles that waste resources, yield inefficient designs, and miss critical insights. There's a need for a principled, automated framework to eliminate guesswork and manual trial-and-error in experimentation.Method: Bayesian Optimization (BO) uses surrogate models (e.g., Gaussian processes) to model empirical observations as evolving hypotheses, and acquisition functions to guide experiment selection by balancing exploitation of known knowledge and exploration of uncharted domains.
Result: The tutorial presents BO’s core components, end-to-end workflows, and demonstrates real-world efficacy through case studies in catalysis, materials science, organic synthesis, and molecule discovery. It also covers technical extensions like batched experimentation, heteroscedasticity, contextual optimization, and human-in-the-loop integration.
Conclusion: BO bridges AI advances with practical natural science applications, offering tiered content to empower cross-disciplinary researchers to design more efficient experiments and accelerate principled scientific discovery.
Abstract: Traditional scientific discovery relies on an iterative hypothesise-experiment-refine cycle that has driven progress for centuries, but its intuitive, ad-hoc implementation often wastes resources, yields inefficient designs, and misses critical insights. This tutorial presents Bayesian Optimisation (BO), a principled probability-driven framework that formalises and automates this core scientific cycle. BO uses surrogate models (e.g., Gaussian processes) to model empirical observations as evolving hypotheses, and acquisition functions to guide experiment selection, balancing exploitation of known knowledge and exploration of uncharted domains to eliminate guesswork and manual trial-and-error. We first frame scientific discovery as an optimisation problem, then unpack BO’s core components, end-to-end workflows, and real-world efficacy via case studies in catalysis, materials science, organic synthesis, and molecule discovery. We also cover critical technical extensions for scientific applications, including batched experimentation, heteroscedasticity, contextual optimisation, and human-in-the-loop integration. Tailored for a broad audience, this tutorial bridges AI advances in BO with practical natural science applications, offering tiered content to empower cross-disciplinary researchers to design more efficient experiments and accelerate principled scientific discovery.
[486] Model Merging via Data-Free Covariance Estimation
Marawan Gamal Abdel Hameed, Derek Tam, Pascal Jr Tikeng Notsawo, Colin Raffel, Guillaume Rabusseau
Main category: cs.LG
TL;DR: Data-free model merging via covariance estimation from difference matrices, eliminating need for auxiliary data while maintaining performance
Details
Motivation: Existing model merging methods either lack theoretical justification or require auxiliary data for covariance estimation, limiting practical applicability. The paper aims to develop a principled, data-free merging approach.Method: Revisits interference minimization framework and shows covariance matrices can be estimated directly from difference matrices between model parameters, eliminating need for data. This enables data-free merging while reducing computational costs.
Result: Validated across vision and language benchmarks on models ranging from 86M to 7B parameters, outperforming previous data-free state-of-the-art merging methods.
Conclusion: Proposes a principled, data-free model merging method that bridges theoretical justification with practical advantages, enabling effective combination of model capabilities without auxiliary data.
Abstract: Model merging provides a way of cheaply combining individual models to produce a model that inherits each individual’s capabilities. While some merging methods can approach the performance of multitask training, they are often heuristically motivated and lack theoretical justification. A principled alternative is to pose model merging as a layer-wise optimization problem that directly minimizes interference between tasks. However, this formulation requires estimating per-layer covariance matrices from data, which may not be available when performing merging. In contrast, many of the heuristically-motivated methods do not require auxiliary data, making them practically advantageous. In this work, we revisit the interference minimization framework and show that, under certain conditions, covariance matrices can be estimated directly from difference matrices, eliminating the need for data while also reducing computational costs. We validate our approach across vision and language benchmarks on models ranging from 86M parameters to 7B parameters, outperforming previous data-free state-of-the-art merging methods
[487] SECURE: Stable Early Collision Understanding via Robust Embeddings in Autonomous Driving
Wenjing Wang, Wenxuan Wang, Songning Lai
Main category: cs.LG
TL;DR: SECURE framework enhances robustness of accident anticipation models against input perturbations by enforcing stability in predictions and latent features through multi-objective training.
Details
Motivation: Current deep learning models for accident anticipation show significant instability when faced with minor input perturbations, posing reliability risks for safety-critical systems despite high performance on clean data.Method: Introduces SECURE framework with four key robustness attributes: consistency and stability in both prediction space and latent feature space. Uses principled training methodology that fine-tunes baseline models with multi-objective loss minimizing divergence from reference model and penalizing sensitivity to adversarial perturbations.
Result: Experiments on DAD and CCD datasets show SECURE significantly enhances robustness against various perturbations while also improving performance on clean data, achieving new state-of-the-art results.
Conclusion: The SECURE framework successfully addresses robustness challenges in accident anticipation models, providing formal definitions and enforcement mechanisms for model stability against real-world perturbations.
Abstract: While deep learning has significantly advanced accident anticipation, the robustness of these safety-critical systems against real-world perturbations remains a major challenge. We reveal that state-of-the-art models like CRASH, despite their high performance, exhibit significant instability in predictions and latent representations when faced with minor input perturbations, posing serious reliability risks. To address this, we introduce SECURE - Stable Early Collision Understanding Robust Embeddings, a framework that formally defines and enforces model robustness. SECURE is founded on four key attributes: consistency and stability in both prediction space and latent feature space. We propose a principled training methodology that fine-tunes a baseline model using a multi-objective loss, which minimizes divergence from a reference model and penalizes sensitivity to adversarial perturbations. Experiments on DAD and CCD datasets demonstrate that our approach not only significantly enhances robustness against various perturbations but also improves performance on clean data, achieving new state-of-the-art results.
[488] Massively Parallel Exact Inference for Hawkes Processes
Ahmer Raza, Hudson Smith
Main category: cs.LG
TL;DR: Parallel algorithm for Hawkes process maximum likelihood estimation using GPU-optimized prefix scans, achieving O(N/P) complexity with exact likelihood computation.
Details
Motivation: Hawkes processes are important for modeling self-exciting events, but traditional maximum likelihood estimation scales poorly (O(N²)) and doesn't leverage modern GPU parallelism. Existing linear-time methods are sequential and miss optimization opportunities.Method: Express Hawkes process intensity as product of sparse transition matrices that admit linear-time associative multiplication. Use parallel prefix scan algorithm on GPUs to compute likelihood efficiently. Implement batching scheme for constant memory usage.
Result: Achieves orders-of-magnitude speedups on simulated and real datasets, scaling to thousands of nodes and tens of millions of events. Provides open-source PyTorch library with GPU optimizations.
Conclusion: Massively parallelizable algorithm enables exact likelihood computation for large-scale Hawkes processes, overcoming previous computational limitations while preserving model interpretability.
Abstract: Multivariate Hawkes processes are a widely used class of self-exciting point processes, but maximum likelihood estimation naively scales as $O(N^2)$ in the number of events. The canonical linear exponential Hawkes process admits a faster $O(N)$ recurrence, but prior work evaluates this recurrence sequentially, without exploiting parallelization on modern GPUs. We show that the Hawkes process intensity can be expressed as a product of sparse transition matrices admitting a linear-time associative multiply, enabling computation via a parallel prefix scan. This yields a simple yet massively parallelizable algorithm for maximum likelihood estimation of linear exponential Hawkes processes. Our method reduces the computational complexity to approximately $O(N/P)$ with $P$ parallel processors, and naturally yields a batching scheme to maintain constant memory usage, avoiding GPU memory constraints. Importantly, it computes the exact likelihood without any additional assumptions or approximations, preserving the simplicity and interpretability of the model. We demonstrate orders-of-magnitude speedups on simulated and real datasets, scaling to thousands of nodes and tens of millions of events, substantially beyond scales reported in prior work. We provide an open-source PyTorch library implementing our optimizations.
[489] Malliavin Calculus for Counterfactual Gradient Estimation in Adaptive Inverse Reinforcement Learning
Vikram Krishnamurthy, Luke Snow
Main category: cs.LG
TL;DR: Proposes a novel passive Langevin-based algorithm for adaptive inverse reinforcement learning using Malliavin calculus to efficiently estimate counterfactual gradients.
Details
Motivation: Adaptive IRL aims to reconstruct loss functions from observed RL gradients, but faces challenges with counterfactual gradient estimation that requires conditioning on zero-probability events, making naive Monte Carlo estimators inefficient and kernel smoothing slow.Method: Uses Malliavin calculus to reformulate counterfactual conditioning as ratio of unconditioned expectations involving Malliavin quantities, derives necessary Malliavin derivatives and adjoint Skorohod integrals for Langevin structure, and provides concrete algorithmic approach for counterfactual gradient estimation.
Result: Achieves adaptive IRL with standard estimation rates by overcoming the inefficiency of naive Monte Carlo estimators and slow convergence of kernel smoothing through Malliavin calculus techniques.
Conclusion: The proposed passive Langevin-based algorithm with Malliavin calculus enables efficient adaptive IRL by solving the fundamental challenge of counterfactual gradient estimation in reinforcement learning contexts.
Abstract: Inverse reinforcement learning (IRL) recovers the loss function of a forward learner from its observed responses adaptive IRL aims to reconstruct the loss function of a forward learner by passively observing its gradients as it performs reinforcement learning (RL). This paper proposes a novel passive Langevin-based algorithm that achieves adaptive IRL. The key difficulty in adaptive IRL is that the required gradients in the passive algorithm are counterfactual, that is, they are conditioned on events of probability zero under the forward learner’s trajectory. Therefore, naive Monte Carlo estimators are prohibitively inefficient, and kernel smoothing, though common, suffers from slow convergence. We overcome this by employing Malliavin calculus to efficiently estimate the required counterfactual gradients. We reformulate the counterfactual conditioning as a ratio of unconditioned expectations involving Malliavin quantities, thus recovering standard estimation rates. We derive the necessary Malliavin derivatives and their adjoint Skorohod integral formulations for a general Langevin structure, and provide a concrete algorithmic approach which exploits these for counterfactual gradient estimation.
[490] PI-JEPA: Label-Free Surrogate Pretraining for Coupled Multiphysics Simulation via Operator-Split Latent Prediction
Brandon Yee, Pairie Koh
Main category: cs.LG
TL;DR: PI-JEPA is a physics-informed pretraining framework for neural operator surrogates that trains without completed PDE solves, using masked latent prediction on unlabeled parameter fields with PDE residual regularization.
Details
Motivation: Reservoir simulation workflows have data asymmetry: abundant unlabeled parameter fields but scarce expensive labeled simulation data. Existing neural operator surrogates require large labeled datasets and cannot exploit unlabeled structure.Method: PI-JEPA uses masked latent prediction on unlabeled parameter fields with per-sub-operator PDE residual regularization. The predictor bank aligns with Lie-Trotter operator-splitting decomposition, dedicating separate physics-constrained latent modules to each sub-process (pressure, saturation transport, reaction).
Result: On single-phase Darcy flow, PI-JEPA achieves 1.9× lower error than FNO and 2.4× lower error than DeepONet at Nℓ=100, with 24% improvement over supervised-only training at Nℓ=500.
Conclusion: Label-free surrogate pretraining substantially reduces the simulation budget required for multiphysics surrogate deployment by exploiting abundant unlabeled parameter fields.
Abstract: Reservoir simulation workflows face a fundamental data asymmetry: input parameter fields (geostatistical permeability realizations, porosity distributions) are free to generate in arbitrary quantities, yet existing neural operator surrogates require large corpora of expensive labeled simulation trajectories and cannot exploit this unlabeled structure. We introduce \textbf{PI-JEPA} (Physics-Informed Joint Embedding Predictive Architecture), a surrogate pretraining framework that trains \emph{without any completed PDE solves}, using masked latent prediction on unlabeled parameter fields under per-sub-operator PDE residual regularization. The predictor bank is structurally aligned with the Lie–Trotter operator-splitting decomposition of the governing equations, dedicating a separate physics-constrained latent module to each sub-process (pressure, saturation transport, reaction), enabling fine-tuning with as few as 100 labeled simulation runs. On single-phase Darcy flow, PI-JEPA achieves $1.9\times$ lower error than FNO and $2.4\times$ lower error than DeepONet at $N_\ell{=}100$, with 24% improvement over supervised-only training at $N_\ell{=}500$, demonstrating that label-free surrogate pretraining substantially reduces the simulation budget required for multiphysics surrogate deployment.
[491] Residuals-based Offline Reinforcement Learning
Qing Zhu, Xian Yu
Main category: cs.LG
TL;DR: Proposes a residuals-based offline RL framework that incorporates transition dynamics estimation error into policy optimization using empirical residuals, with theoretical guarantees and a practical DQN algorithm.
Details
Motivation: Offline RL is important for high-stakes applications but existing methods rely on restrictive data coverage assumptions and suffer from distribution shift. The paper aims to address these limitations by explicitly incorporating estimation error in learning transition dynamics.Method: Defines a residuals-based Bellman optimality operator that incorporates transition dynamics estimation error using empirical residuals. Shows this operator is a contraction mapping, develops theoretical guarantees, and implements a residuals-based offline deep Q-learning (DQN) algorithm.
Result: Demonstrates effectiveness of the residuals-based offline DQN algorithm using a stochastic CartPole environment, showing improved performance over baseline methods.
Conclusion: The residuals-based framework provides a principled approach to offline RL that explicitly handles transition dynamics estimation error with theoretical guarantees and practical effectiveness.
Abstract: Offline reinforcement learning (RL) has received increasing attention for learning policies from previously collected data without interaction with the real environment, which is particularly important in high-stakes applications. While a growing body of work has developed offline RL algorithms, these methods often rely on restrictive assumptions about data coverage and suffer from distribution shift. In this paper, we propose a residuals-based offline RL framework for general state and action spaces. Specifically, we define a residuals-based Bellman optimality operator that explicitly incorporates estimation error in learning transition dynamics into policy optimization by leveraging empirical residuals. We show that this Bellman operator is a contraction mapping and identify conditions under which its fixed point is asymptotically optimal and possesses finite-sample guarantees. We further develop a residuals-based offline deep Q-learning (DQN) algorithm. Using a stochastic CartPole environment, we demonstrate the effectiveness of our residuals-based offline DQN algorithm.
[492] Benchmark Problems and Benchmark Datasets for the evaluation of Machine and Deep Learning methods on Photoplethysmography signals: the D4 report from the QUMPHY project
Urs Hackstein, Jordi Alastruey, Philip Aston, Ciaran Bench, Peter H. Charlton, Loic Coquelin, Nando Hegemann, Vaidotas Marozas, Mohammad Moulaeifard, Manasi Nandi, Andrius Petrenas, Oskar Pfeffer, Mantas Rinkevicius, Andrius Solosenko, Nils Strodthoff, Sara Vardanega
Main category: cs.LG
TL;DR: This report from the Qumphy project identifies six medical problems related to PPG signals as benchmark problems for quantifying uncertainty in ML algorithms applied to medical data.
Details
Motivation: The project aims to develop measures to quantify uncertainties in Machine Learning algorithms applied to medical problems, specifically focusing on Photoplethysmography (PPG) signal analysis and processing.Method: The report provides a list of six medical problems related to PPG signals that serve as benchmark problems, along with descriptions of suitable benchmark datasets and their usage.
Result: Six benchmark medical problems related to PPG signals are identified, along with corresponding benchmark datasets and usage guidelines for uncertainty quantification in ML applications.
Conclusion: The report establishes a foundation for benchmarking uncertainty quantification methods in ML applied to PPG signal analysis, providing standardized problems and datasets for the research community.
Abstract: This report is part of the Qumphy project (22HLT01 Qumphy) that is funded by the European Union and is dedicated to the development of measures to quantify the uncertainties associated with Machine Learning algorithms applied to medical problems, in particular the analysis and processing of Photoplethysmography (PPG) signals. In this report, a list of six medical problems that are related to PPG signals and serve as Benchmark Problems is given. Suitable Benchmark datasets and their usage are described also.
[493] Test-Time Scaling Makes Overtraining Compute-Optimal
Nicholas Roberts, Sungjun Cho, Zhiqi Gao, Tzu-Heng Huang, Albert Wu, Gabriel Orlanski, Avi Trost, Kelly Buchanan, Aws Albarghouthi, Frederic Sala
Main category: cs.LG
TL;DR: Train-to-Test (T²) scaling laws optimize model size, training tokens, and inference samples under fixed budgets, showing optimal pretraining shifts to heavy overtraining when accounting for inference costs.
Details
Motivation: Current scaling laws (like Chinchilla) only address pretraining optimization without considering test-time scaling costs, creating a trade-off between model size, training tokens, and inference samples that needs joint optimization under budget constraints.Method: Develop Train-to-Test (T²) scaling laws that incorporate pass@k modeling for test-time scaling, then jointly optimize pretraining and test-time decisions. Validate across eight downstream tasks and through actual pretraining of heavily overtrained models in optimal regions.
Result: When accounting for inference costs, optimal pretraining decisions shift radically into overtraining regime beyond standard scaling suites. Heavily overtrained models in T²-optimal regions show substantially stronger performance than pretraining scaling alone, and findings survive post-training stage.
Conclusion: T² scaling laws provide a more comprehensive framework for LLM optimization that accounts for both pretraining and inference costs, revealing that optimal deployment strategies require heavy overtraining when considering end-to-end budgets.
Abstract: Modern LLMs scale at test-time, e.g. via repeated sampling, where inference cost grows with model size and the number of samples. This creates a trade-off that pretraining scaling laws, such as Chinchilla, do not address. We present Train-to-Test ($T^2$) scaling laws that jointly optimize model size, training tokens, and number of inference samples under fixed end-to-end budgets. $T^2$ modernizes pretraining scaling laws with pass@$k$ modeling used for test-time scaling, then jointly optimizes pretraining and test-time decisions. Forecasts from $T^2$ are robust over distinct modeling approaches: measuring joint scaling effect on the task loss and modeling impact on task accuracy. Across eight downstream tasks, we find that when accounting for inference cost, optimal pretraining decisions shift radically into the overtraining regime, well-outside of the range of standard pretraining scaling suites. We validate our results by pretraining heavily overtrained models in the optimal region that $T^2$ scaling forecasts, confirming their substantially stronger performance compared to pretraining scaling alone. Finally, as frontier LLMs are post-trained, we show that our findings survive the post-training stage, making $T^2$ scaling meaningful in modern deployments.
[494] Improving Latent Generalization Using Test-time Compute
Arslan Chaudhry, Sridhar Thiagarajan, Andrew Lampinen
Main category: cs.LG
TL;DR: Training language models with RL to produce long chains-of-thought at test time improves latent generalization from in-weights knowledge, though still lags behind in-context learning for certain tasks.
Details
Motivation: Language models struggle with deductive reasoning over internalized knowledge (latent generalization deficit), while in-context learning shows robust generalization. Existing augmentation methods are task-specific and don't generalize to out-of-distribution knowledge.Method: Use Reinforcement Learning from correctness feedback to train models to produce long chains-of-thought at test time (’thinking’) to improve latent generalization from in-weights knowledge.
Result: Thinking approach resolves many latent generalization failures on in-distribution knowledge and generalizes to new knowledge without RL training. However, on pure reversal tasks, thinking doesn’t unlock direct knowledge inversion but enables above-chance performance through generate-and-verify.
Conclusion: Test-time thinking is a flexible and promising direction for improving latent generalization of LMs, though still below in-context learning performance for certain tasks like factual self-verification.
Abstract: Language Models (LMs) exhibit two distinct mechanisms for knowledge acquisition: in-weights learning (i.e., encoding information within the model weights) and in-context learning (ICL). Although these two modes offer complementary strengths, in-weights learning frequently struggles to facilitate deductive reasoning over the internalized knowledge. We characterize this limitation as a deficit in latent generalization, of which the reversal curse is one example. Conversely, in-context learning demonstrates highly robust latent generalization capabilities. To improve latent generalization from in-weights knowledge, prior approaches rely on train-time data augmentation, yet these techniques are task-specific, scale poorly, and fail to generalize to out-of-distribution knowledge. To overcome these shortcomings, this work studies how models can be taught to use test-time compute, or ’thinking’, specifically to improve latent generalization. We use Reinforcement Learning (RL) from correctness feedback to train models to produce long chains-of-thought (CoTs) to improve latent generalization. Our experiments show that this thinking approach not only resolves many instances of latent generalization failures on in-distribution knowledge but also, unlike augmentation baselines, generalizes to new knowledge for which no RL was performed. Nevertheless, on pure reversal tasks, we find that thinking does not unlock direct knowledge inversion, but the generate-and-verify ability of thinking models enables them to get well above chance performance. The brittleness of factual self-verification means thinking models still remain well below the performance of in-context learning for this task. Overall, our results establish test-time thinking as a flexible and promising direction for improving the latent generalization of LMs.
[495] When Reward Hacking Rebounds: Understanding and Mitigating It with Representation-Level Signals
Rui Wu, Ruixiang Tang
Main category: cs.LG
TL;DR: Paper studies reward hacking in LLMs for coding tasks using environment-manipulation setting, identifies three-phase rebound pattern, uses representation engineering to detect hacking via shortcut concepts, and proposes Advantage Modification to suppress hacking during training.
Details
Motivation: Reinforcement learning for LLMs is vulnerable to reward hacking where models exploit shortcuts to maximize reward without solving intended tasks. The authors want to systematically study this phenomenon in a controlled setting using coding tasks where models can rewrite evaluator code.Method: Use environment-manipulation setting for coding tasks where models can rewrite evaluator code to pass tests. Analyze hacking behavior patterns across models. Apply representation engineering to extract concept directions (shortcut, deception, evaluation awareness) from domain-general contrastive pairs. Propose Advantage Modification that integrates shortcut concept scores into GRPO advantage computation to penalize hacking rollouts during training.
Result: Identified reproducible three-phase rebound pattern: models first attempt but fail to rewrite evaluator, then temporarily retreat to legitimate solving, then rebound into successful hacking with different strategies. Found shortcut direction tracks hacking behavior most closely and serves as effective representational proxy for detection. Advantage Modification provides more robust suppression of hacking compared to generation-time activation steering.
Conclusion: Reward hacking in LLMs follows predictable patterns that can be detected via representation engineering. Integrating shortcut concept detection into training advantage computation (Advantage Modification) offers more robust solution than inference-time interventions.
Abstract: Reinforcement learning for LLMs is vulnerable to reward hacking, where models exploit shortcuts to maximize reward without solving the intended task. We systematically study this phenomenon in coding tasks using an environment-manipulation setting, where models can rewrite evaluator code to trivially pass tests without solving the task, as a controlled testbed. Across both studied models, we identify a reproducible three-phase rebound pattern: models first attempt to rewrite the evaluator but fail, as their rewrites embed test cases their own solutions cannot pass. They then temporarily retreat to legitimate solving. When legitimate reward remains scarce, they rebound into successful hacking with qualitatively different strategies. Using representation engineering, we extract concept directions for shortcut, deception, and evaluation awareness from domain-general contrastive pairs and find that the shortcut direction tracks hacking behavior most closely, making it an effective representational proxy for detection. Motivated by this finding, we propose Advantage Modification, which integrates shortcut concept scores into GRPO advantage computation to penalize hacking rollouts before policy updates. Because the penalty is internalized into the training signal rather than applied only at inference time, Advantage Modification provides more robust suppression of hacking compared with generation-time activation steering.
[496] Soft MPCritic: Amortized Model Predictive Value Iteration
Thomas Banker, Nathan P. Lawrence, Ali Mesbah
Main category: cs.LG
TL;DR: Soft MPCritic combines reinforcement learning with model predictive control using value learning and sample-based planning, making RL-MPC synthesis practical and scalable through amortized warm-start strategies and short-horizon planning.
Details
Motivation: Reinforcement learning and model predictive control have complementary strengths but combining them at scale remains computationally challenging. The authors aim to create a practical framework that synthesizes MPC policies in settings where policy extraction and direct long-horizon planning may fail.Method: Soft MPCritic learns in soft value space while using sample-based planning for both online control and value target generation. It instantiates MPC through model predictive path integral control (MPPI) and trains a terminal Q-function with fitted value iteration. The framework includes an amortized warm-start strategy that recycles planned open-loop action sequences from online observations when computing batched MPPI-based value targets, and plans in a scenario-based fashion with an ensemble of dynamic models.
Result: Soft MPCritic enables effective learning through robust, short-horizon planning on both classic and complex control tasks. The approach makes RL-MPC synthesis computationally practical while preserving solution quality.
Conclusion: Soft MPCritic establishes a practical and scalable blueprint for synthesizing MPC policies, successfully combining RL and MPC in a computationally efficient manner that works in settings where traditional approaches fail.
Abstract: Reinforcement learning (RL) and model predictive control (MPC) offer complementary strengths, yet combining them at scale remains computationally challenging. We propose soft MPCritic, an RL-MPC framework that learns in (soft) value space while using sample-based planning for both online control and value target generation. soft MPCritic instantiates MPC through model predictive path integral control (MPPI) and trains a terminal Q-function with fitted value iteration, aligning the learned value function with the planner and implicitly extending the effective planning horizon. We introduce an amortized warm-start strategy that recycles planned open-loop action sequences from online observations when computing batched MPPI-based value targets. This makes soft MPCritic computationally practical, while preserving solution quality. soft MPCritic plans in a scenario-based fashion with an ensemble of dynamic models trained for next-step prediction accuracy. Together, these ingredients enable soft MPCritic to learn effectively through robust, short-horizon planning on classic and complex control tasks. These results establish soft MPCritic as a practical and scalable blueprint for synthesizing MPC policies in settings where policy extraction and direct, long-horizon planning may fail.
[497] DISCO-TAB: A Hierarchical Reinforcement Learning Framework for Privacy-Preserving Synthesis of Complex Clinical Data
Arshia Ilaty, Hossein Shirazi, Amir Rahmani, Hajar Homayouni
Main category: cs.LG
TL;DR: DISCO-TAB is a novel framework that uses a fine-tuned LLM with multi-objective discriminator system via RL to generate high-quality synthetic tabular EHR data, addressing class imbalance and clinical validity issues.
Details
Motivation: Clinical decision support systems face data scarcity due to privacy concerns, and existing LLM-based synthetic data generation struggles with complex dependencies and class imbalances in EHR data, producing clinically invalid records.Method: DISCO-TAB orchestrates a fine-tuned LLM with a multi-objective discriminator system optimized via Reinforcement Learning, evaluating synthesis at four granularities (token, sentence, feature, row) with Automated Constraint Discovery and Inverse-Frequency Reward Shaping.
Result: Achieves up to 38.2% improvement in downstream clinical classifier utility compared to GAN and Diffusion baselines, with exceptional statistical fidelity (JSD < 0.01) and robust resistance to membership inference attacks.
Conclusion: Establishes a new standard for generating trustworthy, utility-preserving synthetic tabular data for sensitive healthcare applications through hierarchical feedback mechanisms.
Abstract: The development of robust clinical decision support systems is frequently impeded by the scarcity of high-fidelity, privacy-preserving biomedical data. While Generative Large Language Models (LLMs) offer a promising avenue for synthetic data generation, they often struggle to capture the complex, non-linear dependencies and severe class imbalances inherent in Electronic Health Records (EHR), leading to statistically plausible but clinically invalid records. To bridge this gap, we introduce DISCO-TAB (DIScriminator-guided COntrol for TABular synthesis), a novel framework that orchestrates a fine-tuned LLM with a multi-objective discriminator system optimized via Reinforcement Learning. Unlike prior methods relying on scalar feedback, DISCO-TAB evaluates synthesis at four granularities, token, sentence, feature, and row, while integrating Automated Constraint Discovery and Inverse-Frequency Reward Shaping to autonomously preserve latent medical logic and resolve minority-class collapse. We rigorously validate our framework across diverse benchmarks, including high-dimensional, small-sample medical datasets (e.g., Heart Failure, Parkinson’s). Our results demonstrate that hierarchical feedback yields state-of-the-art performance, achieving up to 38.2% improvement in downstream clinical classifier utility compared to GAN and Diffusion baselines, while ensuring exceptional statistical fidelity (JSD < 0.01) and robust resistance to membership inference attacks. This work establishes a new standard for generating trustworthy, utility-preserving synthetic tabular data for sensitive healthcare applications.
[498] CuTeGen: An LLM-Based Agentic Framework for Generation and Optimization of High-Performance GPU Kernels using CuTe
Tara Saba, Anne Ouyang, Xujie Si, Fan Long
Main category: cs.LG
TL;DR: CuTeGen is an agentic framework that automates GPU kernel generation and optimization through a structured generate-test-refine workflow using the CuTe abstraction layer, producing correct and performant kernels for ML workloads.
Details
Motivation: Developing efficient GPU kernels for ML systems is challenging and expert-driven due to tight coupling between algorithms, memory usage, and hardware optimizations. Existing LLM-based approaches often fail to maintain correctness and achieve competitive performance across iterations.Method: CuTeGen treats kernel development as a structured generate-test-refine workflow using the CuTe abstraction layer. It focuses on progressive refinement of a single kernel through execution-based validation, structured debugging, staged optimization, workload-aware prompts, and delayed integration of profiling feedback.
Result: Experimental results on matrix multiplication and activation workloads show the framework produces functionally correct kernels and achieves competitive performance relative to optimized library implementations.
Conclusion: CuTeGen demonstrates that agentic frameworks with structured refinement workflows and appropriate abstraction layers can effectively automate GPU kernel development, addressing correctness and performance challenges in ML systems.
Abstract: High-performance GPU kernels are critical to modern machine learning systems, yet developing efficient implementations remains a challenging, expert-driven process due to the tight coupling between algorithmic structure, memory hierarchy usage, and hardware-specific optimizations. Recent work has explored using large language models (LLMs) to generate GPU kernels automatically, but generated implementations often struggle to maintain correctness and achieve competitive performance across iterative refinements. We present CuTeGen, an agentic framework for automated generation and optimization of GPU kernels that treats kernel development as a structured generate–test–refine workflow. Unlike approaches that rely on one-shot generation or large-scale search over candidate implementations, CuTeGen focuses on progressive refinement of a single evolving kernel through execution-based validation, structured debugging, and staged optimization. A key design choice is to generate kernels using the CuTe abstraction layer, which exposes performance-critical structures such as tiling and data movement while providing a more stable representation for iterative modification. To guide performance improvement, CuTeGen incorporates workload-aware optimization prompts and delayed integration of profiling feedback. Experimental results on matrix multiplication and activation workloads demonstrate that the framework produces functionally correct kernels and achieves competitive performance relative to optimized library implementations.
[499] Matching Accuracy, Different Geometry: Evolution Strategies vs GRPO in LLM Post-Training
William Hoy, Binxu Wang, Xu Pan
Main category: cs.LG
TL;DR: ES and GRPO achieve similar task performance in LLM fine-tuning but produce geometrically distinct parameter updates, with ES making larger, more diffuse changes while GRPO makes smaller, more localized updates.
Details
Motivation: To understand whether Evolution Strategies (ES) and gradient-based reinforcement learning methods like GRPO produce comparable solutions in parameter space despite similar task performance, and to analyze their geometric properties and implications for forgetting and knowledge preservation.Method: Comparative analysis of ES and GRPO across four tasks in single-task and sequential continual-learning settings, with controlled iteration budgets. Developed analytical theory of ES to explain observed phenomena within a unified framework.
Result: ES matches or exceeds GRPO in single-task accuracy and remains competitive sequentially. Despite similar performance, ES makes much larger parameter changes with broader off-task KL drift, while GRPO makes smaller, localized updates. The solutions are linearly connected with no loss barrier despite nearly orthogonal update directions.
Conclusion: Gradient-free (ES) and gradient-based (GRPO) fine-tuning reach similarly accurate but geometrically distinct solutions, with important consequences for forgetting and knowledge preservation in LLM fine-tuning.
Abstract: Evolution Strategies (ES) have emerged as a scalable gradient-free alternative to reinforcement learning based LLM fine-tuning, but it remains unclear whether comparable task performance implies comparable solutions in parameter space. We compare ES and Group Relative Policy Optimization (GRPO) across four tasks in both single-task and sequential continual-learning settings. ES matches or exceeds GRPO in single-task accuracy and remains competitive sequentially when its iteration budget is controlled. Despite this similarity in task performance, the two methods produce markedly different model updates: ES makes much larger changes and induces broader off-task KL drift, whereas GRPO makes smaller, more localized updates. Strikingly, the ES and GRPO solutions are linearly connected with no loss barrier, even though their update directions are nearly orthogonal. We develop an analytical theory of ES that explains all these phenomena within a unified framework, showing how ES can accumulate large off-task movement on weakly informative directions while still making enough progress on the task to match gradient-based RL in downstream accuracy. These results show that gradient-free and gradient-based fine-tuning can reach similarly accurate yet geometrically distinct solutions, with important consequences for forgetting and knowledge preservation. The source code is publicly available: https://github.com/Bhoy1/ESvsGRPO.
[500] Beyond Logit Adjustment: A Residual Decomposition Framework for Long-Tailed Reranking
Zhanliang Wang, Hongzhuo Chen, Quan Minh Nguyen, Mian Umair Ahsan, Kai Wang
Main category: cs.LG
TL;DR: REPAIR is a post-hoc reranking method for long-tailed classification that combines classwise and pairwise corrections to address limitations of fixed-offset methods.
Details
Motivation: Existing post-hoc methods for long-tailed classification use fixed classwise offsets, but optimal correction varies across inputs and competing labels, requiring adaptive pairwise correction.Method: Bayes-optimal reranking on base-model top-k shortlist decomposes correction into classwise (constant) and pairwise (input-dependent) components. REPAIR combines shrinkage-stabilized classwise term with linear pairwise term using competition features.
Result: Experiments on five benchmarks (image classification, species recognition, scene recognition, rare disease diagnosis) confirm decomposition explains where pairwise correction helps vs. where classwise alone suffices.
Conclusion: Pairwise correction is necessary when incompatible ordering constraints exist across contexts; REPAIR provides lightweight post-hoc solution that adapts to input-dependent competition.
Abstract: Long-tailed classification, where a small number of frequent classes dominate many rare ones, remains challenging because models systematically favor frequent classes at inference time. Existing post-hoc methods such as logit adjustment address this by adding a fixed classwise offset to the base-model logits. However, the correction required to restore the relative ranking of two classes need not be constant across inputs, and a fixed offset cannot adapt to such variation. We study this problem through Bayes-optimal reranking on a base-model top-k shortlist. The gap between the optimal score and the base score, the residual correction, decomposes into a classwise component that is constant within each class, and a pairwise component that depends on the input and competing labels. When the residual is purely classwise, a fixed offset suffices to recover the Bayes-optimal ordering. We further show that when the same label pair induces incompatible ordering constraints across contexts, no fixed offset can achieve this recovery. This decomposition leads to testable predictions regarding when pairwise correction can improve performance and when cannot. We develop REPAIR (Reranking via Pairwise residual correction), a lightweight post-hoc reranker that combines a shrinkage-stabilized classwise term with a linear pairwise term driven by competition features on the shortlist. Experiments on five benchmarks spanning image classification, species recognition, scene recognition, and rare disease diagnosis confirm that the decomposition explains where pairwise correction helps and where classwise correction alone suffices.
[501] Learning ECG Image Representations via Dual Physiological-Aware Alignments
Hung Manh Pham, Jialu Tang, Aaqib Saeed, Dong Ma, Bin Zhu, Pan Zhou
Main category: cs.LG
TL;DR: ECG-Scan: A self-supervised framework for learning clinically generalized representations from ECG images using dual physiological-aware alignments, bridging the gap between image-based and signal-based ECG analysis.
Details
Motivation: ECG data often exists only in image form worldwide, but most automated ECG analysis methods require raw signal recordings, limiting applicability in real-world and resource-constrained settings. There's a need to unlock large-scale legacy ECG image data for automated cardiovascular diagnostics.Method: Uses dual physiological-aware alignments: 1) multimodal contrastive alignment between ECG images and gold-standard signal-text modalities to optimize image representation learning, and 2) soft-lead constraints that integrate domain knowledge to regularize reconstruction and improve signal lead inter-consistency.
Result: Extensive benchmarking shows superior performance compared to existing image baselines and significantly narrows the gap between ECG image and signal analysis across multiple datasets and downstream tasks.
Conclusion: Self-supervised image modeling has potential to unlock large-scale legacy ECG data and broaden access to automated cardiovascular diagnostics, demonstrating the effectiveness of physiological-aware alignments for learning clinically generalized representations from ECG images.
Abstract: Electrocardiograms (ECGs) are among the most widely used diagnostic tools for cardiovascular diseases, and a large amount of ECG data worldwide appears only in image form. However, most existing automated ECG analysis methods rely on access to raw signal recordings, limiting their applicability in real-world and resource-constrained settings. In this paper, we present ECG-Scan, a self-supervised framework for learning clinically generalized representations from ECG images through dual physiological-aware alignments: 1) Our approach optimizes image representation learning using multimodal contrastive alignment between image and gold-standard signal-text modalities. 2) We further integrate domain knowledge via soft-lead constraints, regularizing the reconstruction process and improving signal lead inter-consistency. Extensive benchmarking across multiple datasets and downstream tasks demonstrates that our image-based model achieves superior performance compared to existing image baselines and notably narrows the gap between ECG image and signal analysis. These results highlight the potential of self-supervised image modeling to unlock large-scale legacy ECG data and broaden access to automated cardiovascular diagnostics.
[502] ZEUS: Accelerating Diffusion Models with Only Second-Order Predictor
Yixiao Wang, Ting Jiang, Zishan Shao, Hancheng Ye, Jingwei Sun, Mingyuan Ma, Jianyi Zhang, Yiran Chen, Hai Li
Main category: cs.LG
TL;DR: ZEUS: A training-free acceleration method for diffusion models that uses second-order predictors and interleaved skipping to achieve up to 3.2x speedup while maintaining quality.
Details
Motivation: Current training-free acceleration methods for diffusion models are too complex, with higher-order predictors amplifying errors under aggressive speedups and architectural modifications hindering deployment. There's a need for simpler, more effective acceleration that works with existing models.Method: ZEUS predicts reduced denoiser evaluations using a second-order predictor and stabilizes aggressive consecutive skipping with an interleaved scheme that avoids back-to-back extrapolations. It adds zero overhead, no feature caches, and no architectural modifications.
Result: ZEUS achieves up to 3.2x end-to-end speedup across image and video generation while maintaining perceptual quality, outperforming recent training-free baselines in speed-fidelity trade-off.
Conclusion: ZEUS provides an effective, simple training-free acceleration method for diffusion models that works with various backbones, prediction objectives, and solver choices without architectural changes.
Abstract: Denoising generative models deliver high-fidelity generation but remain bottlenecked by inference latency due to the many iterative denoiser calls required during sampling. Training-free acceleration methods reduce latency by either sparsifying the model architecture or shortening the sampling trajectory. Current training-free acceleration methods are more complex than necessary: higher-order predictors amplify error under aggressive speedups, and architectural modifications hinder deployment. Beyond 2x acceleration, step skipping creates structural scarcity – at most one fresh evaluation per local window – leaving the computed output and its backward difference as the only causally grounded information. Based on this, we propose ZEUS, an acceleration method that predicts reduced denoiser evaluations using a second-order predictor, and stabilizes aggressive consecutive skipping with an interleaved scheme that avoids back-to-back extrapolations. ZEUS adds essentially zero overhead, no feature caches, and no architectural modifications, and it is compatible with different backbones, prediction objectives, and solver choices. Across image and video generation, ZEUS consistently improves the speed-fidelity performance over recent training-free baselines, achieving up to 3.2x end-to-end speedup while maintaining perceptual quality. Our code is available at: https://github.com/Ting-Justin-Jiang/ZEUS.
[503] Care-Conditioned Neuromodulation for Autonomy-Preserving Supportive Dialogue Agents
Shalima Binta Manir, Tim Oates
Main category: cs.LG
TL;DR: CCN framework uses learned care signals to condition dialogue generation, balancing helpfulness with autonomy preservation through utility-based reranking.
Details
Motivation: Standard alignment methods focus on helpfulness and harmlessness but neglect relational risks like dependency reinforcement, overprotection, and coercive guidance in supportive AI roles.Method: Care-Conditioned Neuromodulation (CCN) - state-dependent control framework with learned scalar signals from user state/dialogue context that conditions response generation and candidate selection, combined with utility-based reranking.
Result: CCN improves autonomy-preserving utility by +0.25 over supervised fine-tuning and +0.07 over preference optimization baselines while maintaining comparable supportiveness on relational failure benchmark.
Conclusion: State-dependent control with utility-based selection is practical for multi-objective alignment in autonomy-sensitive dialogue systems.
Abstract: Large language models deployed in supportive or advisory roles must balance helpfulness with preservation of user autonomy, yet standard alignment methods primarily optimize for helpfulness and harmlessness without explicitly modeling relational risks such as dependency reinforcement, overprotection, or coercive guidance. We introduce Care-Conditioned Neuromodulation (CCN), a state-dependent control framework in which a learned scalar signal derived from structured user state and dialogue context conditions response generation and candidate selection. We formalize this setting as an autonomy-preserving alignment problem and define a utility function that rewards autonomy support and helpfulness while penalizing dependency and coercion. We also construct a benchmark of relational failure modes in multi-turn dialogue, including reassurance dependence, manipulative care, overprotection, and boundary inconsistency. On this benchmark, care-conditioned candidate generation combined with utility-based reranking improves autonomy-preserving utility by +0.25 over supervised fine-tuning and +0.07 over preference optimization baselines while maintaining comparable supportiveness. Pilot human evaluation and zero-shot transfer to real emotional-support conversations show directional agreement with automated metrics. These results suggest that state-dependent control combined with utility-based selection is a practical approach to multi-objective alignment in autonomy-sensitive dialogue.
[504] Thinking While Listening: Fast-Slow Recurrence for Long-Horizon Sequential Modeling
Shota Takashiro, Masanori Koyama, Takeru Miyato, Yusuke Iwasawa, Yutaka Matsuo, Kohei Hayashi
Main category: cs.LG
TL;DR: A method that extends latent recurrent modeling to sequential inputs by interleaving fast recurrent latent updates with slow observation updates, enabling stable internal structures that evolve with input for better long-term coherence and out-of-distribution generalization.
Details
Motivation: To address limitations of existing sequential models (LSTM, state space models, Transformers) in maintaining coherent representations over long horizons and improving out-of-distribution generalization in reinforcement learning and algorithmic tasks.Method: Extends latent recurrent modeling by interleaving fast recurrent latent updates with self-organizational ability between slow observation updates, facilitating learning of stable internal structures that evolve alongside input streams.
Result: The method improves out-of-distribution generalization in reinforcement learning and algorithmic tasks compared to sequential baselines like LSTM, state space models, and Transformer variants, maintaining coherent and clustered representations over long horizons.
Conclusion: The proposed approach enables better learning of stable internal representations that evolve with sequential input, offering advantages for long-term coherence and generalization over existing sequential modeling techniques.
Abstract: We extend the recent latent recurrent modeling to sequential input streams. By interleaving fast, recurrent latent updates with self-organizational ability between slow observation updates, our method facilitates the learning of stable internal structures that evolve alongside the input. This mechanism allows the model to maintain coherent and clustered representations over long horizons, improving out-of-distribution generalization in reinforcement learning and algorithmic tasks compared to sequential baselines such as LSTM, state space models, and Transformer variants.
[505] Variational LSTM with Augmented Inputs: Nonlinear Response History Metamodeling with Aleatoric and Epistemic Uncertainty
Manisha Sapkota, Min Li, Bowei Li
Main category: cs.LG
TL;DR: A probabilistic metamodel using variational LSTM with augmented inputs to capture both aleatoric and epistemic uncertainties in nonlinear dynamic structural systems under stochastic excitations.
Details
Motivation: High-dimensional nonlinear dynamic structural systems require uncertainty propagation for performance-based design and risk assessment, but computational demands are heavy. Machine learning metamodels help but need to avoid overconfidence when data/training is insufficient, requiring estimation of both aleatoric (inherent) and epistemic (model/prediction) uncertainties.Method: Developed a probabilistic metamodeling technique based on variational long short-term memory (LSTM) with augmented inputs. Key random system parameters are treated as augmented inputs alongside excitation series to capture aleatoric uncertainty. Epistemic uncertainty is approximated via Monte Carlo dropout scheme, avoiding expensive full Bayesian approaches.
Result: The calibrated metamodels accurately reproduce nonlinear response time histories and provide confidence bounds indicating epistemic uncertainty. The method incurs negligible additional training costs while enabling nearly cost-free uncertainty simulation.
Conclusion: The proposed technique effectively captures both aleatoric and epistemic uncertainties in machine learning-based metamodels for nonlinear dynamic structural systems, providing accurate predictions with confidence bounds while maintaining computational efficiency.
Abstract: Uncertainty propagation in high-dimensional nonlinear dynamic structural systems is pivotal in state-of-the-art performance-based design and risk assessment, where uncertainties from both excitations and structures, i.e., the aleatoric uncertainty, must be considered. This poses a significant challenge due to heavy computational demands. Machine learning techniques are thus introduced as metamodels to alleviate this burden. However, the “black box” nature of Machine learning models underscores the necessity of avoiding overly confident predictions, particularly when data and training efforts are insufficient. This creates a need, in addition to considering the aleatoric uncertainty, of estimating the uncertainty related to the prediction confidence, i.e., epistemic uncertainty, for machine learning-based metamodels. We developed a probabilistic metamodeling technique based on a variational long short-term memory (LSTM) with augmented inputs to simultaneously capture aleatoric and epistemic uncertainties. Key random system parameters are treated as augmented inputs alongside excitation series carrying record-to-record variability to capture the full range of aleatoric uncertainty. Meanwhile, epistemic uncertainty is effectively approximated via the Monte Carlo dropout scheme. Unlike computationally expensive full Bayesian approaches, this method incurs negligible additional training costs while enabling nearly cost-free uncertainty simulation. The proposed technique is demonstrated through multiple case studies involving stochastic seismic or wind excitations. Results show that the calibrated metamodels accurately reproduce nonlinear response time histories and provide confidence bounds indicating the associated epistemic uncertainty.
[506] Optimizing EEG Graph Structure for Seizure Detection: An Information Bottleneck and Self-Supervised Learning Approach
Lincan Li, Rikuto Kotoge, Xihao Piao, Zheng Chen, Yushun Dong
Main category: cs.LG
TL;DR: IRENE: Information Bottleneck-guided EEG seizure detection method that jointly learns denoised dynamic graph structures and spatial-temporal representations using self-supervised learning to address EEG noise and inter-patient variability.
Details
Motivation: Current EEG seizure detection methods struggle with complex spatiotemporal dynamics, extreme inter-patient variability, and noisy EEG data. Existing graph-based approaches often create redundant or irrelevant connections due to not accounting for EEG's noisy nature, undermining performance even with advanced architectures.Method: Proposes IRENE framework that: 1) Jointly learns denoised dynamic graph structures guided by Information Bottleneck principle to identify most informative nodes/edges, 2) Uses self-supervised Graph Masked AutoEncoder to reconstruct masked EEG signals based on dynamic graph context, promoting structure-aware compact representations, 3) Explicitly accounts for EEG noise characteristics to produce reliable connectivity patterns.
Result: Extensive experiments on benchmark EEG datasets show IRENE outperforms state-of-the-art baselines in seizure detection. The method provides clinically meaningful insights into seizure dynamics and demonstrates enhanced robustness against label scarcity and inter-patient variability.
Conclusion: IRENE offers a new perspective for EEG seizure detection by jointly learning denoised dynamic graph structures and informative spatial-temporal representations through Information Bottleneck guidance and self-supervised learning, addressing key challenges in noisy EEG data analysis.
Abstract: Seizure detection from EEG signals is highly challenging due to complex spatiotemporal dynamics and extreme inter-patient variability. To model them, recent methods construct dynamic graphs via statistical correlations, predefined similarity measures, or implicit learning, yet rarely account for EEG’s noisy nature. Consequently, these graphs usually contain redundant or task-irrelevant connections, undermining model performance even with state-of-the-art architectures. In this paper, we present a new perspective for EEG seizure detection: jointly learning denoised dynamic graph structures and informative spatial-temporal representations guided by the Information Bottleneck (IB). Unlike prior approaches, our graph constructor explicitly accounts for the noisy characteristics of EEG data, producing compact and reliable connectivity patterns that better support downstream seizure detection. To further enhance representation learning, we employ a self-supervised Graph Masked AutoEncoder that reconstructs masked EEG signals based on dynamic graph context, promoting structure-aware and compact representations aligned with the IB principle. Bringing things together, we introduce Information Bottleneck-guided EEG SeizuRE DetectioN via SElf-Supervised Learning (IRENE), which explicitly learns dynamic graph structures and interpretable spatial-temporal EEG representations. IRENE addresses three core challenges: (i) Identifying the most informative nodes and edges; (ii) Explaining seizure propagation in the brain network; and (iii) Enhancing robustness against label scarcity and inter-patient variability. Extensive experiments on benchmark EEG datasets demonstrate that our method outperforms state-of-the-art baselines in seizure detection and provides clinically meaningful insights into seizure dynamics. The source code is available at https://github.com/LabRAI/IRENE.
[507] Learning from the Right Rollouts: Data Attribution for PPO-based LLM Post-Training
Dong Shu, Denghui Zhang, Jessica Hullman
Main category: cs.LG
TL;DR: I-PPO improves RL training by filtering out noisy episodes using influence scores, accelerating training and reducing unfaithful reasoning.
Details
Motivation: Traditional RL algorithms like PPO train on all rollout episodes, but many contain noisy or unfaithful reasoning that degrades performance and slows training.Method: Proposes Influence-Guided PPO (I-PPO) that integrates data attribution into RL post-training by calculating influence scores for each episode using gradient-based approximation to identify and eliminate anti-aligned episodes.
Result: I-PPO consistently outperforms SFT and PPO baselines, acts as an intrinsic early stopping mechanism, accelerates training efficiency, and effectively reduces unfaithful CoT reasoning.
Conclusion: Filtering episodes based on influence scores improves RL training by removing harmful data, leading to better performance and faster convergence.
Abstract: Traditional RL algorithms like Proximal Policy Optimization (PPO) typically train on the entire rollout buffer, operating under the assumption that all generated episodes provide a beneficial optimization signal. However, these episodes frequently contain noisy or unfaithful reasoning, which can degrade model performance and slow down training. In this paper, we propose \textbf{Influence-Guided PPO (I-PPO)}, a novel framework that integrates data attribution into the RL post-training loop. By calculating an influence score for each episode using a gradient-based approximation, I-PPO identifies and eliminates episodes that are anti-aligned with a validation gradient. Our experiments demonstrate that I-PPO consistently outperforms SFT and PPO baselines. We show that our filtering process acts as an intrinsic early stopping mechanism, accelerating training efficiency while effectively reducing unfaithful CoT reasoning.
[508] Training In-Context and In-Weights Mixtures Via Contrastive Context Sampling
Deeptanshu Malu, Deevyanshu Malu, Aditya Nemiwal, Sunita Sarawagi
Main category: cs.LG
TL;DR: The paper investigates training strategies for co-developing in-context learning (ICL) and in-weights learning (IWL) in LLMs, showing that context similarity structure is crucial and proposing Contrastive-Context training to achieve stable ICL-IWL mixtures.
Details
Motivation: Current LLMs exhibit both in-context learning (ICL) and in-weights learning (IWL) capabilities, but standard fine-tuning often erodes ICL. The paper aims to develop training strategies that co-develop both modes and enable switching between them based on context relevance.Method: Proposes Contrastive-Context training with two types of contrasts: (1) mixing similar and random examples within a context to evolve correct ICL, and (2) varying similarity grades across contexts to evolve ICL-IWL mixtures. Includes theoretical analysis of a minimal model and empirical validation on four LLMs across several tasks.
Result: Random context leads to loss of ICL and IWL dominance, while only similar examples cause ICL to degenerate to copying labels. Contrastive-Context yields stable ICL-IWL mixtures, avoiding collapse into pure ICL, IWL, or copying, as confirmed by diagnostic probes.
Conclusion: Similarity structure between target inputs and context examples plays a crucial role in developing ICL and IWL capabilities. The proposed Contrastive-Context training effectively co-develops both learning modes and enables context-aware switching between them.
Abstract: We investigate training strategies that co-develop in-context learning (ICL) and in-weights learning (IWL), and the ability to switch between them based on context relevance. Although current LLMs exhibit both modes, standard task-specific fine-tuning often erodes ICL, motivating IC-Train - fine-tuning with in-context examples. Prior work has shown that emergence of ICL after IC-Train depends on factors such as task diversity and training duration. In this paper we show that the similarity structure between target inputs and context examples also plays an important role. Random context leads to loss of ICL and IWL dominance, while only similar examples in context causes ICL to degenerate to copying labels without regard to relevance. To address this, we propose a simple Contrastive-Context which enforces two types of contrasts: (1) mix of similar and random examples within a context to evolve a correct form of ICL, and (2) varying grades of similarity across contexts to evolve ICL-IWL mixtures. We present insights on the importance of such contrast with theoretical analysis of a minimal model. We validate with extensive empirical evaluation on four LLMs and several tasks. Diagnostic probes confirm that contrasted contexts yield stable ICL-IWL mixtures, avoiding collapse into pure ICL, IWL, or copying.
[509] Pseudo-Quantized Actor-Critic Algorithm for Robustness to Noisy Temporal Difference Error
Taisuke Kobayashi
Main category: cs.LG
TL;DR: A novel reinforcement learning algorithm that uses control-as-inference framework to derive robust learning rules against noisy TD errors through sigmoid-based optimality modeling and gradient vanishing properties.
Details
Motivation: TD errors in RL are noisy and destabilize learning due to bootstrap computation. Existing heuristics like target networks and ensemble models increase computational cost and reduce efficiency. Need for robust learning algorithms against noisy TD errors.Method: Revisits TD learning based on control as inference. Models optimality distribution with sigmoid function, derives robust learning rules using forward/reverse KL divergences. When sigmoid saturates with large TD errors (likely noise), gradient vanishes, excluding noisy updates. Also decomposes optimality into multiple levels for pseudo-quantization and develops Jensen-Shannon divergence approach.
Result: Demonstrated stable learning on RL benchmarks even when heuristics are insufficient or rewards contain noise. Shows robustness against noisy TD errors through gradient vanishing properties.
Conclusion: Proposed novel RL algorithm derived from control-as-inference framework provides robust learning against noisy TD errors through sigmoid-based optimality modeling and gradient vanishing mechanisms, offering stable learning without heavy computational heuristics.
Abstract: In reinforcement learning (RL), temporal difference (TD) errors are widely adopted for optimizing value and policy functions. However, since the TD error is defined by a bootstrap method, its computation tends to be noisy and destabilize learning. Heuristics to improve the accuracy of TD errors, such as target networks and ensemble models, have been introduced so far. While these are essential approaches for the current deep RL algorithms, they cause side effects like increased computational cost and reduced learning efficiency. Therefore, this paper revisits the TD learning algorithm based on control as inference, deriving a novel algorithm capable of robust learning against noisy TD errors. First, the distribution model of optimality, a binary random variable, is represented by a sigmoid function. Alongside forward and reverse Kullback-Leibler divergences, this new model derives a robust learning rule: when the sigmoid function saturates with a large TD error probably due to noise, the gradient vanishes, implicitly excluding it from learning. Furthermore, the two divergences exhibit distinct gradient-vanishing characteristics. Building on these analyses, the optimality is decomposed into multiple levels to achieve pseudo-quantization of TD errors, aiming for further noise reduction. Additionally, a Jensen-Shannon divergence-based approach is approximately derived to inherit the characteristics of both divergences. These benefits are verified through RL benchmarks, demonstrating stable learning even when heuristics are insufficient or rewards contain noise.
[510] Expert-Choice Routing Enables Adaptive Computation in Diffusion Language Models
Shuibai Zhang, Caspian Zhuang, Chihan Cui, Zhihan Yang, Fred Zhangzhi Peng, Yanxin Zhang, Haoyue Bai, Zack Jia, Yang Zhou, Guanhua Chen, Ming Liu
Main category: cs.LG
TL;DR: Expert-Choice routing outperforms Token-Choice routing for Diffusion Language Models, providing better load balancing, faster convergence, and enabling adaptive computation allocation based on denoising steps.
Details
Motivation: Existing DLM MoE models use Token-Choice routing from autoregressive systems, causing load imbalance and rigid computation allocation. The authors aim to find a better routing paradigm for DLMs that enables deterministic load balancing and adaptive computation.Method: Propose Expert-Choice routing for DLMs with timestep-dependent expert capacity that varies allocation according to denoising steps. Show that existing pretrained TC DLMs can be retrofitted to EC by replacing only the router.
Result: EC routing provides better load balancing, higher throughput, and faster convergence than TC. Allocating more capacity to low-mask-ratio steps achieves best performance under matched FLOPs. Retrofitting existing TC DLMs to EC improves accuracy across diverse downstream tasks.
Conclusion: EC routing is superior to TC for DLM MoE models, enabling computation to be treated as an adaptive policy rather than fixed architectural constant. Low-mask-ratio tokens have higher learning efficiency, justifying compute concentration on these steps.
Abstract: Diffusion language models (DLMs) enable parallel, non-autoregressive text generation, yet existing DLM mixture-of-experts (MoE) models inherit token-choice (TC) routing from autoregressive systems, leading to load imbalance and rigid computation allocation. We show that expert-choice (EC) routing is a better fit for DLMs: it provides deterministic load balancing by design, yielding higher throughput and faster convergence than TC. Building on the property that EC capacity is externally controllable, we introduce timestep-dependent expert capacity, which varies expert allocation according to the denoising step. We find that allocating more capacity to low-mask-ratio steps consistently achieves the best performance under matched FLOPs, and provide a mechanistic explanation: tokens in low-mask-ratio contexts exhibit an order-of-magnitude higher learning efficiency, so concentrating compute on these steps yields the largest marginal return. Finally, we show that existing pretrained TC DLMs can be retrofitted to EC by replacing only the router, achieving faster convergence and improved accuracy across diverse downstream tasks. Together, these results establish EC routing as a superior paradigm for DLM MoE models and demonstrate that computation in DLMs can be treated as an adaptive policy rather than a fixed architectural constant. Code is available at https://github.com/zhangshuibai/EC-DLM.
[511] CRIT: Graph-Based Automatic Data Synthesis to Enhance Cross-Modal Multi-Hop Reasoning
Junyoung Sung, Seungwoo Lyu, Minjun Kim, Sumin An, Arsha Nagrani, Paul Hongsuck Seo
Main category: cs.LG
TL;DR: CRIT introduces a new dataset and benchmark for evaluating cross-modal multi-hop reasoning in vision-language models, addressing limitations of current benchmarks that don’t require combining information across modalities.
Details
Motivation: Current multimodal benchmarks fail to capture real-world reasoning that requires combining information across modalities in a multi-hop process. Most benchmarks rely on single images where answers can be inferred from a single modality alone, and training data rarely enforces complementary, multi-hop reasoning, leading to VLMs that hallucinate and produce poorly grounded reasoning.Method: The authors introduce CRIT, a dataset and benchmark built with a graph-based automatic pipeline for generating complex cross-modal reasoning tasks. The dataset spans diverse domains including natural images, videos, and text-rich sources, and includes a manually verified test set for reliable evaluation.
Result: Experiments show that even state-of-the-art models struggle on CRIT’s reasoning tasks. Models trained on CRIT demonstrate significant gains in cross-modal multi-hop reasoning, including strong improvements on SPIQA and other standard multimodal benchmarks.
Conclusion: CRIT addresses a critical gap in multimodal evaluation by focusing on cross-modal multi-hop reasoning, revealing limitations in current VLMs and providing a valuable training resource that improves performance on both specialized and standard benchmarks.
Abstract: Real-world reasoning often requires combining information across modalities, connecting textual context with visual cues in a multi-hop process. Yet, most multimodal benchmarks fail to capture this ability: they typically rely on single images or set of images, where answers can be inferred from a single modality alone. This limitation is mirrored in the training data, where interleaved image-text content rarely enforces complementary, multi-hop reasoning. As a result, Vision-Language Models (VLMs) frequently hallucinate and produce reasoning traces poorly grounded in visual evidence. To address this gap, we introduce CRIT, a new dataset and benchmark built with a graph-based automatic pipeline for generating complex cross-modal reasoning tasks. CRIT consists of diverse domains ranging from natural images, videos, and text-rich sources, and includes a manually verified test set for reliable evaluation. Experiments on this benchmark reveal that even state-of-the-art models struggle on such reasoning tasks. Models trained on CRIT show significant gains in cross-modal multi-hop reasoning, including strong improvements on SPIQA and other standard multimodal benchmarks.
[512] Label Shift Estimation With Incremental Prior Update
Yunrui Zhang, Gustavo Batista, Salil S. Kanhere
Main category: cs.LG
TL;DR: A new post-hoc method for label shift estimation that incrementally updates priors on each sample to adjust posteriors, outperforming state-of-the-art methods on CIFAR-10 and MNIST datasets.
Details
Motivation: Real-world scenarios often violate the assumption of identical training and testing label distributions, such as in medical diagnosis, fraud detection, and social media analysis where label distributions change over time and across contexts.Method: Proposes an incremental prior update approach for label shift estimation that adjusts posteriors sample-by-sample, based on intuitive assumptions about modern probabilistic classifiers and requiring weaker calibration than existing methods.
Result: Experiments on CIFAR-10 and MNIST show the method consistently outperforms current state-of-the-art maximum likelihood-based approaches under different calibration settings and varying intensities of label shift.
Conclusion: The proposed post-hoc method provides a versatile, black-box compatible approach for accurate label shift estimation that works with any probabilistic classifier and demonstrates superior performance over existing methods.
Abstract: An assumption often made in supervised learning is that the training and testing sets have the same label distribution. However, in real-life scenarios, this assumption rarely holds. For example, medical diagnosis result distributions change over time and across locations; fraud detection models must adapt as patterns of fraudulent activity shift; the category distribution of social media posts changes based on trending topics and user demographics. In the task of label shift estimation, the goal is to estimate the changing label distribution $p_t(y)$ in the testing set, assuming the likelihood $p(x|y)$ does not change, implying no concept drift. In this paper, we propose a new approach for post-hoc label shift estimation, unlike previous methods that perform moment matching with confusion matrix estimated from a validation set or maximize the likelihood of the new data with an expectation-maximization algorithm. We aim to incrementally update the prior on each sample, adjusting each posterior for more accurate label shift estimation. The proposed method is based on intuitive assumptions on classifiers that are generally true for modern probabilistic classifiers. The proposed method relies on a weaker notion of calibration compared to other methods. As a post-hoc approach for label shift estimation, the proposed method is versatile and can be applied to any black-box probabilistic classifier. Experiments on CIFAR-10 and MNIST show that the proposed method consistently outperforms the current state-of-the-art maximum likelihood-based methods under different calibrations and varying intensities of label shift.
[513] Cognitive Energy Modeling for Neuroadaptive Human-Machine Systems using EEG and WGAN-GP
Sriram Sattiraju, Vaibhav Gollapalli, Aryan Shah, Timothy McMahan
Main category: cs.LG
TL;DR: Using Schrödinger Bridge Problem to model cognitive state transitions in EEG data and evaluating whether GAN-generated synthetic EEG preserves the underlying dynamical structure for energy-based modeling.
Details
Motivation: EEG provides insight into brain dynamics, but modeling real-time state transitions and quantifying the energy required remains challenging. While GANs are used to augment EEG data, it's unclear if synthetic EEG preserves the dynamical structure needed for transition-based analysis.Method: Uses Schrödinger Bridge Problem (SBP) as a probabilistic framework to model efficient evolution between brain states, interpreting this as cognitive energy cost. SBP-derived transport cost serves as a metric to evaluate whether GAN-generated EEG retains the distributional geometry needed for energy-based modeling of cognitive state transitions.
Result: Strong agreement in transition energies derived from real and synthetic EEG collected during Stroop tasks across both group and participant-level analyses, indicating synthetic EEG preserves the transition structure required for SBP-based modeling.
Conclusion: Synthetic EEG preserves the transition structure needed for SBP-based modeling, enabling its use in data-efficient neuroadaptive systems. A framework is presented where SBP-derived cognitive energy serves as a control signal for adaptive human-machine systems, supporting real-time adjustment based on user cognitive and affective state.
Abstract: Electroencephalography (EEG) provides a non-invasive insight into the brain’s cognitive and emotional dynamics. However, modeling how these states evolve in real time and quantifying the energy required for such transitions remains a major challenge. The Schrödinger Bridge Problem (SBP) offers a principled probabilistic framework to model the most efficient evolution between the brain states, interpreted as a measure of cognitive energy cost. While generative models such as GANs have been widely used to augment EEG data, it remains unclear whether synthetic EEG preserves the underlying dynamical structure required for transition-based analysis. In this work, we address this gap by using SBP-derived transport cost as a metric to evaluate whether GAN-generated EEG retains the distributional geometry necessary for energy-based modeling of cognitive state transitions. We compare transition energies derived from real and synthetic EEG collected during Stroop tasks and demonstrate strong agreement across group and participant-level analyses. These results indicate that synthetic EEG preserves the transition structure required for SBP-based modeling, enabling its use in data-efficient neuroadaptive systems. We further present a framework in which SBP-derived cognitive energy serves as a control signal for adaptive human-machine systems, supporting real-time adjustment of system behavior in response to user cognitive and affective state.
[514] Coupled Query-Key Dynamics for Attention
Barak Gahtan, Alex M. Bronstein
Main category: cs.LG
TL;DR: Coupled QK dynamics improves language modeling by evolving queries and keys jointly through shared learned dynamics before scoring, enhancing perplexity and training stability with minimal parameter overhead.
Details
Motivation: Standard attention uses static, independent projections of queries and keys. The authors hypothesize that jointly evolving queries and keys through shared dynamics before scoring could improve language modeling performance and training stability.Method: Introduces coupled QK dynamics where queries and keys evolve jointly through shared learned dynamics before attention scoring. Compares symplectic (Hamiltonian) and non-symplectic (Euler) integrators, with structural ablation studies to isolate coupling as the key factor.
Result: Achieves 6.6-6.9% perplexity reduction on WikiText-103 with only 0.11% additional parameters. Coupling improves sample efficiency (2.4× fewer tokens needed), benefits scale to 150M parameters but narrow at 350M. Works well on domain-coherent text but degrades on heterogeneous web data.
Conclusion: Coupled QK dynamics is an effective sample-efficiency mechanism for language modeling, particularly beneficial for domain-coherent text. The coupling itself, not the specific integration method, drives improvements.
Abstract: Standard scaled dot-product attention computes scores from static, independent projections of the input. We show that evolving queries and keys \emph{jointly} through shared learned dynamics before scoring - which we call \textbf{coupled QK dynamics} - improves language modeling perplexity and training stability. On WikiText-103 at 60M parameters, coupled dynamics achieves 22.55–22.62 perplexity vs.\ 24.22 for standard attention ($-$6.6–6.9%), with only 0.11% additional parameters (shared across both instantiations). A structural ablation isolates coupling as the active ingredient: a symplectic (Hamiltonian) and a non-symplectic (Euler) integrator perform identically when both couple Q and K, while an uncoupled MLP baseline of matched capacity reaches only 23.81 with 8$\times$ higher seed variance. The integration step count (1–7) is similarly irrelevant - a single coupled step suffices. A compute-matched comparison reveals that coupling is a \emph{sample-efficiency} mechanism: standard attention trained for 2.4$\times$ longer (matching wall-clock) reaches the same perplexity, but requires 2.4$\times$ more tokens. The advantage scales to 150M ($-$6.7%) but narrows at 350M ($-$1.0%), where Differential Attention (18.93) overtakes coupled dynamics (19.35). The benefit is corpus-dependent: coupling helps on domain-coherent text (WikiText-103 $-$6.6%, PubMed $-$4.5%) but degrades on heterogeneous web text ($+$10.3%) and shows no benefit on GLUE. We characterize when coupling helps and when it does not, providing practical guidelines.
[515] MiCA Learns More Knowledge Than LoRA and Full Fine-Tuning
Sten Rüdiger, Sebastian Raschka
Main category: cs.LG
TL;DR: MiCA is a parameter-efficient fine-tuning method for LLMs that adapts underutilized subspaces via minor singular vectors, achieving better knowledge acquisition with fewer parameters than LoRA.
Details
Motivation: Conventional fine-tuning methods like LoRA target dominant subspaces, but the authors hypothesize that adapting underutilized subspaces (minor components) could be more efficient for integrating new knowledge into pre-trained models.Method: Uses Singular Value Decomposition to identify subspaces related to minor singular vectors (least significant singular values), then constrains parameter updates during fine-tuning to these minor directions only.
Result: Achieves up to 5.9x improvement in knowledge acquisition with optimized hyperparameters, using only 6-60% of parameters compared to LoRA, demonstrating more efficient and stable adaptation.
Conclusion: Constraining adaptation to minor singular directions provides a superior mechanism for integrating new knowledge into pre-trained language models with minimal parameter footprint.
Abstract: Minor Component Adaptation (MiCA) is a novel parameter-efficient fine-tuning method for large language models that focuses on adapting underutilized subspaces of model representations. Unlike conventional methods such as Low-Rank Adaptation (LoRA), which target dominant subspaces, MiCA leverages Singular Value Decomposition to identify subspaces related to minor singular vectors associated with the least significant singular values and constrains the update of parameters during fine-tuning to those directions. This strategy leads to up to 5.9x improvement in knowledge acquisition under optimized training hyperparameters and a minimal parameter footprint of 6-60% compared to LoRA. These results suggest that constraining adaptation to minor singular directions provides a more efficient and stable mechanism for integrating new knowledge into pre-trained language models.
[516] Transformer self-attention encoder-decoder with multimodal deep learning for response time series forecasting and digital twin support in wind structural health monitoring
Feiyu Zhou, Marios Impraimakis
Main category: cs.LG
TL;DR: Transformer-based digital twin for bridge structural health monitoring using wind-induced vibration forecasting and anomaly detection
Details
Motivation: Current wind-excited dynamic behavior forecasting suffers from uncertainty when environmental or traffic conditions change, making it hard to distinguish normal from abnormal vibration behavior. There's a need for models that don't require assumptions about wind stationarity or structural normal vibration behavior.Method: A novel transformer methodology that: 1) uses temporal characteristics to train a forecasting model, 2) compares vibration predictions to measured data to detect large deviations, and 3) uses identified anomalies as early-warning indicators of structural change. The approach is tested on real-world measurements from the Hardanger Bridge.
Result: The AI-based model outperforms existing approaches for response forecasting, accurately captures structural behavior in realistic conditions, and adapts to changes in system excitation. The transformer-based digital twin shows potential for resilient infrastructure management.
Conclusion: Transformer-based digital twin components can serve as next-generation tools for continuous learning and adaptive monitoring over infrastructure lifecycle, particularly for handling temporal characteristics and changing environmental conditions.
Abstract: The wind-induced structural response forecasting capabilities of a novel transformer methodology are examined here. The model also provides a digital twin component for bridge structural health monitoring. Firstly, the approach uses the temporal characteristics of the system to train a forecasting model. Secondly, the vibration predictions are compared to the measured ones to detect large deviations. Finally, the identified cases are used as an early-warning indicator of structural change. The artificial intelligence-based model outperforms approaches for response forecasting as no assumption on wind stationarity or on structural normal vibration behavior is needed. Specifically, wind-excited dynamic behavior suffers from uncertainty related to obtaining poor predictions when the environmental or traffic conditions change. This results in a hard distinction of what constitutes normal vibration behavior. To this end, a framework is rigorously examined on real-world measurements from the Hardanger Bridge monitored by the Norwegian University of Science and Technology. The approach captures accurate structural behavior in realistic conditions, and with respect to the changes in the system excitation. The results, importantly, highlight the potential of transformer-based digital twin components to serve as next-generation tools for resilient infrastructure management, continuous learning, and adaptive monitoring over the system’s lifecycle with respect to temporal characteristics.
[517] MATA-Former & SIICU: Semantic Aware Temporal Alignment for High-Fidelity ICU Risk Prediction
Zhichong Zheng, Xiaohang Nie, Xueqi Wang, Yuanjin Zhao, Haitao Zhang, Yichao Tang
Main category: cs.LG
TL;DR: MATA-Former: A medical-semantics aware transformer that uses event semantics to parameterize attention weights for clinical risk forecasting, with Plateau-Gaussian soft labeling for continuous multi-horizon regression.
Details
Motivation: Current clinical risk forecasting methods rely too much on chronological proximity rather than intrinsic pathological dependencies, and suffer from coarse binary supervision and physical timestamps that don't align with clinical logic.Method: Proposes MATA-Former (Medical-semantics Aware Time-ALiBi Transformer) that uses event semantics to dynamically parameterize attention weights, prioritizing causal validity over time lags. Also introduces Plateau-Gaussian Soft Labeling (PSL) to reformulate binary classification into continuous multi-horizon regression for full-trajectory risk modeling.
Result: Evaluated on SIICU (new dataset with 506k+ events with expert-verified annotations) and MIMIC-IV, the framework demonstrates superior efficacy and robust generalization in capturing risks from text-intensive, irregular clinical time series.
Conclusion: The proposed approach better aligns predictive modeling with clinical logic by focusing on pathological dependencies rather than chronological proximity, showing improved performance in clinical risk forecasting.
Abstract: Forecasting evolving clinical risks relies on intrinsic pathological dependencies rather than mere chronological proximity, yet current methods struggle with coarse binary supervision and physical timestamps. To align predictive modeling with clinical logic, we propose the Medical-semantics Aware Time-ALiBi Transformer (MATA-Former), utilizing event semantics to dynamically parameterize attention weights to prioritize causal validity over time lags. Furthermore, we introduce Plateau-Gaussian Soft Labeling (PSL), reformulating binary classification into continuous multi-horizon regression for full-trajectory risk modeling. Evaluated on SIICU – a newly constructed dataset featuring over 506k events with rigorous expert-verified, fine-grained annotations – and the MIMIC-IV dataset, our framework demonstrates superior efficacy and robust generalization in capturing risks from text-intensive, irregular clinical time series.
[518] Koopman-Based Nonlinear Identification and Adaptive Control of a Turbofan Engine
David Grasev
Main category: cs.LG
TL;DR: Koopman operator-based multivariable control for turbofan engines using physics-based modeling, meta-heuristic system identification, and two controller designs (adaptive MPC and feedback linearization) evaluated under various flight conditions.
Details
Motivation: To develop robust multivariable control for turbofan engines that can handle multiple control objectives (spool speeds and engine pressure ratio) and adapt to varying flight conditions using Koopman operator theory.Method: Developed physics-based component-level model for training data; created meta-heuristic extended dynamic mode decomposition for Koopman model identification; designed two controllers: adaptive Koopman-based MPC with disturbance observer and Koopman-based feedback linearization controller.
Result: Koopman model accurately predicts both spool speeds and EPR; both control strategies achieve comparable steady-state performance; adaptive MPC shows superior robustness under varying flight conditions; EPR control improves thrust response.
Conclusion: Koopman-based control is applicable to turbofan engines, with adaptive MPC framework offering robustness advantages for engine control under varying flight conditions.
Abstract: This paper investigates Koopman operator-based approaches for multivariable control of a two-spool turbofan engine. A physics-based component-level model is developed to generate training data and validate the controllers. A meta-heuristic extended dynamic mode decomposition is developed, with a cost function designed to accurately capture both spool-speed dynamics and the engine pressure ratio (EPR), enabling the construction of a single Koopman model suitable for multiple control objectives. Using the identified time-varying Koopman model, two controllers are developed: an adaptive Koopman-based model predictive controller (AKMPC) with a disturbance observer and a Koopman-based feedback linearization controller (K-FBLC), which serves as a benchmark. The controllers are evaluated for two control strategies, namely configurations of spool speeds and EPR, under both sea-level and varying flight conditions. The results demonstrate that the proposed identification approach enables accurate predictions of both spool speeds and EPR, allowing the Koopman model to be reused flexibly across different control formulations. While both control strategies achieve comparable performance in steady conditions, the AKMPC exhibits superior robustness compared with the K-FBLC under varying flight conditions due to its ability to compensate for model mismatch. Moreover, the EPR control strategy improves the thrust response. The study highlights the applicability of Koopman-based control and demonstrates the advantages of the AKMPC-based framework for robust turbofan engine control.
[519] DDCL: Deep Dual Competitive Learning: A Differentiable End-to-End Framework for Unsupervised Prototype-Based Representation Learning
Giansalvo Cirrincione
Main category: cs.LG
TL;DR: DDCL introduces the first fully differentiable end-to-end framework for unsupervised prototype-based representation learning by replacing external k-means with an internal Dual Competitive Layer, enabling direct optimization of cluster quality through backpropagation.
Details
Motivation: Addresses the structural disconnect between feature learning and cluster assignment in deep clustering, where most methods use external clustering steps (like k-means) to generate pseudo-labels, preventing direct optimization for cluster quality.Method: Proposes Deep Dual Competitive Learning (DDCL) with a Dual Competitive Layer (DCL) that generates prototypes as native differentiable outputs, making the entire pipeline trainable by backpropagation through a unified loss without external clustering steps.
Result: DDCL outperforms its non-differentiable ablation by 65% in clustering accuracy and DeepCluster end-to-end by 122%. The decomposition identity holds across 100k+ epochs, and the negative feedback cycle shows Pearson -0.98 correlation.
Conclusion: DDCL successfully bridges the gap between feature learning and cluster assignment through a fully differentiable framework, enabling direct optimization of cluster quality and demonstrating superior performance over existing methods.
Abstract: A persistent structural weakness in deep clustering is the disconnect between feature learning and cluster assignment. Most architectures invoke an external clustering step, typically k-means, to produce pseudo-labels that guide training, preventing the backbone from directly optimising for cluster quality. This paper introduces Deep Dual Competitive Learning (DDCL), the first fully differentiable end-to-end framework for unsupervised prototype-based representation learning. The core contribution is architectural: the external k-means is replaced by an internal Dual Competitive Layer (DCL) that generates prototypes as native differentiable outputs of the network. This single inversion makes the complete pipeline, from backbone feature extraction through prototype generation to soft cluster assignment, trainable by backpropagation through a single unified loss, with no Lloyd iterations, no pseudo-label discretisation, and no external clustering step. To ground the framework theoretically, the paper derives an exact algebraic decomposition of the soft quantisation loss into a simplex-constrained reconstruction error and a non-negative weighted prototype variance term. This identity reveals a self-regulating mechanism built into the loss geometry: the gradient of the variance term acts as an implicit separation force that resists prototype collapse without any auxiliary objective, and leads to a global Lyapunov stability theorem for the reduced frozen-encoder system. Six blocks of controlled experiments validate each structural prediction. The decomposition identity holds with zero violations across more than one hundred thousand training epochs; the negative feedback cycle is confirmed with Pearson -0.98; with a jointly trained backbone, DDCL outperforms its non-differentiable ablation by 65% in clustering accuracy and DeepCluster end-to-end by 122%.
[520] FourierMoE: Fourier Mixture-of-Experts Adaptation of Large Language Models
Juyong Jiang, Fan Wang, Hong Qi, Sunghun Kim, Jing Tang
Main category: cs.LG
TL;DR: FourierMoE: A spectral-domain mixture-of-experts approach for parameter-efficient fine-tuning of LLMs that uses frequency-aware adaptation to reduce task interference and improve multi-task performance.
Details
Motivation: Standard PEFT methods struggle with multi-task fine-tuning due to task interference and limited parameter budgets. Existing MoE approaches operate in spatial domain, causing structural redundancy and parameter overhead. The authors discovered that different tasks have distinct frequency energy distributions and LLM layers show heterogeneous frequency sensitivities.Method: Proposes FourierMoE which integrates MoE architecture with inverse discrete Fourier transform (IDFT) for frequency-aware adaptation. Uses frequency-adaptive router to dispatch tokens to experts specialized in different frequency bands. Each expert learns conjugate-symmetric complex coefficients that preserve phase and amplitude information while guaranteeing lossless IDFT reconstruction into real-valued spatial weights.
Result: Extensive evaluations across 28 benchmarks, multiple model architectures and scales show FourierMoE consistently outperforms competitive baselines in both single-task and multi-task settings while using significantly fewer trainable parameters.
Conclusion: Spectral-domain expert adaptation is an effective and parameter-efficient paradigm for LLM fine-tuning, addressing limitations of spatial-domain approaches through frequency-aware task specialization.
Abstract: Parameter-efficient fine-tuning (PEFT) has emerged as a crucial paradigm for adapting large language models (LLMs) under constrained computational budgets. However, standard PEFT methods often struggle in multi-task fine-tuning settings, where diverse optimization objectives induce task interference and limited parameter budgets lead to representational deficiency. While recent approaches incorporate mixture-of-experts (MoE) to alleviate these issues, they predominantly operate in the spatial domain, which may introduce structural redundancy and parameter overhead. To overcome these limitations, we reformulate adaptation in the spectral domain. Our spectral analysis reveals that different tasks exhibit distinct frequency energy distributions, and that LLM layers display heterogeneous frequency sensitivities. Motivated by these insights, we propose FourierMoE, which integrates the MoE architecture with the inverse discrete Fourier transform (IDFT) for frequency-aware adaptation. Specifically, FourierMoE employs a frequency-adaptive router to dispatch tokens to experts specialized in distinct frequency bands. Each expert learns a set of conjugate-symmetric complex coefficients, preserving complete phase and amplitude information while theoretically guaranteeing lossless IDFT reconstruction into real-valued spatial weights. Extensive evaluations across 28 benchmarks, multiple model architectures, and scales demonstrate that FourierMoE consistently outperforms competitive baselines in both single-task and multi-task settings while using significantly fewer trainable parameters. These results highlight the promise of spectral-domain expert adaptation as an effective and parameter-efficient paradigm for LLM fine-tuning.
[521] Dual-Attention Based 3D Channel Estimation
Xiangzhao Qin, Sha Hu
Main category: cs.LG
TL;DR: 3DCENet: A dual attention mechanism network for 3D channel estimation in MIMO systems, addressing complexity issues of traditional LMMSE methods.
Details
Motivation: Traditional LMMSE-based 3D channel estimation for MIMO channels is computationally prohibitive due to large matrix dimensions, while suboptimal approximations suffer performance degradation in correlated MIMO channels.Method: Proposes 3DCENet using dual attention mechanisms to explore channel correlations across time, frequency, and spatial domains, leveraging deep learning capabilities for accurate 3D channel estimation.
Result: The network achieves accurate channel estimates by effectively capturing correlations across all domains through attention mechanisms.
Conclusion: Deep learning with attention mechanisms provides an effective solution for computationally efficient and accurate 3D channel estimation in MIMO systems.
Abstract: For multi-input and multi-output (MIMO) channels, the optimal channel estimation (CE) based on linear minimum mean square error (LMMSE) requires three-dimensional (3D) filtering. However, the complexity is often prohibitive due to large matrix dimensions. Suboptimal estimators approximate 3DCE by decomposing it into time, frequency, and spatial domains, while yields noticeable performance degradation under correlated MIMO channels. On the other hand, recent advances in deep learning (DL) can explore channel correlations in all domains via attention mechanisms. Building on this capability, we propose a dual attention mechanism based 3DCE network (3DCENet) that can achieve accurate estimates.
[522] Bridging Deep Learning and Integer Linear Programming: A Predictive-to-Prescriptive Framework for Supply Chain Analytics
Khai Banh Nghiep, Duc Nguyen Minh, Lan Hoang Thi
Main category: cs.LG
TL;DR: A three-step framework combining forecasting and operational analytics for retail demand forecasting, comparing N-BEATS and N-HiTS deep learning models with statistical benchmarks, then using optimal forecasts for delivery optimization via integer linear programming.
Details
Motivation: Retail demand forecasting faces challenges with irreconcilable seasonality, irregular spikes, and noise in actual data, making precise projections difficult despite being critical for supply chain planning.Method: Three-step framework: 1) Exploratory data analysis on 180,519 transactions to examine trends, seasonality, and delivery attributes; 2) Compare forecasting performance of N-BEATS MSTL (statistical) and N-HiTS (deep learning); 3) Use optimal forecasts in integer linear programming to minimize total delivery time under budget, capacity, and service constraints.
Result: N-HiTS and N-BEATS outperformed statistical benchmarks significantly. N-BEATS was selected as most optimized with lowest forecasting error, predicting next 4 weeks demand of 1918 units. The integer linear program provided feasible, cost-optimal shipping plans.
Conclusion: The study demonstrates practical impact of precise forecasting combined with interpretable model optimization in logistics, showing how advanced forecasting models can be integrated with operational decision-making for supply chain efficiency.
Abstract: Although demand forecasting is a critical component of supply chain planning, actual retail data can exhibit irreconcilable seasonality, irregular spikes, and noise, rendering precise projections nearly unattainable. This paper proposes a three-step analytical framework that combines forecasting and operational analytics. The first stage consists of exploratory data analysis, where delivery-tracked data from 180,519 transactions are partitioned, and long-term trends, seasonality, and delivery-related attributes are examined. Secondly, the forecasting performance of a statistical time series decomposition model N-BEATS MSTL and a recent deep learning architecture N-HiTS were compared. N-BEATS and N-HiTS were both statistically, and hence were N-BEATS’s and N-HiTS’s statistically selected. Most recent time series deep learning models, N-HiTS, N-BEATS. N-HiTS and N-BEATS N-HiTS and N-HiTS outperformed the statistical benchmark to a large extent. N-BEATS was selected to be the most optimized model, as the one with the lowest forecasting error, in the 3rd and final stage forecasting values of the next 4 weeks of 1918 units, and provided those as a model with a set of deterministically integer linear program outcomes that are aimed to minimize the total delivery time with a set of bound budget, capacity, and service constraints. The solution allocation provided a feasible and cost-optimal shipping plan. Overall, the study provides a compelling example of the practical impact of precise forecasting and simple, highly interpretable model optimization in logistics.
[523] Graph Neural Operator Towards Edge Deployability and Portability for Sparse-to-Dense, Real-Time Virtual Sensing on Irregular Grids
William Howes, Jason Yoo, Kazuma Kobayashi, Subhankar Sarkar, Farid Ahmed, Souvik Chakraborty, Syed Bahauddin Alam
Main category: cs.LG
TL;DR: VIRSO is a graph-based neural operator for sparse-to-dense reconstruction on irregular geometries that enables real-time virtual sensing with low power consumption, achieving sub-1% errors on nuclear thermal-hydraulic benchmarks.
Details
Motivation: Dense instrumentation for physical field sensing is often infeasible due to cost and constraints, while physics-based solvers have high computational latency and power requirements unsuitable for real-time monitoring in resource-constrained systems.Method: VIRSO combines spectral and spatial analysis for sparse-to-dense reconstruction on irregular geometries, using a variable-connectivity algorithm (V-KNN) for mesh-informed graph construction, treating inference as measurement rather than traditional computation.
Result: Achieves mean relative L2 errors below 1% on nuclear thermal-hydraulic benchmarks with reconstruction ratios from 47:1 to 156:1, reduces energy-delay product from ~206 J·ms to 10.1 J·ms on H200, and achieves sub-10W power consumption with sub-second latency on Jetson Orin Nano.
Conclusion: VIRSO establishes edge-feasibility and hardware-portability for real-time virtual sensing, presenting compute-aware operator learning as a new paradigm for resource-constrained environments.
Abstract: Accurate sensing of spatially distributed physical fields typically requires dense instrumentation, which is often infeasible in real-world systems due to cost, accessibility, and environmental constraints. Physics-based solvers address this through direct numerical integration of governing equations, but their computational latency and power requirements preclude real-time use in resource-constrained monitoring and control systems. Here we introduce VIRSO (Virtual Irregular Real-Time Sparse Operator), a graph-based neural operator for sparse-to-dense reconstruction on irregular geometries, and a variable-connectivity algorithm, Variable KNN (V-KNN), for mesh-informed graph construction. Unlike prior neural operators that treat hardware deployability as secondary, VIRSO reframes inference as measurement: the combination of both spectral and spatial analysis provides accurate reconstruction without the high latency and power consumption of previous graph-based methodologies with poor scalability, presenting VIRSO as a potential candidate for edge-constrained, real-time virtual sensing. We evaluate VIRSO on three nuclear thermal-hydraulic benchmarks of increasing geometric and multiphysics complexity, across reconstruction ratios from 47:1 to 156:1. VIRSO achieves mean relative $L_2$ errors below 1%, outperforming other benchmark operators while using fewer parameters. The full 10-layer configuration reduces the energy-delay product (EDP) from ${\approx}206$ J$\cdot$ms for the graph operator baseline to $10.1$ J$\cdot$ms on an NVIDIA H200. Implemented on an NVIDIA Jetson Orin Nano, all configurations of VIRSO provide sub-10 W power consumption and sub-second latency. These results establish the edge-feasibility and hardware-portability of VIRSO and present compute-aware operator learning as a new paradigm for real-time sensing in inaccessible and resource-constrained environments.
[524] Physics Informed Reinforcement Learning with Gibbs Priors for Topology Control in Power Grids
Pantelis Dogoulis, Maxime Cordy
Main category: cs.LG
TL;DR: Physics-informed Reinforcement Learning framework for power grid topology control using semi-Markov control with Gibbs prior to reduce action space and computational cost while maintaining performance.
Details
Motivation: Power grid topology control is challenging due to combinatorial action space growth with grid size and expensive simulation-based action evaluation, requiring more efficient decision-making approaches.Method: Combines semi-Markov control with physics-informed Gibbs prior; uses GNN surrogate to predict post-action overload risk; constructs state-dependent candidate action sets; reweights policy logits before selection; reduces exploration difficulty and simulation cost.
Result: Achieves strong balance between control quality and efficiency: matches oracle performance 6× faster on first benchmark; reaches 94.6% of oracle reward with 200× lower decision time on second; improves over PPO baseline by 255% in reward and 284% in survived steps on hardest benchmark while being 2.5× faster than engineering baseline.
Conclusion: The proposed physics-informed RL framework provides an effective mechanism for power grid topology control by efficiently handling combinatorial action spaces while maintaining performance through physics-informed priors.
Abstract: Topology control for power grid operation is a challenging sequential decision making problem because the action space grows combinatorially with the size of the grid and action evaluation through simulation is computationally expensive. We propose a physics-informed Reinforcement Learning framework that combines semi-Markov control with a Gibbs prior, that encodes the system’s physics, over the action space. The decision is only taken when the grid enters a hazardous regime, while a graph neural network surrogate predicts the post action overload risk of feasible topology actions. These predictions are used to construct a physics-informed Gibbs prior that both selects a small state-dependent candidate set and reweights policy logits before action selection. In this way, our method reduces exploration difficulty and online simulation cost while preserving the flexibility of a learned policy. We evaluate the approach in three realistic benchmark environments of increasing difficulty. Across all settings, the proposed method achieves a strong balance between control quality and computational efficiency: it matches oracle-level performance while being approximately $6\times$ faster on the first benchmark, reaches $94.6%$ of oracle reward with roughly $200\times$ lower decision time on the second one, and on the most challenging benchmark improves over a PPO baseline by up to $255%$ in reward and $284%$ in survived steps while remaining about $2.5\times$ faster than a strong specialized engineering baseline. These results show that our method provides an effective mechanism for topology control in power grids.
[525] CANDI: Curated Test-Time Adaptation for Multivariate Time-Series Anomaly Detection Under Distribution Shift
HyunGi Kim, Jisoo Mok, Hyungyu Lee, Juhyeon Shin, Sungroh Yoon
Main category: cs.LG
TL;DR: CANDI: A test-time adaptation framework for multivariate time-series anomaly detection that selectively adapts to distribution shifts by identifying potential false positives and updating models with spatiotemporal awareness.
Details
Motivation: Real-world multivariate time-series anomaly detection suffers from performance degradation due to distribution shifts. Test-time adaptation is promising but needs careful curation to avoid overfitting to anomalies while preserving pre-trained knowledge.Method: Proposes CANDI with False Positive Mining (FPM) to curate adaptation samples based on anomaly scores and latent similarity, and Spatiotemporally-Aware Normality Adaptation (SANA) module for structurally informed model updates.
Result: CANDI significantly improves performance under distribution shift, improving AUROC up to 14% while using fewer adaptation samples compared to baseline methods.
Conclusion: CANDI effectively addresses distribution shift in multivariate time-series anomaly detection through curated test-time adaptation, balancing adaptation to new distributions while preserving pre-trained knowledge.
Abstract: Multivariate time-series anomaly detection (MTSAD) aims to identify deviations from normality in multivariate time-series and is critical in real-world applications. However, in real-world deployments, distribution shifts are ubiquitous and cause severe performance degradation in pre-trained anomaly detector. Test-time adaptation (TTA) updates a pre-trained model on-the-fly using only unlabeled test data, making it promising for addressing this challenge. In this study, we propose CANDI (Curated test-time adaptation for multivariate time-series ANomaly detection under DIstribution shift), a novel TTA framework that selectively identifies and adapts to potential false positives while preserving pre-trained knowledge. CANDI introduces a False Positive Mining (FPM) strategy to curate adaptation samples based on anomaly scores and latent similarity, and incorporates a plug-and-play Spatiotemporally-Aware Normality Adaptation (SANA) module for structurally informed model updates. Extensive experiments demonstrate that CANDI significantly improves the performance of MTSAD under distribution shift, improving AUROC up to 14% while using fewer adaptation samples.
[526] Towards Intrinsically Calibrated Uncertainty Quantification in Industrial Data-Driven Models via Diffusion Sampler
Yiran Ma, Jerome Le Ny, Zhichao Chen, Zhihuan Song
Main category: cs.LG
TL;DR: Diffusion-based posterior sampling framework for well-calibrated predictive uncertainty in industrial process monitoring
Details
Motivation: Data-driven models are crucial for real-time monitoring in process industries when key performance indicators are hard to measure directly. While accurate predictions are essential, reliable uncertainty quantification (UQ) is equally critical for safety, reliability, and decision-making, but remains a major challenge in current data-driven approaches.Method: Introduces a diffusion-based posterior sampling framework that inherently produces well-calibrated predictive uncertainty via faithful posterior sampling, eliminating the need for post-hoc calibration.
Result: In extensive evaluations on synthetic distributions, the Raman-based phenylacetic acid soft sensor benchmark, and a real ammonia synthesis case study, the method achieves practical improvements over existing UQ techniques in both uncertainty calibration and predictive accuracy.
Conclusion: Diffusion samplers represent a principled and scalable paradigm for advancing uncertainty-aware modeling in industrial applications.
Abstract: In modern process industries, data-driven models are important tools for real-time monitoring when key performance indicators are difficult to measure directly. While accurate predictions are essential, reliable uncertainty quantification (UQ) is equally critical for safety, reliability, and decision-making, but remains a major challenge in current data-driven approaches. In this work, we introduce a diffusion-based posterior sampling framework that inherently produces well-calibrated predictive uncertainty via faithful posterior sampling, eliminating the need for post-hoc calibration. In extensive evaluations on synthetic distributions, the Raman-based phenylacetic acid soft sensor benchmark, and a real ammonia synthesis case study, our method achieves practical improvements over existing UQ techniques in both uncertainty calibration and predictive accuracy. These results highlight diffusion samplers as a principled and scalable paradigm for advancing uncertainty-aware modeling in industrial applications.
[527] Robust Graph Representation Learning via Adaptive Spectral Contrast
Zhuolong Li, Boxue Yang, Haopeng Chen
Main category: cs.LG
TL;DR: ASPECT is a spectral graph contrastive learning framework that addresses the spectral dilemma in handling both homophilic and heterophilic graphs through reliability-aware spectral gating.
Details
Motivation: The paper identifies a fundamental spectral dilemma: while high-frequency signals are essential for encoding heterophily in graphs, they exhibit significantly higher variance under spectrally concentrated perturbations. Existing global spectral fusion approaches are provably sub-optimal for mixed graphs with separated node-wise frequency preferences.Method: ASPECT uses a reliability-aware spectral gating mechanism formulated as a minimax game. It employs a node-wise gate that dynamically re-weights frequency channels based on their stability against a purpose-built adversary, which explicitly targets spectral energy distributions via a Rayleigh quotient penalty.
Result: ASPECT achieves new state-of-the-art performance on 8 out of 9 benchmarks, effectively decoupling meaningful structural heterophily from incidental noise.
Conclusion: The proposed framework resolves the spectral dilemma in graph contrastive learning by learning representations that are both structurally discriminative and spectrally robust through adaptive node-wise spectral fusion.
Abstract: Spectral graph contrastive learning has emerged as a unified paradigm for handling both homophilic and heterophilic graphs by leveraging high-frequency components. However, we identify a fundamental spectral dilemma: while high-frequency signals are indispensable for encoding heterophily, our theoretical analysis proves they exhibit significantly higher variance under spectrally concentrated perturbations. We derive a regret lower bound showing that existing global (node-agnostic) spectral fusion is provably sub-optimal: on mixed graphs with separated node-wise frequency preferences, any global fusion strategy incurs non-vanishing regret relative to a node-wise oracle. To escape this bound, we propose ASPECT, a framework that resolves this dilemma through a reliability-aware spectral gating mechanism. Formulated as a minimax game, ASPECT employs a node-wise gate that dynamically re-weights frequency channels based on their stability against a purpose-built adversary, which explicitly targets spectral energy distributions via a Rayleigh quotient penalty. This design forces the encoder to learn representations that are both structurally discriminative and spectrally robust. Empirical results show that ASPECT achieves new state-of-the-art performance on 8 out of 9 benchmarks, effectively decoupling meaningful structural heterophily from incidental noise.
[528] DDCL-INCRT: A Self-Organising Transformer with Hierarchical Prototype Structure (Theoretical Foundations)
Giansalvo Cirrincione
Main category: cs.LG
TL;DR: DDCL-INCRT is a self-organizing transformer architecture that automatically determines its structure during training through two mechanisms: DDCL for prototype learning and INCRT for incremental head addition, resulting in minimal task-specific architectures.
Details
Motivation: Traditional transformers require pre-defined architecture decisions (number of heads, depth, width) without task knowledge, leading to systematically oversized networks. Empirical studies show many components can be pruned post-training without performance loss, indicating inefficiency.Method: Combines two mechanisms: 1) DDCL replaces feedforward blocks with learned prototype vectors that automatically separate via training objective, and 2) INCRT starts with one attention head and adds new heads only when uncaptured directional information exceeds a threshold.
Result: The two mechanisms reinforce each other: new heads amplify prototype separation, which triggers further head additions. The network self-organizes into a unique, minimal hierarchical structure of heads ordered by representational granularity.
Conclusion: DDCL-INCRT provides a principled approach to derive rather than design transformer architectures, with formal guarantees of stability, convergence, and pruning safety, resulting in the smallest sufficient architecture for each task.
Abstract: Modern neural networks of the transformer family require the practitioner to decide, before training begins, how many attention heads to use, how deep the network should be, and how wide each component should be. These decisions are made without knowledge of the task, producing architectures that are systematically larger than necessary: empirical studies find that a substantial fraction of heads and layers can be removed after training without performance loss. This paper introduces DDCL-INCRT, an architecture that determines its own structure during training. Two complementary ideas are combined. The first, DDCL (Deep Dual Competitive Learning), replaces the feedforward block with a dictionary of learned prototype vectors representing the most informative directions in the data. The prototypes spread apart automatically, driven by the training objective, without explicit regularisation. The second, INCRT (Incremental Transformer), controls the number of heads: starting from one, it adds a new head only when the directional information uncaptured by existing heads exceeds a threshold. The main theoretical finding is that these two mechanisms reinforce each other: each new head amplifies prototype separation, which in turn raises the signal triggering the next addition. At convergence, the network self-organises into a hierarchy of heads ordered by representational granularity. This hierarchical structure is proved to be unique and minimal, the smallest architecture sufficient for the task, under the stated conditions. Formal guarantees of stability, convergence, and pruning safety are established throughout. The architecture is not something one designs. It is something one derives.
[529] LI-DSN: A Layer-wise Interactive Dual-Stream Network for EEG Decoding
Chenghao Yue, Zhiyuan Ma, Zhongye Xia, Xinche Zhang, Yisi Zhang, Xinke Shen, Sen Song
Main category: cs.LG
TL;DR: LI-DSN: A layer-wise interactive dual-stream network for EEG analysis that enables progressive cross-stream communication at each layer using Temporal-Spatial Integration Attention, outperforming 13 SOTA models across 8 EEG datasets.
Details
Motivation: Current dual-stream neural networks for EEG process temporal and spatial features independently with late-stage fusion, creating "information silos" that prevent intermediate cross-stream refinement and hinder optimal spatial-temporal feature utilization.Method: Proposes LI-DSN with Temporal-Spatial Integration Attention (TSIA) mechanism that constructs Spatial Affinity Correlation Matrix (SACM) for inter-electrode spatial relationships and Temporal Channel Aggregation Matrix (TCAM) for cosine-gated temporal dynamics under spatial guidance, plus adaptive fusion with learnable channel weights.
Result: Extensive experiments across eight diverse EEG datasets (motor imagery classification, emotion recognition, SSVEP) show LI-DSN significantly outperforms 13 state-of-the-art baseline models, demonstrating superior robustness and decoding performance.
Conclusion: LI-DSN overcomes limitations of late-fusion paradigms by enabling progressive cross-stream communication at each layer, providing a more effective approach for EEG analysis across multiple applications.
Abstract: Electroencephalography (EEG) provides a non-invasive window into brain activity, offering high temporal resolution crucial for understanding and interacting with neural processes through brain-computer interfaces (BCIs). Current dual-stream neural networks for EEG often process temporal and spatial features independently through parallel branches, delaying their integration until a final, late-stage fusion. This design inherently leads to an “information silo” problem, precluding intermediate cross-stream refinement and hindering spatial-temporal decompositions essential for full feature utilization. We propose LI-DSN, a layer-wise interactive dual-stream network that facilitates progressive, cross-stream communication at each layer, thereby overcoming the limitations of late-fusion paradigms. LI-DSN introduces a novel Temporal-Spatial Integration Attention (TSIA) mechanism, which constructs a Spatial Affinity Correlation Matrix (SACM) to capture inter-electrode spatial structural relationships and a Temporal Channel Aggregation Matrix (TCAM) to integrate cosine-gated temporal dynamics under spatial guidance. Furthermore, we employ an adaptive fusion strategy with learnable channel weights to optimize the integration of dual-stream features. Extensive experiments across eight diverse EEG datasets, encompassing motor imagery (MI) classification, emotion recognition, and steady-state visual evoked potentials (SSVEP), consistently demonstrate that LI-DSN significantly outperforms 13 state-of-the-art (SOTA) baseline models, showcasing its superior robustness and decoding performance. The code will be publicized after acceptance.
[530] Enhancing the Reliability of Medical AI through Expert-guided Uncertainty Modeling
Aleksei Khalin, Ekaterina Zaychenkova, Aleksandr Yugay, Andrey Goncharov, Sergey Korchagin, Alexey Zaytsev, Egor Ershov
Main category: cs.LG
TL;DR: Novel uncertainty estimation method for healthcare AI that leverages expert disagreement to better quantify aleatoric uncertainty, improving uncertainty estimation by 9-50% across medical imaging and question answering tasks.
Details
Motivation: AI errors in healthcare can have severe consequences, and while uncertainty estimation helps identify high-risk cases, current methods are limited in quantifying aleatoric uncertainty (data ambiguity/noise). Expert disagreement provides valuable information about uncertainty that current approaches don't capture.Method: Proposes using expert disagreement to generate training targets alongside standard labels. Uses a two-ensemble approach (and lightweight variant) to separately estimate two uncertainty components via the law of total variance. Validated on binary image classification, binary/multi-class image segmentation, and multiple-choice question answering.
Result: Incorporating expert knowledge improves uncertainty estimation quality by 9% to 50% depending on the task, demonstrating that expert disagreement is invaluable for building risk-aware AI systems in healthcare.
Conclusion: Expert disagreement provides crucial information for uncertainty estimation in healthcare AI, enabling better risk assessment and more reliable second-opinion systems through improved quantification of aleatoric uncertainty.
Abstract: Artificial intelligence (AI) systems accelerate medical workflows and improve diagnostic accuracy in healthcare, serving as second-opinion systems. However, the unpredictability of AI errors poses a significant challenge, particularly in healthcare contexts, where mistakes can have severe consequences. A widely adopted safeguard is to pair predictions with uncertainty estimation, enabling human experts to focus on high-risk cases while streamlining routine verification. Current uncertainty estimation methods, however, remain limited, particularly in quantifying aleatoric uncertainty, which arises from data ambiguity and noise. To address this, we propose a novel approach that leverages disagreement in expert responses to generate targets for training machine learning models. These targets are used in conjunction with standard data labels to estimate two components of uncertainty separately, as given by the law of total variance, via a two-ensemble approach, as well as its lightweight variant. We validate our method on binary image classification, binary and multi-class image segmentation, and multiple-choice question answering. Our experiments demonstrate that incorporating expert knowledge can enhance uncertainty estimation quality by $9%$ to $50%$ depending on the task, making this source of information invaluable for the construction of risk-aware AI systems in healthcare applications.
[531] The Rank and Gradient Lost in Non-stationarity: Sample Weight Decay for Mitigating Plasticity Loss in Reinforcement Learning
Zihao Wu, Hongyao Tang, Yi Ma, Jiashun Liu, Yan Zheng, Jianye Hao
Main category: cs.LG
TL;DR: Theoretical analysis of plasticity loss in deep RL identifies two mechanisms: NTK rank collapse and gradient decay, proposes Sample Weight Decay to restore gradient magnitude.
Details
Motivation: Deep RL suffers from plasticity loss due to non-stationarity, but theoretical understanding is limited. The paper aims to bridge this gap by analyzing plasticity loss from a network optimization perspective.Method: Theoretical analysis identifies two mechanisms: NTK Gram matrix rank collapse and Θ(1/k) gradient decay. Proposes Sample Weight Decay to restore gradient magnitude, applied to experience replay-based RL methods.
Result: Method effectively alleviates plasticity loss, improves learning performance across TD3, Double DQN, SAC with SimBa architecture in MuJoCo, ALE, and DeepMind Control Suite tasks, achieving SOTA on DMC Humanoid.
Conclusion: Provides theoretical understanding of plasticity loss mechanisms and offers practical solution (Sample Weight Decay) that works across various RL algorithms and environments.
Abstract: Deep reinforcement learning (RL) suffers from plasticity loss severely due to the nature of non-stationarity, which impairs the ability to adapt to new data and learn continually. Unfortunately, our understanding of how plasticity loss arises, dissipates, and can be dissolved remains limited to empirical findings, leaving the theoretical end underexplored.To address this gap, we study the plasticity loss problem from the theoretical perspective of network optimization. By formally characterizing the two culprit factors in online RL process: the non-stationarity of data distributions and the non-stationarity of targets induced by bootstrapping, our theory attributes the loss of plasticity to two mechanisms: the rank collapse of the Neural Tangent Kernel (NTK) Gram matrix and the $Θ(\frac{1}{k})$ decay of gradient magnitude. The first mechanism echoes prior empirical findings from the theoretical perspective and sheds light on the effects of existing methods, e.g., network reset, neuron recycle, and noise injection. Against this backdrop, we focus primarily on the second mechanism and aim to alleviate plasticity loss by addressing the gradient attenuation issue, which is orthogonal to existing methods. We propose Sample Weight Decay – a lightweight method to restore gradient magnitude, as a general remedy to plasticity loss for deep RL methods based on experience replay. In experiments, we evaluate the efficacy of \methodName upon TD3, \myadded{Double DQN} and SAC with SimBa architecture in MuJoCo, \myadded{ALE} and DeepMind Control Suite tasks. The results demonstrate that \methodName effectively alleviates plasticity loss and consistently improves learning performance across various configurations of deep RL algorithms, UTD, network architectures, and environments, achieving SOTA performance on challenging DMC Humanoid tasks.
[532] PAC-Bayesian Reward-Certified Outcome Weighted Learning
Yuya Ishikawa, Shu Tamano
Main category: cs.LG
TL;DR: PROWL is a PAC-Bayesian framework for robust individualized treatment rule estimation that accounts for reward uncertainty through certified learning with automated calibration.
Details
Motivation: Existing outcome weighted learning methods for ITRs rely on noisy or optimistic reward proxies, leading to inflated performance estimates. Current frameworks lack finite-sample guarantees to systematically incorporate reward uncertainty into learning objectives.Method: PROWL constructs conservative rewards using one-sided uncertainty certificates and policy-dependent lower bounds. It transforms robust policy learning into cost-sensitive classification, derives PAC-Bayes lower bounds for randomized ITRs, and introduces automated calibration with Fisher-consistent certified hinge surrogates.
Result: Experiments show PROWL achieves improvements in estimating robust, high-value treatment regimes under severe reward uncertainty compared to standard ITR estimation methods.
Conclusion: PROWL provides a principled PAC-Bayesian framework for reward-certified outcome weighted learning with finite-sample guarantees, addressing reward uncertainty in treatment policy estimation.
Abstract: Estimating optimal individualized treatment rules (ITRs) via outcome weighted learning (OWL) often relies on observed rewards that are noisy or optimistic proxies for the true latent utility. Ignoring this reward uncertainty leads to the selection of policies with inflated apparent performance, yet existing OWL frameworks lack the finite-sample guarantees required to systematically embed such uncertainty into the learning objective. To address this issue, we propose PAC-Bayesian Reward-Certified Outcome Weighted Learning (PROWL). Given a one-sided uncertainty certificate, PROWL constructs a conservative reward and a strictly policy-dependent lower bound on the true expected value. Theoretically, we prove an exact certified reduction that transforms robust policy learning into a unified, split-free cost-sensitive classification task. This formulation enables the derivation of a nonasymptotic PAC-Bayes lower bound for randomized ITRs, where we establish that the optimal posterior maximizing this bound is exactly characterized by a general Bayes update. To overcome the learning-rate selection problem inherent in generalized Bayesian inference, we introduce a fully automated, bounds-based calibration procedure, coupled with a Fisher-consistent certified hinge surrogate for efficient optimization. Our experiments demonstrate that PROWL achieves improvements in estimating robust, high-value treatment regimes under severe reward uncertainty compared to standard methods for ITR estimation.
[533] annbatch unlocks terabyte-scale training of biological data in anndata
Ilan Gold, Felix Fischer, Lucas Arnoldt, F. Alexander Wolf, Fabian J. Theis
Main category: cs.LG
TL;DR: annbatch is a mini-batch loader for anndata that enables out-of-core training on disk-backed biological datasets, significantly improving loading throughput and reducing training time while maintaining compatibility with the scverse ecosystem.
Details
Motivation: Biological datasets now routinely exceed system memory, making data access the primary bottleneck in training machine-learning models. This is particularly acute in biology where community data formats must support heterogeneous metadata, sparse/dense assays, and downstream analysis within established computational ecosystems.Method: Developed annbatch, a mini-batch loader native to anndata that enables out-of-core training directly on disk-backed datasets. It works across single-cell transcriptomics, microscopy, and whole-genome sequencing benchmarks.
Result: annbatch increases loading throughput by up to an order of magnitude and shortens training from days to hours, while remaining fully compatible with the scverse ecosystem.
Conclusion: annbatch establishes a practical data-loading infrastructure for scalable biological AI, allowing increasingly large and diverse datasets to be used without abandoning standard biological data formats.
Abstract: The scale of biological datasets now routinely exceeds system memory, making data access rather than model computation the primary bottleneck in training machine-learning models. This bottleneck is particularly acute in biology, where widely used community data formats must support heterogeneous metadata, sparse and dense assays, and downstream analysis within established computational ecosystems. Here we present annbatch, a mini-batch loader native to anndata that enables out-of-core training directly on disk-backed datasets. Across single-cell transcriptomics, microscopy and whole-genome sequencing benchmarks, annbatch increases loading throughput by up to an order of magnitude and shortens training from days to hours, while remaining fully compatible with the scverse ecosystem. Annbatch establishes a practical data-loading infrastructure for scalable biological AI, allowing increasingly large and diverse datasets to be used without abandoning standard biological data formats. Github: https://github.com/scverse/annbatch
[534] Learn by Surprise, Commit by Proof
Kang-Sin Choi
Main category: cs.LG
TL;DR: LSCP is a self-gated post-training framework that enables models to autonomously acquire knowledge by identifying gaps in their understanding, generating Q&A chains to articulate knowledge, and adjusting learning intensity based on conviction depth.
Details
Motivation: To create a framework for autonomous knowledge acquisition that learns only what a model doesn't know, verifies against existing knowledge, and prevents rote memorization while sharpening weakly encoded knowledge to reduce hallucinations.Method: Uses per-token loss anomalies to flag knowledge gaps, generates Q&A chains for self-verification, adjusts AdamW’s β₂ parameter proportionally to conviction depth (k) via β₂ = 0.999·rᵏ, with learning intensity controlled by a single parameter r.
Result: LSCP achieves semantic learning (2.7-3.0x perturbation gap) vs. rote memorization in standard fine-tuning (11.6x baseline). The Q&A format prevents memorization, while gating protects neighboring knowledge from contamination (93% accuracy vs. 90% baseline).
Conclusion: LSCP enables autonomous knowledge acquisition that prevents rote memorization, sharpens existing knowledge, and protects neighboring knowledge, modeling biological memory consolidation through selective parametric weight updates.
Abstract: We propose LSCP, a self-gated post-training framework for autonomous knowledge acquisition: learning only what a model does not already know, verified against what it does know, at a strength proportional to conviction, with no external oracle. When a passage produces anomalously high per-token loss, LSCP flags it, generates a Q&A chain that forces the model to articulate its own knowledge and identify gaps, then adjusts AdamW’s $β_2$ proportionally to conviction depth k (the number of self-verification steps the passage survives) via $β_2 = 0.999 \cdot r^k$. The entire learning intensity is governed by a single parameter $r$. Beyond new knowledge, this process sharpens weakly encoded existing knowledge, which is a primary source of hallucination. The framework is self-extinguishing: as the model learns, per-token loss on learned passages decreases toward the surprisal threshold and the system progressively converges to standard AdamW. This models biological memory consolidation: temporary information in the context window is selectively consolidated into parametric weights, the model’s long-term memory. Experiments on the reference model (Qwen3-14B) and across six models (8B–32B, four families) show that standard fine-tuning produces rote memorization (perturbation gap (the ratio of paraphrase to original perplexity) of 11.6 +- 0.2 x baseline) while all LSCP conditions learn semantically (2.7–3.0x). The r=1.0 condition (identical optimizer, nearly identical data, only Q&A format differs) confirms that the training data format, not $β_2$ gating, is the primary mechanism preventing memorization; gating instead protects neighboring knowledge from contamination by corrupt content (93 +- 7% accuracy on adjacent questions at r=0.98 vs. 90% baseline).
[535] Generalization Bounds and Statistical Guarantees for Multi-Task and Multiple Operator Learning with MNO Networks
Adrien Weihs, Hayden Schaeffer
Main category: cs.LG
TL;DR: The paper develops generalization analysis for multiple operator learning with hierarchical sampling, providing covering-number-based bounds for Multiple Neural Operator architecture and explicit approximation-estimation tradeoffs.
Details
Motivation: Recent work has developed expressive architectures for multiple operator learning but lacks quantitative statistical generalization guarantees. The paper aims to provide rigorous generalization analysis for learning operator families with hierarchical sampling structure.Method: Uses covering-number-based generalization analysis for separable models, specifically the Multiple Neural Operator (MNO) architecture. Derives metric-entropy bounds for hypothesis classes of linear combinations of products of deep ReLU subnetworks, then combines these with approximation guarantees to obtain explicit approximation-estimation tradeoffs.
Result: Obtains explicit generalization bounds showing dependence on hierarchical sampling budgets (n_α, n_u, n_x) and provides learning-rate statements in operator-sampling budget n_α, characterizing sample complexity for generalization across operator instances.
Conclusion: The analysis provides rigorous statistical guarantees for multiple operator learning, making sampling budget dependencies transparent and offering insights into generalization across operator instances. The architecture can be viewed as a general-purpose solver or small PDE foundation model.
Abstract: Multiple operator learning concerns learning operator families ${G[α]:U\to V}_{α\in W}$ indexed by an operator descriptor $α$. Training data are collected hierarchically by sampling operator instances $α$, then input functions $u$ per instance, and finally evaluation points $x$ per input, yielding noisy observations of $G[α]u$. While recent work has developed expressive multi-task and multiple operator learning architectures and approximation-theoretic scaling laws, quantitative statistical generalization guarantees remain limited. We provide a covering-number-based generalization analysis for separable models, focusing on the Multiple Neural Operator (MNO) architecture: we first derive explicit metric-entropy bounds for hypothesis classes given by linear combinations of products of deep ReLU subnetworks, and then combine these complexity bounds with approximation guarantees for MNO to obtain an explicit approximation-estimation tradeoff for the expected test error on new (unseen) triples $(α,u,x)$. The resulting bound makes the dependence on the hierarchical sampling budgets $(n_α,n_u,n_x)$ transparent and yields an explicit learning-rate statement in the operator-sampling budget $n_α$, providing a sample-complexity characterization for generalization across operator instances. The structure and architecture can also be viewed as a general purpose solver or an example of a “small’’ PDE foundation model, where the triples are one form of multi-modality.
[536] World Action Verifier: Self-Improving World Models via Forward-Inverse Asymmetry
Yuejiang Liu, Fan Feng, Lingjing Kong, Weifeng Lu, Jinzhou Tang, Kun Zhang, Kevin Murphy, Chelsea Finn, Yilun Du
Main category: cs.LG
TL;DR: WAV framework improves world model robustness by decomposing action-conditioned state prediction into state plausibility and action reachability, using cycle consistency verification with diverse subgoal generation and sparse inverse modeling.
Details
Motivation: World models need robustness over broad ranges of suboptimal actions, but action-labeled interaction data insufficiently covers these regimes, making reliable prediction challenging.Method: Decompose action-conditioned state prediction into state plausibility (verifiable with action-free data) and action reachability (using sparse inverse models). Augment world model with diverse subgoal generator from video corpora and sparse inverse model that infers actions from subset of state features. Enforce cycle consistency among generated subgoals, inferred actions, and forward rollouts for verification.
Result: Achieves 2x higher sample efficiency and improves downstream policy performance by 18% across nine tasks spanning MiniGrid, RoboMimic, and ManiSkill environments.
Conclusion: WAV framework effectively addresses world model robustness by leveraging asymmetries in data availability and feature dimensionality, enabling self-verification and improvement in under-explored action regimes.
Abstract: General-purpose world models promise scalable policy evaluation, optimization, and planning, yet achieving the required level of robustness remains challenging. Unlike policy learning, which primarily focuses on optimal actions, a world model must be reliable over a much broader range of suboptimal actions, which are often insufficiently covered by action-labeled interaction data. To address this challenge, we propose World Action Verifier (WAV), a framework that enables world models to identify their own prediction errors and self-improve. The key idea is to decompose action-conditioned state prediction into two factors – state plausibility and action reachability – and verify each separately. We show that these verification problems can be substantially easier than predicting future states due to two underlying asymmetries: the broader availability of action-free data and the lower dimensionality of action-relevant features. Leveraging these asymmetries, we augment a world model with (i) a diverse subgoal generator obtained from video corpora and (ii) a sparse inverse model that infers actions from a subset of state features. By enforcing cycle consistency among generated subgoals, inferred actions, and forward rollouts, WAV provides an effective verification mechanism in under-explored regimes, where existing methods typically fail. Across nine tasks spanning MiniGrid, RoboMimic, and ManiSkill, our method achieves 2x higher sample efficiency while improving downstream policy performance by 18%.
[537] Apriel-Reasoner: RL Post-Training for General-Purpose and Efficient Reasoning
Rafael Pardinas, Ehsan Kamalloo, David Vazquez, Alexandre Drouin
Main category: cs.LG
TL;DR: Apriel-Reasoner: A 15B-parameter LLM trained with multi-domain RL post-training across mathematics, code generation, instruction following, logical puzzles, and function calling, featuring adaptive domain sampling and difficulty-aware length penalty to optimize reasoning efficiency.
Details
Motivation: Current frontier open-weight models use RLVR for general-purpose reasoning but lack transparency in training recipes and domain mixtures. Joint optimization across diverse domains is challenging due to varying rollout lengths, problem difficulty, and sample efficiency. Long chain-of-thought traces increase inference costs, making efficiency critical for practical deployment.Method: Trained Apriel-Reasoner on Apriel-Base (15B parameter LLM) using reproducible multi-domain RL post-training across five domains with public datasets. Introduced adaptive domain sampling to preserve target domain ratios despite heterogeneous rollout dynamics, and difficulty-aware extension of standard length penalty that encourages longer reasoning for difficult problems and shorter traces for easy ones without additional training overhead.
Result: Trained with strict 16K-token output budget, Apriel-Reasoner generalizes to 32K tokens at inference and improves over Apriel-Base on AIME 2025, GPQA, MMLU-Pro, and LiveCodeBench while producing 30-50% shorter reasoning traces. Matches strong open-weight models of similar size at lower token cost, pushing the Pareto frontier of accuracy versus token budget.
Conclusion: Apriel-Reasoner demonstrates effective multi-domain RL post-training with efficiency optimizations, achieving competitive performance with shorter reasoning traces and lower inference costs, advancing the trade-off between accuracy and computational efficiency in reasoning models.
Abstract: Building general-purpose reasoning models using reinforcement learning with verifiable rewards (RLVR) across diverse domains has been widely adopted by frontier open-weight models. However, their training recipes and domain mixtures are often not disclosed. Joint optimization across domains poses significant challenges: domains vary widely in rollout length, problem difficulty and sample efficiency. Further, models with long chain-of-thought traces increase inference cost and latency, making efficiency critical for practical deployment. We present Apriel-Reasoner, trained with a fully reproducible multi-domain RL post-training recipe on Apriel-Base, a 15B-parameter open-weight LLM, across five domains using public datasets: mathematics, code generation, instruction following, logical puzzles and function calling. We introduce an adaptive domain sampling mechanism that preserves target domain ratios despite heterogeneous rollout dynamics, and a difficulty-aware extension of the standard length penalty that, with no additional training overhead, encourages longer reasoning for difficult problems and shorter traces for easy ones. Trained with a strict 16K-token output budget, Apriel-Reasoner generalizes to 32K tokens at inference and improves over Apriel-Base on AIME 2025, GPQA, MMLU-Pro, and LiveCodeBench while producing 30-50% shorter reasoning traces. It matches strong open-weight models of similar size at lower token cost, thereby pushing the Pareto frontier of accuracy versus token budget.
[538] Feature Weighting Improves Pool-Based Sequential Active Learning for Regression
Dongrui Wu
Main category: cs.LG
TL;DR: Feature-weighted active learning for regression improves sample selection by weighting features based on ridge regression coefficients when computing inter-sample distances for representativeness and diversity.
Details
Motivation: Previous active learning for regression approaches don't incorporate feature importance in distance computation, leading to suboptimal sample selection. The paper aims to improve sample selection by weighting features based on their importance.Method: Proposes three feature-weighted single-task ALR approaches and two feature-weighted multi-task ALR approaches. Uses ridge regression coefficients from previously labeled samples to weight corresponding features in inter-sample distance computation for better sample selection.
Result: Experiments show this easy-to-implement enhancement almost always improves performance of four existing ALR approaches in both single-task and multi-task regression problems.
Conclusion: Feature weighting strategy effectively improves active learning for regression and can be extended to stream-based ALR and classification algorithms.
Abstract: Pool-based sequential active learning for regression (ALR) optimally selects a small number of samples sequentially from a large pool of unlabeled samples to label, so that a more accurate regression model can be constructed under a given labeling budget. Representativeness and diversity, which involve computing the distances among different samples, are important considerations in ALR. However, previous ALR approaches do not incorporate the importance of different features in inter-sample distance computation, resulting in sub-optimal sample selection. This paper proposes three feature weighted single-task ALR approaches and two feature weighted multi-task ALR approaches, where the ridge regression coefficients trained from a small amount of previously labeled samples are used to weight the corresponding features in inter-sample distance computation. Experiments showed that this easy-to-implement enhancement almost always improves the performance of four existing ALR approaches, in both single-task and multi-task regression problems. The feature weighting strategy may also be easily extended to stream-based ALR, and classification algorithms.
[539] Ouroboros: Dynamic Weight Generation for Recursive Transformers via Input-Conditioned LoRA Modulation
Jaber Jaber, Osama Jaber
Main category: cs.LG
TL;DR: Ouroboros enhances recursive transformers with a Controller hypernetwork that enables input-dependent per-step modulation, improving training performance while adding minimal parameters.
Details
Motivation: Recursive transformers reuse the same weight block across depth steps, which prevents composition of distinct operations across depth. This limitation restricts the model's ability to perform different transformations at different recurrence steps.Method: Attaches a compact Controller hypernetwork to a recursive transformer block. The Controller observes the current hidden state and produces a per-step diagonal modulation vector applied to frozen SVD-initialized LoRA bases. Combines with gated recurrence (bias-initialized to 88% retention) and per-step LayerNorm for stable deep iteration.
Result: On Qwen2.5-3B split into Prelude/Recurrent/Coda architecture, reduces training loss by 43.4% over baseline, recovering 51.3% of performance gap from layer removal. Outperforms equivalently-sized static per-step LoRA across all tested depths (1, 4, 8, 16) and ranks (8, 32, 64). Gated recurrence proved essential for success.
Conclusion: Ouroboros enables input-dependent modulation in recursive transformers, significantly improving training performance with minimal parameter overhead. However, improvements are currently limited to training distribution and don’t generalize to held-out text, suggesting limitations from frozen downstream layers.
Abstract: Recursive transformers reuse a shared weight block across multiple depth steps, trading parameters for compute. A core limitation: every step applies the same transformation, preventing the model from composing distinct operations across depth. We present Ouroboros, a system that attaches a compact Controller hypernetwork to a recursive transformer block. The Controller observes the current hidden state, produces a per-step diagonal modulation vector, and applies it to frozen SVD-initialized LoRA bases, making each recurrence step input-dependent. We combine this with gated recurrence (bias-initialized to 88% retention) and per-step LayerNorm for stable deep iteration. On Qwen2.5-3B split into a Prelude/Recurrent/Coda architecture (17 of 36 layers retained), Ouroboros reduces training loss by 43.4% over the unmodified 17-layer baseline, recovering 51.3% of the performance gap caused by layer removal. The full system adds only 9.2M trainable parameters (Controller, gate, and per-step norms) yet outperforms equivalently-sized static per-step LoRA by 1.44 loss points at depth 1 and remains ahead across all tested depths (1, 4, 8, 16) and ranks (8, 32, 64). We also find that gated recurrence is essential: without it, recursive layer application makes the model strictly worse. These gains are measured on the training distribution; on held-out text, the Controller does not yet improve over the baseline, a limitation we attribute to frozen downstream layers and discuss in detail. Code: https://github.com/RightNow-AI/ouroboros
[540] AA-SVD : Anchored and Adaptive SVD for Large Language Model Compression
Atul Kumar Sinha, François Fleuret
Main category: cs.LG
TL;DR: Fast low-rank factorization framework for compressing large language models without retraining, addressing distribution shifts and maintaining functional equivalence.
Details
Motivation: Existing factorization-based compression methods either ignore distribution shifts from upstream compression (causing error propagation) or rely only on shifted inputs (risking output drift), creating a need for a method that accounts for both original outputs and input distribution changes.Method: Proposes a low-rank factorization framework that anchors compressed layers to original outputs while explicitly modeling input distribution shifts, with end-to-end transformer block refinement to minimize output distortion and allow joint error compensation.
Result: Consistently outperforms existing SVD-based baselines across compression ratios, with advantages becoming more pronounced at aggressive compression budgets where competing methods degrade substantially or collapse entirely.
Conclusion: Provides a practical solution for efficient, large-scale model deployment by maintaining functional equivalence with original models while enabling rapid compression of billion-parameter models without retraining.
Abstract: We introduce a fast low-rank factorization-based framework for compressing large language models that enables rapid compression of billion-parameter models without retraining. Unlike existing factorization-based approaches that optimize only on the original inputs, ignoring distribution shifts from upstream compression and thus propagating errors forward, or those that rely only on shifted inputs and risk drifting away from the original outputs, our approach accounts for both. Beyond individual layer compression, we further refine each transformer block end-to-end, minimizing block-level output distortion and allowing compressed layers to jointly compensate for accumulated errors. By anchoring each compressed layer to the original outputs while explicitly modeling input distribution shifts, our method finds a low-rank approximation that maintains functional equivalence with the original model. Experiments on large language models show that our method consistently outperforms existing SVD-based baselines across compression ratios, with the advantage becoming increasingly pronounced at aggressive compression budgets, where competing methods degrade substantially or collapse entirely, offering a practical solution for efficient, large-scale model deployment.
[541] Application of parametric Shallow Recurrent Decoder Network to magnetohydrodynamic flows in liquid metal blankets of fusion reactors
M. Lo Verso, C. Introini, E. Cervi, L. Savoldi, J. N. Kutz, A. Cammi
Main category: cs.LG
TL;DR: SHRED neural network integrates SVD with recurrent decoders for MHD state reconstruction from sparse measurements in fusion reactor systems.
Details
Motivation: MHD phenomena in nuclear fusion systems require computationally demanding numerical solutions, especially for multi-query, parametric, or real-time applications. There's a need for efficient data-driven approaches for state reconstruction.Method: Combines dimensionality reduction via Singular Value Decomposition (SVD) with SHallow REcurrent Decoder (SHRED) neural network architecture to reconstruct full spatio-temporal MHD states from sparse time-series measurements.
Result: SHRED achieves high reconstruction accuracy, robustness, and generalization across various magnetic field configurations (constant toroidal, combined toroidal-poloidal, time-dependent) in 3D WCLL blanket cell geometry. Can infer temporal evolution of magnetic fields from temperature measurements alone.
Conclusion: SHRED is a computationally efficient, data-driven, flexible approach for MHD state reconstruction with significant potential for real-time monitoring, diagnostics, and control in fusion reactor systems.
Abstract: Magnetohydrodynamic (MHD) phenomena play a pivotal role in the design and operation of nuclear fusion systems, where electrically conducting fluids (such as liquid metals or molten salts employed in reactor blankets) interact with magnetic fields of varying intensity and orientation, influencing the resulting flow dynamics. The numerical solution of MHD models entails the resolution of highly nonlinear, multiphysics systems of equations, which can become computationally demanding, particularly in multi-query, parametric, or real-time contexts. This study investigates a fully data-driven framework for MHD state reconstruction that integrates dimensionality reduction through Singular Value Decomposition (SVD) with the SHallow REcurrent Decoder (SHRED), a neural network architecture designed to reconstruct the full spatio-temporal state from sparse time-series measurements of selected observables, including previously unseen parametric configurations. The SHRED methodology is applied to a three-dimensional geometry representative of a portion of a WCLL blanket cell, in which lead-lithium flows around a water-cooled tube. Multiple magnetic field configurations are examined, including constant toroidal fields, combined toroidal-poloidal fields, and time-dependent magnetic fields. Across all considered scenarios, SHRED achieves high reconstruction accuracy, robustness, and generalization to magnetic field intensities, orientations, and temporal evolutions not seen during training. Notably, in the presence of time-varying magnetic fields, the model accurately infers the temporal evolution of the magnetic field itself using temperature measurements alone. Overall, the findings identify SHRED as a computationally efficient, data-driven, and flexible approach for MHD state reconstruction, with significant potential for real-time monitoring, diagnostics and control in fusion reactor systems.
[542] Auction-Based Online Policy Adaptation for Evolving Objectives
Guruprerana Shabadi, Kaushik Mallik
Main category: cs.LG
TL;DR: Auction-based multi-objective RL framework where selfish local policies bid for action control, enabling dynamic adaptation to changing objectives.
Details
Motivation: Address multi-objective RL problems where objectives from identical families (like reachability) appear/disappear at runtime, requiring adaptive policies that efficiently adjust behaviors.Method: Modular framework with selfish local policies per objective, coordination via auction-based mechanism where policies bid for action control based on state urgency. Highest bidder selects action. Policies trained as general-sum game using PPO.
Result: Achieves substantially better performance than monolithic PPO-trained policies on Atari Assault and gridworld path-planning with dynamic targets.
Conclusion: Auction-based coordination enables interpretable trade-offs and runtime adaptation by simply adding/removing policies, with parameterized policies facilitating immediate adaptation.
Abstract: We consider multi-objective reinforcement learning problems where objectives come from an identical family – such as the class of reachability objectives – and may appear or disappear at runtime. Our goal is to design adaptive policies that can efficiently adjust their behaviors as the set of active objectives changes. To solve this problem, we propose a modular framework where each objective is supported by a selfish local policy, and coordination is achieved through a novel auction-based mechanism: policies bid for the right to execute their actions, with bids reflecting the urgency of the current state. The highest bidder selects the action, enabling a dynamic and interpretable trade-off among objectives. Going back to the original adaptation problem, when objectives change, the system adapts by simply adding or removing the corresponding policies. Moreover, as objectives arise from the same family, identical copies of a parameterized policy can be deployed, facilitating immediate adaptation at runtime. We show how the selfish local policies can be computed by turning the problem into a general-sum game, where the policies compete against each other to fulfill their own objectives. To succeed, each policy must not only optimize its own objective, but also reason about the presence of other goals and learn to produce calibrated bids that reflect relative priority. In our implementation, the policies are trained concurrently using proximal policy optimization (PPO). We evaluate on Atari Assault and a gridworld-based path-planning task with dynamic targets. Our method achieves substantially better performance than monolithic policies trained with PPO.
[543] Neural network methods for two-dimensional finite-source reflector design
Roel Hacking, Lisa Kusch, Koondanibha Mitra, Martijn Anthonissen, Wilbert IJzerman
Main category: cs.LG
TL;DR: Neural network approach for designing 2D reflectors to transform light from extended sources into prescribed far-field distributions, outperforming deconvolution baselines.
Details
Motivation: Address the inverse problem of designing optical reflectors that can transform light from finite, extended sources into specific far-field distributions, which has applications in illumination design, projection systems, and optical engineering.Method: Proposes neural network parameterization of reflector height with two differentiable objective functions: (1) direct change-of-variables loss pushing source distribution through learned inverse mapping, and (2) mesh-based loss mapping target-space grid back to source with continuous integration. Uses automatic differentiation and quasi-Newton optimization. Compares with deconvolution baseline using simplified finite-source approximation and Van Cittert iteration.
Result: Neural network approach converges faster and achieves consistently lower normalized mean absolute error (NMAE) than deconvolution method across four benchmarks (continuous/discontinuous sources, with/without height constraints). Handles height constraints naturally and shows superior performance.
Conclusion: Neural network parameterization with differentiable objectives provides effective solution for reflector design inverse problems, outperforming traditional deconvolution methods and offering natural handling of constraints. Method can be extended to 3D settings via iterative correction schemes.
Abstract: We address the inverse problem of designing two-dimensional reflectors that transform light from a finite, extended source into a prescribed far-field distribution. We propose a neural network parameterization of the reflector height and develop two differentiable objective functions: (i) a direct change-of-variables loss that pushes the source distribution through the learned inverse mapping, and (ii) a mesh-based loss that maps a target-space grid back to the source, integrates over intersections, and remains continuous even when the source is discontinuous. Gradients are obtained via automatic differentiation and optimized with a robust quasi-Newton method. As a comparison, we formulate a deconvolution baseline built on a simplified finite-source approximation: a 1D monotone mapping is recovered from flux balance, yielding an ordinary differential equation solved in integrating-factor form; this solver is embedded in a modified Van Cittert iteration with nonnegativity clipping and a ray-traced forward operator. Across four benchmarks – continuous and discontinuous sources, and with/without minimum-height constraints – we evaluate accuracy by ray-traced normalized mean absolute error (NMAE). Our neural network approach converges faster and achieves consistently lower NMAE than the deconvolution method, and handles height constraints naturally. We discuss how the method may be extended to rotationally symmetric and full three-dimensional settings via iterative correction schemes.
[544] On the Role of Depth in the Expressivity of RNNs
Maude Lizaire, Michael Rizvi-Martel, Éric Dupuis, Guillaume Rabusseau
Main category: cs.LG
TL;DR: Depth in RNNs increases memory capacity efficiently, while 2RNNs (with multiplicative interactions) enable polynomial transformations whose degree grows with depth, and multiplicative interactions cannot be replaced by nonlinearities.
Details
Motivation: While depth benefits in feedforward networks are well understood, it remains unclear how depth interacts with recurrence in RNNs to shape expressive power, particularly regarding memory capacity and computational capabilities.Method: Formal theoretical analysis of depth effects in RNNs and 2RNNs (generalized RNNs with multiplicative interactions), examining how depth increases memory capacity and enables polynomial transformations. Empirical validation on synthetic and real-world tasks.
Result: Depth efficiently increases RNNs’ memory capacity with respect to parameter count, enhancing both input transformations and past information retention. 2RNNs perform polynomial transformations with degree growing with depth, and multiplicative interactions cannot generally be replaced by layerwise nonlinearities.
Conclusion: Depth plays a crucial role in enhancing RNN expressivity through both increased memory capacity and computational capabilities, with multiplicative interactions in 2RNNs providing distinct advantages over standard nonlinearities.
Abstract: The benefits of depth in feedforward neural networks are well known: composing multiple layers of linear transformations with nonlinear activations enables complex computations. While similar effects are expected in recurrent neural networks (RNNs), it remains unclear how depth interacts with recurrence to shape expressive power. Here, we formally show that depth increases RNNs’ memory capacity efficiently with respect to the number of parameters, thus enhancing expressivity both by enabling more complex input transformations and improving the retention of past information. We broaden our analysis to 2RNNs, a generalization of RNNs with multiplicative interactions between inputs and hidden states. Unlike RNNs, which remain linear without nonlinear activations, 2RNNs perform polynomial transformations whose maximal degree grows with depth. We further show that multiplicative interactions cannot, in general, be replaced by layerwise nonlinearities. Finally, we validate these insights empirically on synthetic and real-world tasks.
[545] LEO: Graph Attention Network based Hybrid Multi Sensor Extended Object Fusion and Tracking for Autonomous Driving Applications
Mayank Mayank, Bharanidhar Duraisamy, Florian Geiss
Main category: cs.LG
TL;DR: LEO: A spatio-temporal Graph Attention Network that fuses multi-modal sensor tracks for dynamic object shape and trajectory estimation in automated driving, combining classical Bayesian robustness with deep learning adaptability.
Details
Motivation: Need to bridge strengths of classical Bayesian extended-object models (theoretical robustness/efficiency) and deep learning methods (adaptability) for accurate shape/trajectory estimation in automated driving, while addressing limitations of both approaches.Method: Spatio-temporal Graph Attention Network that fuses multi-modal production-grade sensor tracks, learns adaptive fusion weights, ensures temporal consistency, and represents multi-scale shapes using task-specific parallelogram ground-truth formulation.
Result: LEO models complex geometries (articulated trucks/trailers), generalizes across sensor types/configurations/object classes/regions, remains robust for challenging/long-range targets, demonstrates real-time computational efficiency on Mercedes-Benz DRIVE PILOT dataset, and shows cross-dataset generalization on View of Delft.
Conclusion: LEO successfully bridges classical Bayesian and deep learning approaches for dynamic object estimation, achieving production-ready real-time performance with strong generalization capabilities across diverse scenarios and datasets.
Abstract: Accurate shape and trajectory estimation of dynamic objects is essential for reliable automated driving. Classical Bayesian extended-object models offer theoretical robustness and efficiency but depend on completeness of a-priori and update-likelihood functions, while deep learning methods bring adaptability at the cost of dense annotations and high compute. We bridge these strengths with LEO (Learned Extension of Objects), a spatio-temporal Graph Attention Network that fuses multi-modal production-grade sensor tracks to learn adaptive fusion weights, ensure temporal consistency, and represent multi-scale shapes. Using a task-specific parallelogram ground-truth formulation, LEO models complex geometries (e.g. articulated trucks and trailers) and generalizes across sensor types, configurations, object classes, and regions, remaining robust for challenging and long-range targets. Evaluations on the Mercedes-Benz DRIVE PILOT SAE L3 dataset demonstrate real-time computational efficiency suitable for production systems; additional validation on public datasets such as View of Delft (VoD) further confirms cross-dataset generalization.
[546] Universal Hypernetworks for Arbitrary Models
Xuanfeng Zhou
Main category: cs.LG
TL;DR: Universal Hypernetwork (UHN) is a fixed-architecture generator that predicts weights from parameter, architecture, and task descriptors, enabling one generator to instantiate heterogeneous models across different architectures and tasks.
Details
Motivation: Conventional hypernetworks are tied to specific base-model parameterizations, requiring redesign and retraining when changing target architectures. The authors aim to create a universal hypernetwork that can generate weights for diverse architectures and tasks without architectural changes.Method: UHN uses deterministic parameter, architecture, and task descriptors as input to a fixed-architecture generator. This descriptor-based formulation decouples the generator from target-network parameterization, allowing one generator to instantiate models across different architecture families and tasks.
Result: (1) One fixed UHN remains competitive with direct training across vision, graph, text, and formula-regression benchmarks; (2) Supports both multi-model generalization within a family and multi-task learning across heterogeneous models; (3) Enables stable recursive generation with up to three intermediate generated UHNs.
Conclusion: UHN provides a flexible framework for weight generation that transcends architectural constraints, enabling a single generator to handle diverse models and tasks while maintaining competitive performance with direct training.
Abstract: Conventional hypernetworks are typically engineered around a specific base-model parameterization, so changing the target architecture often entails redesigning the hypernetwork and retraining it from scratch. We introduce the \emph{Universal Hypernetwork} (UHN), a fixed-architecture generator that predicts weights from deterministic parameter, architecture, and task descriptors. This descriptor-based formulation decouples the generator architecture from target-network parameterization, so one generator can instantiate heterogeneous models across the tested architecture and task families. Our empirical claims are threefold: (1) one fixed UHN remains competitive with direct training across vision, graph, text, and formula-regression benchmarks; (2) the same UHN supports both multi-model generalization within a family and multi-task learning across heterogeneous models; and (3) UHN enables stable recursive generation with up to three intermediate generated UHNs before the final base model. Our code is available at https://github.com/Xuanfeng-Zhou/UHN.
[547] go-$m$HC: Direct Parameterization of Manifold-Constrained Hyper-Connections via Generalized Orthostochastic Matrices
Torque Dandachi, Sophia Diggs-Galligan
Main category: cs.LG
TL;DR: A novel exact parameterization of doubly stochastic matrices for mixing residual streams that scales O(d³) and interpolates between efficiency and expressivity, applied to dynamic layer connectivity in language models.
Details
Motivation: Existing methods for parameterizing doubly stochastic matrices (Birkhoff polytope) either scale factorially (exact methods) or are expressivity-limited (Kronecker-factorized approaches), creating a need for efficient yet expressive parameterization for learned mixing across residual streams in neural networks.Method: Introduces go-mHC based on generalized orthostochastic matrices theory, providing exact parameterization that scales O(d³) with hyperparameter s that continuously interpolates between computational efficiency and full expressivity. Composes with Kronecker-factorized methods to recover expressivity at similar FLOP costs.
Result: Spectral analysis shows go-mHC fills Birkhoff polytope more completely than baselines. On synthetic tasks, achieves minimum theoretical loss with up to 10× faster convergence. Validated in 30M parameter GPT-style language model, demonstrating practical scaling of d as new model capacity dimension.
Conclusion: go-mHC offers expressivity, efficiency, and exactness for parameterizing doubly stochastic matrices, enabling practical scaling of residual stream mixing as a new dimension of model capacity in neural architectures.
Abstract: Doubly stochastic matrices enable learned mixing across residual streams, but parameterizing the set of doubly stochastic matrices (the Birkhoff polytope) exactly and efficiently remains an open challenge. Existing exact methods scale factorially with the number of streams ($d$), while Kronecker-factorized approaches are efficient but expressivity-limited. We introduce a novel exact parameterization grounded in the theory of generalized orthostochastic matrices, which scales as $\mathcal{O}(d^3)$ and exposes a single hyperparameter $s$ which continuously interpolates between a computationally efficient boundary and the fully expressive Birkhoff polytope. Building on Manifold-Constrained Hyper-Connections ($m$HC), a framework for learned dynamic layer connectivity, we instantiate this parameterization in go-$m$HC. Our method composes naturally with Kronecker-factorized methods, substantially recovering expressivity at similar FLOP costs. Spectral analysis indicates that go-$m$HC fills the Birkhoff polytope far more completely than Kronecker-factorized baselines. On synthetic stream-mixing tasks, go-$m$HC achieves the minimum theoretical loss while converging up to $10\times$ faster. We validate our approach in a 30M parameter GPT-style language model. The expressivity, efficiency, and exactness of go-$m$HC offer a practical avenue for scaling $d$ as a new dimension of model capacity.
[548] Smoothing the Landscape: Causal Structure Learning via Diffusion Denoising Objectives
Hao Zhu, Di Zhou, Donna Slonim
Main category: cs.LG
TL;DR: DDCD uses diffusion model denoising score matching for more stable causal discovery in high-dimensional data with adaptive acyclicity constraints.
Details
Motivation: Existing causal discovery methods (NOTEARS, DAG-GNN) struggle with scalability and stability in high-dimensional data with feature-sample imbalance. Need for faster, more stable convergence in causal structure learning.Method: Proposes Denoising Diffusion Causal Discovery (DDCD) framework that uses denoising score matching objective from diffusion models to smooth gradients for better convergence. Introduces adaptive k-hop acyclicity constraint to avoid matrix inversion and improve runtime.
Result: Demonstrates competitive performance on synthetic benchmarking data. Shows practical utility through qualitative analyses on two real-world examples.
Conclusion: DDCD provides a novel approach to causal discovery using diffusion model concepts, offering improved stability and efficiency for high-dimensional causal structure learning.
Abstract: Understanding causal dependencies in observational data is critical for informing decision-making. These relationships are often modeled as Bayesian Networks (BNs) and Directed Acyclic Graphs (DAGs). Existing methods, such as NOTEARS and DAG-GNN, often face issues with scalability and stability in high-dimensional data, especially when there is a feature-sample imbalance. Here, we show that the denoising score matching objective of diffusion models could smooth the gradients for faster, more stable convergence. We also propose an adaptive k-hop acyclicity constraint that improves runtime over existing solutions that require matrix inversion. We name this framework Denoising Diffusion Causal Discovery (DDCD). Unlike generative diffusion models, DDCD utilizes the reverse denoising process to infer a parameterized causal structure rather than to generate data. We demonstrate the competitive performance of DDCDs on synthetic benchmarking data. We also show that our methods are practically useful by conducting qualitative analyses on two real-world examples. Code is available at this url: https://github.com/haozhu233/ddcd.
[549] Batched Contextual Reinforcement: A Task-Scaling Law for Efficient Reasoning
Bangji Yang, Hongbo Ma, Jiajun Fan, Ge Liu
Main category: cs.LG
TL;DR: BCR trains LLMs to solve multiple problems simultaneously in shared context, creating implicit token budget that reduces token usage while maintaining or improving accuracy.
Details
Motivation: Chain-of-Thought reasoning in LLMs suffers from excessive token consumption that inflates inference costs. Existing efficiency methods degrade reasoning quality or require complex training pipelines.Method: Batched Contextual Reinforcement (BCR): single-stage training paradigm where model solves N problems simultaneously within shared context window, rewarded purely by per-instance accuracy.
Result: Reduces token usage by 15.8% to 62.6% while maintaining/improving accuracy across five mathematical benchmarks. Identifies task-scaling law: per-problem token usage decreases as N increases, with graceful accuracy degradation.
Conclusion: BCR offers stable constraint-based alternative for length control, circumventing adversarial gradients and optimization collapse of explicit length penalties. Simple structural incentives unlock latent high-density reasoning in LLMs.
Abstract: Large Language Models employing Chain-of-Thought reasoning achieve strong performance but suffer from excessive token consumption that inflates inference costs. Existing efficiency methods such as explicit length penalties, difficulty estimators, or multi-stage curricula either degrade reasoning quality or require complex training pipelines. We introduce Batched Contextual Reinforcement, a minimalist, single-stage training paradigm that unlocks efficient reasoning through a simple structural modification: training the model to solve N problems simultaneously within a shared context window, rewarded purely by per-instance accuracy. This formulation creates an implicit token budget that yields several key findings: (1) We identify a novel task-scaling law: as the number of concurrent problems N increases during inference, per-problem token usage decreases monotonically while accuracy degrades far more gracefully than baselines, establishing N as a controllable throughput dimension. (2) BCR challenges the traditional accuracy-efficiency trade-off by demonstrating a “free lunch” phenomenon at standard single-problem inference. Across both 1.5B and 4B model families, BCR reduces token usage by 15.8% to 62.6% while consistently maintaining or improving accuracy across five major mathematical benchmarks. (3) Qualitative analyses reveal emergent self-regulated efficiency, where models autonomously eliminate redundant metacognitive loops without explicit length supervision. (4) Crucially, we empirically demonstrate that implicit budget constraints successfully circumvent the adversarial gradients and catastrophic optimization collapse inherent to explicit length penalties, offering a highly stable, constraint-based alternative for length control. These results prove BCR practical, showing simple structural incentives unlock latent high-density reasoning in LLMs.
[550] Model-Based Reinforcement Learning for Control under Time-Varying Dynamics
Klemens Iten, Bruce Lee, Chenhao Li, Lenart Treven, Andreas Krause, Bhavya Sukhija
Main category: cs.LG
TL;DR: Continual model-based RL for control under time-varying dynamics using Gaussian process models with adaptive data buffers to handle non-stationarity.
Details
Motivation: Real-world systems often violate the stationary dynamics assumption due to drift, wear, or changing conditions, requiring RL methods that can handle time-varying dynamics.Method: Uses Gaussian process dynamics models under frequentist variation-budget assumptions, with an optimistic model-based RL algorithm featuring adaptive data buffer mechanisms to limit influence of outdated data.
Result: Analysis shows persistent non-stationarity requires limiting outdated data influence to maintain calibrated uncertainty and meaningful dynamic regret guarantees. The proposed algorithm demonstrates improved performance on continuous control benchmarks with non-stationary dynamics.
Conclusion: Effective continual RL for control under time-varying dynamics requires explicit mechanisms to handle non-stationarity, with adaptive data buffers being crucial for maintaining performance as dynamics evolve.
Abstract: Learning-based control methods typically assume stationary system dynamics, an assumption often violated in real-world systems due to drift, wear, or changing operating conditions. We study reinforcement learning for control under time-varying dynamics. We consider a continual model-based reinforcement learning setting in which an agent repeatedly learns and controls a dynamical system whose transition dynamics evolve across episodes. We analyze the problem using Gaussian process dynamics models under frequentist variation-budget assumptions. Our analysis shows that persistent non-stationarity requires explicitly limiting the influence of outdated data to maintain calibrated uncertainty and meaningful dynamic regret guarantees. Motivated by these insights, we propose a practical optimistic model-based reinforcement learning algorithm with adaptive data buffer mechanisms and demonstrate improved performance on continuous control benchmarks with non-stationary dynamics.
[551] SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization
Zhengxi Lu, Zhiyuan Yao, Jinyang Wu, Chengcheng Han, Qi Gu, Xunliang Cai, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
Main category: cs.LG
TL;DR: SKILL0 is an in-context reinforcement learning framework that internalizes agent skills into model parameters through a curriculum that gradually withdraws skill context, enabling zero-shot autonomous behavior without runtime skill retrieval.
Details
Motivation: Current inference-time skill augmentation for LLM agents has limitations: retrieval noise introduces irrelevant guidance, injected skill content imposes substantial token overhead, and models never truly acquire knowledge but merely follow instructions. The authors want to enable skill internalization into model parameters for zero-shot autonomous behavior.Method: SKILL0 uses an in-context reinforcement learning framework with a training-time curriculum that begins with full skill context and progressively withdraws it. Skills are grouped by category and rendered with interaction history into compact visual context. A Dynamic Curriculum evaluates each skill file’s on-policy helpfulness, retaining only those from which the current policy still benefits within a linearly decaying budget.
Result: SKILL0 achieves substantial improvements over standard RL baselines (+9.7% for ALFWorld and +6.6% for Search-QA) while maintaining highly efficient context of fewer than 0.5k tokens per step.
Conclusion: The framework successfully internalizes skills into model parameters, enabling zero-shot autonomous behavior without runtime skill retrieval, addressing key limitations of current inference-time skill augmentation approaches.
Abstract: Agent skills, structured packages of procedural knowledge and executable resources that agents dynamically load at inference time, have become a reliable mechanism for augmenting LLM agents. Yet inference-time skill augmentation is fundamentally limited: retrieval noise introduces irrelevant guidance, injected skill content imposes substantial token overhead, and the model never truly acquires the knowledge it merely follows. We ask whether skills can instead be internalized into model parameters, enabling zero-shot autonomous behavior without any runtime skill retrieval. We introduce SKILL0, an in-context reinforcement learning framework designed for skill internalization. SKILL0 introduces a training-time curriculum that begins with full skill context and progressively withdraws it. Skills are grouped offline by category and rendered with interaction history into a compact visual context, teaching he model tool invocation and multi-turn task completion. A Dynamic Curriculum then evaluates each skill file’s on-policy helpfulness, retaining only those from which the current policy still benefits within a linearly decaying budget, until the agent operates in a fully zero-shot setting. Extensive agentic experiments demonstrate that SKILL0 achieves substantial improvements over the standard RL baseline (+9.7% for ALFWorld and +6.6% for Search-QA), while maintaining a highly efficient context of fewer than 0.5k tokens per step. Our code is available at https://github.com/ZJU-REAL/SkillZero.
[552] Crystalite: A Lightweight Transformer for Efficient Crystal Modeling
Tin Hadži Veljković, Joshua Rosenthal, Ivor Lončarić, Jan-Willem van de Meent
Main category: cs.LG
TL;DR: Crystalite: A lightweight diffusion Transformer for crystal modeling using subatomic tokenization and geometric attention biases, achieving state-of-the-art results with faster sampling than equivariant graph networks.
Details
Motivation: Existing generative models for crystalline materials rely on costly equivariant graph neural networks that are slow to train and sample. There's a need for more efficient models that maintain geometric awareness while being computationally practical.Method: Proposes Crystalite, a diffusion Transformer with two key innovations: 1) Subatomic Tokenization - compact chemically structured atom representation replacing high-dimensional one-hot encodings, better suited for continuous diffusion; 2) Geometry Enhancement Module (GEM) - injects periodic minimum-image pair geometry directly into attention through additive geometric biases.
Result: Achieves state-of-the-art results on crystal structure prediction benchmarks and de novo generation performance, attaining the best S.U.N. discovery score among evaluated baselines while sampling substantially faster than geometry-heavy alternatives.
Conclusion: Crystalite preserves Transformer simplicity and efficiency while being better matched to crystalline materials structure, offering a lightweight alternative to costly equivariant graph networks for crystal modeling.
Abstract: Generative models for crystalline materials often rely on equivariant graph neural networks, which capture geometric structure well but are costly to train and slow to sample. We present Crystalite, a lightweight diffusion Transformer for crystal modeling built around two simple inductive biases. The first is Subatomic Tokenization, a compact chemically structured atom representation that replaces high-dimensional one-hot encodings and is better suited to continuous diffusion. The second is the Geometry Enhancement Module (GEM), which injects periodic minimum-image pair geometry directly into attention through additive geometric biases. Together, these components preserve the simplicity and efficiency of a standard Transformer while making it better matched to the structure of crystalline materials. Crystalite achieves state-of-the-art results on crystal structure prediction benchmarks, and de novo generation performance, attaining the best S.U.N. discovery score among the evaluated baselines while sampling substantially faster than geometry-heavy alternatives.
[553] Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing
Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, Tat-Seng Chua
Main category: cs.LG
TL;DR: SRPO is a unified RL framework for LLM post-training that combines GRPO’s reward-aligned reinforcement for correct samples with SDPO’s targeted logit-level correction for failed samples, using entropy-aware weighting to prevent late-stage collapse.
Details
Motivation: Current RL methods for LLM post-training have limitations: GRPO lacks token-level focus for specific deviations, while SDPO provides dense supervision but collapses during prolonged training due to optimization ambiguity and degrading signal reliability.Method: SRPO routes correct samples to GRPO’s reward-aligned reinforcement and failed samples to SDPO’s targeted logit-level correction. It incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones.
Result: SRPO achieves both rapid early improvement and long-horizon stability, surpassing peak performance of both baselines. On Qwen3-8B, it raises five-benchmark average by 3.4% over GRPO and 6.3% over SDPO, with moderate response lengths and up to 17.2% lower per-step compute cost.
Conclusion: SRPO provides a unified solution that combines the strengths of both GRPO and SDPO while addressing their weaknesses, offering stable and efficient RL for LLM post-training across multiple benchmarks and model scales.
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher’s signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO’s reward-aligned reinforcement and failed samples to SDPO’s targeted logit-level correction. SRPO further incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.
[554] Taming the Exponential: A Fast Softmax Surrogate for Integer-Native Edge Inference
Dimitrios Danopoulos, Enrico Lupi, Michael Kagan, Maurizio Pierini
Main category: cs.LG
TL;DR: HCCS is a hardware-efficient softmax surrogate using clipped linear mapping with head-specific calibration for AMD AI Engines, enabling int8 optimization while maintaining accuracy.
Details
Motivation: Softmax computation becomes a bottleneck in Transformer MHA blocks, especially in small models with low-precision inference where exponentiation and normalization operations are expensive. Existing implementations on AMD AI Engines use inefficient bfloat16 or LUT-based approaches that don't utilize high-throughput integer vector units.Method: Proposes Head-Calibrated Clipped-Linear Softmax (HCCS) - a bounded, monotone surrogate using clipped linear mapping of max-centered attention logits. Includes lightweight calibration parameters optimized offline per attention head using representative datasets to preserve statistical properties. Specifically designed for AMD AI Engines’ int8 MAC units.
Result: HCCS significantly exceeds speed performance of reference implementations on AMD AI Engines while maintaining competitive task accuracy on small or heavily quantized MHA workloads after quantization-aware retraining. First int8-optimized softmax surrogate for AMD AI Engines.
Conclusion: HCCS provides an efficient hardware-mapped alternative to softmax that leverages integer processing capabilities, addressing computational bottlenecks in Transformer inference while preserving model accuracy through head-specific calibration.
Abstract: Softmax can become a computational bottleneck in the Transformer model’s Multi-Head Attention (MHA) block, particularly in small models under low-precision inference, where exponentiation and normalization incur significant overhead. As such, we suggest using Head-Calibrated Clipped-Linear Softmax (HCCS), a bounded, monotone surrogate to the exponential softmax function, which uses a clipped linear mapping of the max centered attention logits. This approximation produces a stable probability distribution, maintains the ordering of the original logits and has non-negative values. HCCS differs from previous softmax surrogates as it includes a set of lightweight calibration parameters that are optimized offline based on a representative dataset and calibrated for each individual attention head to preserve the statistical properties of the individual heads. We describe a hardware-motivated implementation of HCCS for high-throughput scenarios targeting the AMD Versal AI Engines. The current reference implementations from AMD for this platform rely upon either bfloat16 arithmetic or LUTs to perform the exponential operation, which might limit the throughput of the platform and fail to utilize the high-throughput integer vector processing units of the AI Engine. In contrast, HCCS provides a natural mapping to the AI Engines’ int8 multiply accumulate (MAC) units. To the best of our knowledge, this is the first int8 optimized softmax surrogate for AMD AI engines that significantly exceeds the speed performance of other reference implementations while maintaining competitive task accuracy on small or heavily quantized MHA workloads after quantization-aware retraining.
[555] In-context Learning in Presence of Spurious Correlations
Hrayr Harutyunyan, Rafayel Darbinyan, Samvel Karapetyan, Hrant Khachatrian
Main category: cs.LG
TL;DR: Training in-context learners for classification with spurious features: conventional methods fail due to spurious feature susceptibility and task memorization, but a novel training approach can match/exceed ERM and GroupDRO performance on specific tasks, though generalization requires diverse synthetic training data.
Details
Motivation: While transformers show impressive in-context learning capabilities for simple regression tasks, their application to classification tasks with spurious features remains underexplored. The paper investigates whether in-context learners can be effectively trained for such classification tasks and addresses issues like susceptibility to spurious features and task memorization.Method: The authors first identify limitations of conventional in-context learning training approaches, which are susceptible to spurious features and lead to task memorization when meta-training on single-task datasets. They then propose a novel training technique specifically designed for classification tasks with spurious features. To achieve generalization to unseen tasks, they train on a diverse dataset of synthetic in-context learning instances.
Result: The proposed in-context learner matches and sometimes outperforms strong baselines like ERM (Empirical Risk Minimization) and GroupDRO (Group Distributionally Robust Optimization) on specific classification tasks. However, it shows poor generalization to other tasks. Training on diverse synthetic in-context learning instances enables the model to generalize to unseen tasks.
Conclusion: In-context learners can be effectively trained for classification tasks with spurious features, achieving competitive performance with specialized methods, but require careful training approaches to avoid spurious feature susceptibility and task memorization. Generalization across tasks necessitates diverse synthetic training data rather than single-task meta-training.
Abstract: Large language models exhibit a remarkable capacity for in-context learning, where they learn to solve tasks given a few examples. Recent work has shown that transformers can be trained to perform simple regression tasks in-context. This work explores the possibility of training an in-context learner for classification tasks involving spurious features. We find that the conventional approach of training in-context learners is susceptible to spurious features. Moreover, when the meta-training dataset includes instances of only one task, the conventional approach leads to task memorization and fails to produce a model that leverages context for predictions. Based on these observations, we propose a novel technique to train such a learner for a given classification task. Remarkably, this in-context learner matches and sometimes outperforms strong methods like ERM and GroupDRO. However, unlike these algorithms, it does not generalize well to other tasks. We show that it is possible to obtain an in-context learner that generalizes to unseen tasks by training on a diverse dataset of synthetic in-context learning instances.
[556] Group Representational Position Encoding
Yifan Zhang, Zixiang Chen, Yifeng Liu, Zhen Qin, Huizhuo Yuan, Kangping Xu, Yang Yuan, Quanquan Gu, Andrew Chi-Chih Yao
Main category: cs.LG
TL;DR: GRAPE is a unified framework for positional encoding using group actions that unifies multiplicative rotations (like RoPE) and additive logit biases (like ALiBi) under a single mathematical framework.
Details
Motivation: To provide a principled mathematical framework for positional encoding in transformers that unifies existing approaches (RoPE and ALiBi) and enables systematic exploration of new positional geometries for long-context models.Method: Uses group actions for positional encoding: (1) Multiplicative GRAPE with rotations in SO(d) using matrix exponentials of skew-symmetric generators, and (2) Additive GRAPE with unipotent actions in GL group producing additive logit biases. The framework allows learned commuting subspaces and non-commuting mixtures.
Result: GRAPE recovers RoPE exactly as a special case of Multiplicative GRAPE and ALiBi/FoX as special cases of Additive GRAPE, while enabling new positional geometries with cross-subspace feature coupling at reasonable computational cost.
Conclusion: GRAPE provides a unified, principled design space for positional encoding that subsumes existing methods and enables systematic exploration of new positional geometries for long-context transformer models.
Abstract: We present GRAPE (Group Representational Position Encoding), a unified framework for positional encoding based on group actions. GRAPE unifies two families of mechanisms: (i) multiplicative rotations (Multiplicative GRAPE) in $\operatorname{SO}(d)$ and (ii) additive logit biases (Additive GRAPE) arising from unipotent actions in the general linear group $\mathrm{GL}$. In Multiplicative GRAPE, a position $n \in \mathbb{Z}$ (or $t \in \mathbb{R}$) acts as $\mathbf{G}(n) = \exp(n , ω, \mathbf{L})$ with a rank-2 skew-symmetric generator $\mathbf{L} \in \mathbb{R}^{d \times d}$, yielding a relative, compositional, norm-preserving map with a closed-form matrix exponential. RoPE is recovered exactly when the $d/2$ planes correspond to canonical coordinate pairs with a log-uniform spectrum. Learned commuting subspaces and compact non-commuting mixtures strictly extend this geometry to capture cross-subspace feature coupling at $O(d)$ and $O(r d)$ cost per head, respectively. In Additive GRAPE, additive logits arise from rank-1 (or low-rank) unipotent actions, recovering ALiBi and the Forgetting Transformer (FoX) as exact special cases while preserving an exact relative law and streaming cacheability. Overall, GRAPE provides a principled design space for positional geometry in long-context models, subsuming RoPE and ALiBi as special cases. Project page: https://github.com/model-architectures/GRAPE.
[557] Risk-Aware Linear Bandits: Theory and Applications in Smart Order Routing
Jingwei Ji, Renyuan Xu, Ruihao Zhu
Main category: cs.LG
TL;DR: Risk-aware linear bandits for smart order routing with mean-variance optimization, proposing RISE and RISE++ algorithms with near-optimal regret bounds.
Details
Motivation: Address practical considerations in financial ML like risk aversion and large action spaces, specifically for smart order routing applications where preliminary observations show linear price impacts in NASDAQ data.Method: Propose two algorithms: instance-independent Risk-Aware Explore-then-Commit (RISE) and instance-dependent Risk-Aware Successive Elimination (RISE++), both leveraging variance-minimizing G-optimal design for risk-aware linear bandits.
Result: Algorithms achieve near-optimal regret upper bounds, significantly outperform competing methods in synthetic and NASDAQ ITCH dataset experiments, and validate linear structure assumption in real financial data.
Conclusion: Risk-aware linear bandits framework effectively addresses financial decision-making challenges, with proposed algorithms showing superior performance in complex scenarios while leveraging linear structure of price impacts.
Abstract: Motivated by practical considerations in machine learning for financial decision-making, such as risk aversion and large action space, we consider risk-aware bandits optimization with applications in smart order routing (SOR). Specifically, based on preliminary observations of linear price impacts made from the NASDAQ ITCH dataset, we initiate the study of risk-aware linear bandits. In this setting, we aim at minimizing regret, which measures our performance deficit compared to the optimum’s, under the mean-variance metric when facing a set of actions whose rewards are linear functions of (initially) unknown parameters. Driven by the variance-minimizing globally-optimal (G-optimal) design, we propose the novel instance-independent Risk-Aware Explore-then-Commit (RISE) algorithm and the instance-dependent Risk-Aware Successive Elimination (RISE++) algorithm. Then, we rigorously analyze their near-optimal regret upper bounds to show that, by leveraging the linear structure, our algorithms can dramatically reduce the regret when compared to existing methods. Finally, we demonstrate the performance of the algorithms by conducting extensive numerical experiments in the SOR setup using both synthetic datasets and the NASDAQ ITCH dataset. Our results reveal that 1) The linear structure assumption can indeed be well supported by the Nasdaq dataset; and more importantly 2) Both RISE and RISE++ can significantly outperform the competing methods, in terms of regret, especially in complex decision-making scenarios.
[558] Output Embedding Centering for Stable LLM Pretraining
Felix Stollenwerk, Anna Lokrantz, Niclas Hertzberg
Main category: cs.LG
TL;DR: OEC (output embedding centering) addresses training instability in LLMs by fixing anisotropic embeddings, outperforming z-loss and matching logit soft-capping.
Details
Motivation: Large language model pretraining suffers from output logit divergence instability at training end; existing solutions (z-loss, logit soft-capping) treat symptoms not root causes.Method: Analyzed instability from output embeddings’ geometry perspective, identified anisotropic embeddings as source, proposed OEC with two implementations: deterministic μ-centering and regularization μ-loss.
Result: Both OEC variants suppress output logit divergence, outperform z-loss in training stability, match logit soft-capping performance, with μ-loss being less sensitive to hyperparameter tuning than z-loss.
Conclusion: OEC addresses root cause of training instability via output embedding geometry, providing effective mitigation strategy for LLM pretraining stability issues.
Abstract: Pretraining of large language models is not only expensive but also prone to certain training instabilities. A specific instability that often occurs at the end of training is output logit divergence. The most widely used mitigation strategies, z-loss and logit soft-capping, merely address the symptoms rather than the underlying cause of the problem. In this paper, we analyze the instability from the perspective of the output embeddings’ geometry and identify anisotropic embeddings as its source. Based on this, we propose output embedding centering (OEC) as a new mitigation strategy, and demonstrate that it suppresses output logit divergence. OEC can be implemented in two different ways: as a deterministic operation called $μ$-centering, or a regularization method called $μ$-loss. Our experiments show that both variants outperform z-loss in terms of training stability, while being on par with logit soft-capping. This holds true both in the presence and the absence of weight tying. As a secondary result, we find that $μ$-loss is significantly less sensitive to regularization hyperparameter tuning than z-loss.
[559] CompressedScaffnew: The First Theoretical Double Acceleration of Communication from Local Training and Compression in Distributed Optimization
Laurent Condat, Ivan Agarský, Peter Richtárik
Main category: cs.LG
TL;DR: CompressedScaffnew is a distributed optimization algorithm that combines infrequent communication and gradient compression to accelerate convergence in strongly convex settings.
Details
Motivation: Communication is the main bottleneck in distributed optimization, where many machines alternate between local computations and server communication. Two common acceleration strategies are infrequent communication (local iterations) and compressed communication, but no existing algorithm effectively combines both.Method: Proposes CompressedScaffnew, which jointly harnesses both strategies: performing multiple local computation iterations between communication rounds and communicating compressed information instead of full-dimensional vectors.
Result: The algorithm converges linearly to an exact solution in strongly convex settings with a doubly accelerated rate, benefiting from both local training (better dependency on condition number) and compression (better dependency on model dimension).
Conclusion: CompressedScaffnew is the first distributed optimization algorithm that effectively combines infrequent communication and compression, achieving accelerated convergence by addressing both major communication bottlenecks.
Abstract: In distributed optimization, a large number of machines alternate between local computations and communication with a coordinating server. Communication, which can be slow and costly, is the main bottleneck in this setting. To reduce this burden and therefore accelerate distributed gradient descent, two strategies are popular: 1) communicate less frequently; that is, perform several iterations of local computations between the communication rounds; and 2) communicate compressed information instead of full-dimensional vectors. We propose CompressedScaffnew, the first algorithm for distributed optimization that jointly harnesses these two strategies and converges linearly to an exact solution in the strongly convex setting, with a doubly accelerated rate: it benefits from the two acceleration mechanisms provided by local training and compression, namely a better dependency on the condition number of the functions and on the dimension of the model, respectively.
[560] One Sample to Rule Them All: Extreme Data Efficiency in Multidiscipline Reasoning with Reinforcement Learning
Yiyuan Li, Zhen Huang, Yanan Wu, Weixun Wang, Xuefeng Li, Yijia Luo, Wenbo Su, Bo Zheng, Pengfei Liu
Main category: cs.LG
TL;DR: Polymath learning enables one-shot reinforcement learning for LLMs using strategically designed single samples that improve multidisciplinary reasoning across domains like physics, chemistry, and biology.
Details
Motivation: Existing RL approaches for LLMs require large volumes of high-quality samples, but this paper challenges conventional data requirements by exploring whether a single, well-designed sample can effectively improve reasoning capabilities across multiple domains.Method: Introduces polymath learning framework for designing one training sample that elicits multidisciplinary reasoning improvement. Analyzes salient mathematical skills to identify characteristics of effective polymath samples, and engineers synthetic samples integrating multidisciplinary elements and broader skill coverage.
Result: A single strategically selected math reasoning sample produces significant performance improvements across multiple domains. Engineered synthetic samples achieve stronger performance than naturally occurring individual samples. Polymath learning outperforms larger datasets across various reasoning benchmarks.
Conclusion: Reasoning structure and skills in samples, rather than quantity, may be key to unlocking enhanced reasoning capabilities in language models. Proposes “sample engineering” as a shift toward precision engineering of samples that complements simply increasing data volume.
Abstract: The reasoning ability of large language models (LLMs) can be unleashed with reinforcement learning (RL) (OpenAI, 2024; DeepSeek-AI et al., 2025a; Zeng et al., 2025). The success of existing RL attempts in LLMs usually rely on high-quality samples of large volumes. In this paper, we challenge conventional assumptions about data requirements in RL for LLMs by demonstrating the effectiveness of one-shot reinforcement learning. Specifically, we introduce polymath learning, a framework for designing one training sample that elicits multidisciplinary reasoning improvement. We present three key findings: (1) A single, strategically selected math reasoning sample can produce significant performance improvements across multiple domains, including physics, chemistry, and biology; (2) Analysis of salient mathematical skills provides insight into the characteristics associated with effective polymath samples; and (3) An engineered synthetic sample that integrates multidisciplinary elements and broader skill coverage achieves stronger performance than naturally occurring individual samples. Across various reasoning benchmarks, polymath learning achieves stronger performance than larger datasets, demonstrating that reasoning structure and skills in samples, rather than quantity, may be the key to unlock enhanced reasoning capabilities in language models. Our results suggest a shift, dubbed as sample engineering, toward precision engineering of samples that complements simply increasing data volume.
[561] Transformers Can Solve Non-Linear and Non-Markovian Filtering Problems in Continuous Time For Conditionally Gaussian Signals
Blanka Horvath, Anastasis Kratsios, Yannick Limmer, Xuwei Yang
Main category: cs.LG
TL;DR: Filterformers - transformer models that can approximate optimal stochastic filtering for non-Markovian Gaussian signals with noisy measurements, using customized attention mechanisms for path embeddings and Wasserstein geometry.
Details
Motivation: Address the open problem of whether attention-based deep learning models (like transformers) can solve stochastic filtering problems, which remains largely unknown despite recent interest in using these models for filtering tasks.Method: Propose filterformers - continuous-time transformer models with two custom attention mechanisms: 1) bi-Lipschitz embeddings of regular path sets into low-dimensional Euclidean spaces without dimension reduction error, 2) attention tailored to Gaussian measures in 2-Wasserstein space. Analysis uses stability estimates of robust optimal filters in conditionally Gaussian settings.
Result: Show that filterformers can approximately implement the conditional law of broad classes of non-Markovian, conditionally Gaussian signal processes given noisy continuous-time measurements, with approximation guarantees holding uniformly over regular compact path subsets using worst-case 2-Wasserstein distance.
Conclusion: Provides affirmative answer to open problem: attention-based models can solve stochastic filtering problems, with filterformers offering theoretical guarantees for approximating optimal filters in continuous-time settings with non-Markovian signals.
Abstract: The use of attention-based deep learning models in stochastic filtering, e.g. transformers and deep Kalman filters, has recently come into focus; however, the potential for these models to solve stochastic filtering problems remains largely unknown. The paper provides an affirmative answer to this open problem in the theoretical foundations of machine learning by showing that a class of continuous-time transformer models, called \textit{filterformers}, can approximately implement the conditional law of a broad class of non-Markovian and conditionally Gaussian signal processes given noisy continuous-time (possibly non-Gaussian) measurements. Our approximation guarantees hold uniformly over sufficiently regular compact subsets of continuous-time paths, where the worst-case 2-Wasserstein distance between the true optimal filter and our deep learning model quantifies the approximation error. Our construction relies on two new customizations of the standard attention mechanism: The first can losslessly adapt to the characteristics of a broad range of paths since we show that the attention mechanism implements bi-Lipschitz embeddings of sufficiently regular sets of paths into low-dimensional Euclidean spaces; thus, it incurs no ``dimension reduction error’’. The latter attention mechanism is tailored to the geometry of Gaussian measures in the $2$-Wasserstein space. Our analysis relies on new stability estimates of robust optimal filters in the conditionally Gaussian setting.
[562] Automatic selection of the best neural architecture for time series forecasting
Qianying Cao, Shanqing Liu, Alan John Varghese, Jerome Darbon, Michael Triantafyllou, George Em Karniadakis
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2501.12215: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2501.12215&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[563] Fragility-aware Classification for Understanding Risk and Improving Generalization
Chen Yang, Zheng Cui, Daniel Zhuoyu Long, Jin Qi, Ruohan Zhan
Main category: cs.LG
TL;DR: Unable to analyze paper 2502.13024 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract could not be retrievedMethod: Cannot determine method as abstract could not be retrieved
Result: Cannot determine results as abstract could not be retrieved
Conclusion: Cannot draw conclusions without access to the paper’s content
Abstract: Failed to fetch summary for 2502.13024: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.13024&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[564] TEDDY: A Family Of Foundation Models For Understanding Single Cell Biology
Alexis Chevalier, Soumya Ghosh, Urvi Awasthi, James Watkins, Julia Bieniewska, Nichita Mitrea, Olga Kotova, Kirill Shkura, Andrew Noble, Michael Steinbaugh, Vijay Sadashivaiah, George Dasoulas, Julien Delile, Christoph Meier, Leonid Zhukov, Iya Khalil, Srayanta Mukherjee, Judith Mueller
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2503.03485: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2503.03485&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[565] Intern-S1-Pro: Scientific Multimodal Foundation Model at Trillion Scale
Yicheng Zou, Dongsheng Zhu, Lin Zhu, Tong Zhu, Yunhua Zhou, Peiheng Zhou, Xinyu Zhou, Dongzhan Zhou, Zhiwang Zhou, Yuhao Zhou, Bowen Zhou, Zhanping Zhong, Zhijie Zhong, Haiteng Zhao, Penghao Zhao, Xiaomeng Zhao, Zhiyuan Zhao, Yechen Zhang, Jin Zhang, Wenwei Zhang, Hongjie Zhang, Zhuo Zhang, Wenlong Zhang, Bo Zhang, Chao Zhang, Chen Zhang, Yuhang Zang, Fei Yuan, Jiakang Yuan, Jiashuo Yu, Jinhui Yin, Haochen Ye, Qian Yao, Bowen Yang, Danni Yang, Kaichen Yang, Ziang Yan, Jun Xu, Yicheng Xu, Wanghan Xu, Xuenan Xu, Chao Xu, Ruiliang Xu, Shuhao Xing, Long Xing, Xinchen Xie, Ling-I Wu, Zijian Wu, Zhenyu Wu, Lijun Wu, Yue Wu, Jianyu Wu, Wen Wu, Fan Wu, Xilin Wei, Qi Wei, Bingli Wang, Rui Wang, Ziyi Wang, Zun Wang, Yi Wang, Haomin Wang, Yizhou Wang, Lintao Wang, Yiheng Wang, Longjiang Wang, Bin Wang, Jian Tong, Zhongbo Tian, Huanze Tang, Chen Tang, Shixiang Tang, Yu Sun, Qiushi Sun, Xuerui Su, Qisheng Su, Chenlin Su, Demin Song, Jin Shi, Fukai Shang, Yuchen Ren, Pengli Ren, Xiaoye Qu, Yuan Qu, Jiantao Qiu, Yu Qiao, Biqing Qi, Runyu Peng, Tianshuo Peng, Jiahui Peng, Qizhi Pei, Zhuoshi Pan, Linke Ouyang, Wenchang Ning, Yichuan Ma, Zerun Ma, Ningsheng Ma, Runyuan Ma, Chengqi Lyu, Haijun Lv, Han Lv, Lindong Lu, Kuikun Liu, Jiangning Liu, Yuhong Liu, Kai Liu, Hongwei Liu, Zhoumianze Liu, Mengjie Liu, Ziyu Liu, Wenran Liu, Yang Liu, Liwei Liu, Kaiwen Liu, Junyao Lin, Junming Lin, Tianyang Lin, Dahua Lin, Jianze Liang, Linyang Li, Peiji Li, Zonglin Li, Zehao Li, Pengze Li, Guoyan Li, Lingkai Kong, Linglin Jing, Zhenjiang Jin, Feifei Jiang, Qian Jiang, Junhao Huang, Zixian Huang, Haian Huang, Zhouqi Hua, Ermo Hua, Han Hu, Linfeng Hou, Yinan He, Conghui He, Tianyao He, Xu Guo, Qipeng Guo, Aijia Guo, Yuzhe Gu, Lixin Gu, Jingyang Gong, Qiming Ge, Jiaye Ge, Songyang Gao, Jianfei Gao, Xinyu Fang, Caihua fan, Yue Fan, Yanhui Duan, Zichen Ding, Shengyuan Ding, Ning Ding, Xuanlang Dai, Erfei Cui, Ganqu Cui, Pei Chu, Tao Chu, Guangran Cheng, Yu Cheng, Kai Chen, Yongkang Chen, Chiyu Chen, Guanzhou Chen, Qiaosheng Chen, Sitao Chen, Xin Chen, Haojiong Chen, Yicheng Chen, Weihan Cao, Yuhang Cao, Qinglong Cao, Lei Bai
Main category: cs.LG
TL;DR: Intern-S1-Pro is a 1-trillion parameter scientific multimodal foundation model that combines general reasoning capabilities with specialized scientific expertise across chemistry, materials, life sciences, and earth sciences, featuring advanced agent capabilities and efficient RL training infrastructure.
Details
Motivation: To create the first one-trillion-parameter scientific multimodal foundation model that bridges general intelligence with specialized scientific expertise, addressing the need for models that can handle both broad reasoning tasks and deep scientific domain knowledge across multiple critical science fields.Method: Scales to 1-trillion parameters using XTuner and LMDeploy infrastructure for efficient Reinforcement Learning training while maintaining precision consistency between training and inference. Integrates multimodal capabilities with specialized scientific knowledge across 100+ tasks in chemistry, materials, life sciences, and earth sciences.
Result: Achieves comprehensive enhancement across general and scientific domains, demonstrating top-tier open-source performance for general capabilities while outperforming proprietary models in specialized scientific tasks. Features advanced agent capabilities and robust multimodal understanding.
Conclusion: Intern-S1-Pro successfully creates a “Specializable Generalist” model that fuses general intelligence with deep scientific expertise, establishing a new benchmark for large-scale multimodal scientific foundation models with practical infrastructure support for training and deployment.
Abstract: We introduce Intern-S1-Pro, the first one-trillion-parameter scientific multimodal foundation model. Scaling to this unprecedented size, the model delivers a comprehensive enhancement across both general and scientific domains. Beyond stronger reasoning and image-text understanding capabilities, its intelligence is augmented with advanced agent capabilities. Simultaneously, its scientific expertise has been vastly expanded to master over 100 specialized tasks across critical science fields, including chemistry, materials, life sciences, and earth sciences. Achieving this massive scale is made possible by the robust infrastructure support of XTuner and LMDeploy, which facilitates highly efficient Reinforcement Learning (RL) training at the 1-trillion parameter level while ensuring strict precision consistency between training and inference. By seamlessly integrating these advancements, Intern-S1-Pro further fortifies the fusion of general and specialized intelligence, working as a Specializable Generalist, demonstrating its position in the top tier of open-source models for general capabilities, while outperforming proprietary models in the depth of specialized scientific tasks.
[566] A Simultaneous Approach for Training Neural Differential-Algebraic Systems of Equations
Laurens R. Lueg, Victor Alves, Daniel Schicksnus, John R. Kitchin, Carl D. Laird, Lorenz T. Biegler
Main category: cs.LG
TL;DR: Unable to analyze paper 2504.04665 due to HTTP 429 error when fetching the abstract from arXiv API
Details
Motivation: Cannot determine motivation without access to the paper abstractMethod: Cannot determine method without access to the paper abstract
Result: Cannot determine results without access to the paper abstract
Conclusion: Cannot determine conclusion without access to the paper abstract
Abstract: Failed to fetch summary for 2504.04665: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.04665&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[567] Where You Place the Norm Matters: From Prejudiced to Neutral Initializations
Emanuele Francazi, Francesco Pinto, Aurelien Lucchi, Marco Baity-Jesi
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2505.11312: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.11312&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[568] GradPower: Powering Gradients for Faster Language Model Pre-Training
Jinbo Wang, Mingze Wang, Jiaqi Zhang, Wei Wang, Peng Pei, Xunliang Cai, Weinan E, Lei Wu
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailable due to server rate limitingMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about paper content due to access limitations
Abstract: Failed to fetch summary for 2505.24275: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2505.24275&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[569] DiffGradCAM: A Universal Class Activation Map Resistant to Adversarial Training
Jacob Piland, Chris Sweet, Adam Czajka
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2506.08514: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.08514&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[570] Beyond Laplace and Gaussian: Exploring the Generalized Gaussian Mechanism for Private Machine Learning
Roy Rinberg, Ilia Shumailov, Vikrant Singhal, Rachel Cummings, Nicolas Papernot
Main category: cs.LG
TL;DR: Paper 2506.12553: Unable to fetch abstract due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2506.12553: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.12553&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[571] A Residual Guided strategy with Generative Adversarial Networks in training Physics-Informed Transformer Networks
Ziyang Zhang, Feifan Zhang, Weidong Tang, Lei Shi, Tailai Chen
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2508.00855: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.00855&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[572] Modeling Irregular Astronomical Time Series with Neural Stochastic Delay Differential Equations
YongKyung Oh, Seungsu Kam, Dong-Young Lim, Sungil Kim
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2508.17521: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.17521&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[573] Generalized Machine Learning for Fast Calibration of Agent-Based Epidemic Models
Sima Najafzadehkhoei, George Vega Yon, Derek S.Meyer, Bernardo Modenesi
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2509.07013: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2509.07013&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[574] StelLA: Subspace Learning in Low-rank Adaptation using Stiefel Manifold
Zhizhong Li, Sina Sajadmanesh, Jingtao Li, Lingjuan Lyu
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusions as paper content could not be retrieved
Abstract: Failed to fetch summary for 2510.01938: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.01938&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[575] Reinforcement Learning-based Task Offloading in the Internet of Wearable Things
Waleed Bin Qaim, Aleksandr Ometov, Claudia Campolo, Antonella Molinaro, Elena Simona Lohan, Jari Nurmi
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper detailsMethod: Cannot analyze method as paper content is unavailable due to HTTP 429 error
Result: No results available - paper summary fetch failed
Conclusion: Cannot provide analysis due to technical limitations in accessing the paper
Abstract: Failed to fetch summary for 2510.07487: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.07487&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[576] What Do Temporal Graph Learning Models Learn?
Abigail J. Hayes, Tobias Schumacher, Markus Strohmaier
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed data retrievalMethod: Unable to determine method due to failed data retrieval
Result: Unable to determine results due to failed data retrieval
Conclusion: Unable to draw conclusions due to failed data retrieval
Abstract: Failed to fetch summary for 2510.09416: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.09416&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[577] On the Identifiability of Tensor Ranks via Prior Predictive Matching
Eliezer da Silva, Arto Klami, Diego Mesquita, Iñigo Urteaga
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.14523: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.14523&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[578] Doubly Robust Estimation of Causal Effects in Strategic Equilibrium Systems
Sibo Xiao
Main category: cs.LG
TL;DR: Unable to analyze paper 2510.15555 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation without access to paper abstractMethod: Cannot determine method without access to paper abstract
Result: Cannot determine results without access to paper abstract
Conclusion: Cannot draw conclusions without access to paper abstract
Abstract: Failed to fetch summary for 2510.15555: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.15555&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[579] Partial VOROS: A Cost-aware Performance Metric for Binary Classifiers with Precision and Capacity Constraints
Christopher Ratigan, Kyle Heuton, Carissa Wang, Lenore Cowen, Michael C. Hughes
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.18520: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.18520&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[580] Interpretable Diagnostics and Adaptive Data Assimilation for Neural ODEs via Discrete Empirical Interpolation
Hojin Kim, Romit Maulik
Main category: cs.LG
TL;DR: Paper analysis unavailable due to HTTP 429 error when fetching abstract from arXiv
Details
Motivation: Unable to determine motivation due to abstract fetch failureMethod: Unable to determine method due to abstract fetch failure
Result: Unable to determine results due to abstract fetch failure
Conclusion: Unable to determine conclusion due to abstract fetch failure
Abstract: Failed to fetch summary for 2510.21852: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.21852&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[581] A Review of Neural Networks in Precipitation Prediction
Yugong Zeng, Jiayuan Wang, Jonathan Wu
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2510.22855: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.22855&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[582] Machine Learning Guided Optimal Transmission Switching to Mitigate Wildfire Ignition Risk
Weimin Huang, Ryan Piansky, Bistra Dilkina, Daniel K. Molzahn
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2510.25147: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.25147&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[583] Planning in Branch-and-Bound: Model-Based Reinforcement Learning for Exact Combinatorial Optimization
Paul Strang, Zacharie Alès, Côme Bissuel, Olivier Juan, Safia Kedad-Sidhoum, Emmanuel Rachelson
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting preventing access to paper contentMethod: Cannot analyze method as paper content is unavailable
Result: No results available due to access restrictions
Conclusion: Cannot provide conclusion without access to paper content
Abstract: Failed to fetch summary for 2511.09219: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.09219&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[584] Towards Trustworthy Wi-Fi CSI-based Sensing: Systematic Evaluation of Adversarial Robustness
Shreevanth Krishnaa Gopalakrishnan, Stephen Hailes
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to missing paper contentMethod: Unable to determine method due to missing paper content
Result: Unable to determine results due to missing paper content
Conclusion: Unable to determine conclusion due to missing paper content
Abstract: Failed to fetch summary for 2511.20456: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.20456&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[585] Deep Reinforcement Learning for Dynamic Algorithm Configuration: A Case Study on Optimizing OneMax with the (1+($λ$,$λ$))-GA
Tai Nguyen, Phong Le, André Biedenkapp, Carola Doerr, Nguyen Dang
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2512.03805: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.03805&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[586] Hybrid Quantum-Classical Autoencoders for Unsupervised Network Intrusion Detection
Mohammad Arif Rasyidi, Omar Alhussein, Sami Muhaidat, Ernesto Damiani
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2512.05069: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.05069&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[587] Beyond the Black Box: Identifiable Interpretation and Control in Generative Models via Causal Minimality
Lingjing Kong, Shaoan Xie, Guangyi Chen, Yuewen Sun, Xiangchen Song, Eric P. Xing, Kun Zhang
Main category: cs.LG
TL;DR: Paper ID 2512.10720 could not be retrieved due to HTTP 429 error (rate limiting).
Details
Motivation: Unable to determine motivation due to retrieval failure.Method: Unable to determine method due to retrieval failure.
Result: Unable to determine results due to retrieval failure.
Conclusion: Unable to determine conclusion due to retrieval failure.
Abstract: Failed to fetch summary for 2512.10720: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.10720&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[588] When the Server Steps In: Calibrated Updates for Fair Federated Learning
Tianrun Yu, Kaixiang Zhao, Cheng Zhang, Anjun Gao, Yueyang Quan, Zhuqing Liu, Minghong Fang
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2601.05352: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.05352&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[589] Multigrade Neural Network Approximation
Shijun Zhang, Zuowei Shen, Yuesheng Xu
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2601.16884: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.16884&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[590] PUNCH: Physics-informed Uncertainty-aware Network for Coronary Hemodynamics
Sukirt Thakur, Marcus Roper, Yang Zhou, Dmitry Yu. Isaev, Reza Akbarian Bafghi, Brahmajee K. Nallamothu, C. Alberto Figueroa, Srinivas Paruchuri, Scott Burger, Carlos Collet, Maziar Raissi
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions as paper content is unavailable
Abstract: Failed to fetch summary for 2601.17192: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.17192&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[591] RPNT: Robust Pre-trained Neural Transformer – A Pathway for Generalized Motor Decoding
Hao Fang, Ryan A. Canfield, Tomohiro Ouchi, Beatrice Macagno, Eli Shlizerman, Amy L. Orsborn
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed API requestMethod: Unable to determine method due to failed API request
Result: Unable to determine results due to failed API request
Conclusion: Unable to determine conclusion due to failed API request
Abstract: Failed to fetch summary for 2601.17641: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.17641&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[592] Partial Feedback Online Learning
Shihao Shao, Cong Fang, Zhouchen Lin, Dacheng Tao
Main category: cs.LG
TL;DR: Unable to analyze paper 2601.21462 due to HTTP 429 error when fetching abstract from arXiv API
Details
Motivation: Cannot determine motivation as abstract retrieval failedMethod: Cannot determine method as abstract retrieval failed
Result: Cannot determine results as abstract retrieval failed
Conclusion: Cannot draw conclusions without access to paper content
Abstract: Failed to fetch summary for 2601.21462: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.21462&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[593] Rate-Distortion Optimization for Transformer Inference
Anderson de Andrade, Alon Harell, Ivan V. Bajić
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2601.22002: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.22002&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[594] Safer by Diffusion, Broken by Context: Diffusion LLM’s Safety Blessing and Its Failure Mode
Zeyuan He, Yupeng Chen, Lang Lin, Yihan Wang, Shenxu Chang, Eric Sommerlade, Philip Torr, Junchi Yu, Adel Bibi, Jialin Yu
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to determine conclusion due to failed paper fetch
Abstract: Failed to fetch summary for 2602.00388: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.00388&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[595] 2Mamba2Furious: Linear in Complexity, Competitive in Accuracy
Gabriel Mongaras, Eric C. Larson
Main category: cs.LG
TL;DR: Unable to analyze paper 2602.17363 due to HTTP 429 error when fetching summary from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2602.17363: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.17363&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[596] Interleaved Resampling and Refitting: Data and Compute-Efficient Evaluation of Black-Box Predictors
Haichen Hu, David Simchi-Levi
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2603.14218: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.14218&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[597] Rethinking Machine Unlearning: Models Designed to Forget via Key Deletion
Sonia Laguna, Jorge da Silva Goncalves, Moritz Vandenhirtz, Alain Ryser, Irene Cannistraci, Julia E. Vogt
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation as paper content could not be retrievedMethod: No method information available due to API rate limiting error
Result: No results available - paper summary fetch failed
Conclusion: Cannot analyze paper due to technical limitations in accessing the content
Abstract: Failed to fetch summary for 2603.15033: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.15033&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[598] Regret Bounds for Reinforcement Learning from Multi-Source Imperfect Preferences
Ming Shi, Yingbin Liang, Ness B. Shroff, Ananthram Swami
Main category: cs.LG
TL;DR: Paper 2603.20453 summary unavailable due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to abstract fetch failureMethod: Unable to determine method due to abstract fetch failure
Result: Unable to determine results due to abstract fetch failure
Conclusion: Unable to draw conclusions due to abstract fetch failure
Abstract: Failed to fetch summary for 2603.20453: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.20453&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[599] Semantic Interaction Information mediates compositional generalization in latent space
John Schwarcz
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to API rate limiting errorMethod: Unable to determine method due to API rate limiting error
Result: Unable to determine results due to API rate limiting error
Conclusion: Unable to determine conclusion due to API rate limiting error
Abstract: Failed to fetch summary for 2603.27134: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.27134&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[600] Chameleons do not Forget: Prompt-Based Online Continual Learning for Next Activity Prediction
Marwan Hassani, Tamara Verbeek, Sjoerd van Straten
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot draw conclusions about the paper due to access limitations
Abstract: Failed to fetch summary for 2604.00653: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.00653&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[601] Property-Level Flood Risk Assessment Using AI-Enabled Street-View Lowest Floor Elevation Extraction and ML Imputation Across Texas
Xiangpeng Li, Yu-Hsuan Ho, Sam D Brody, Ali Mostafavi
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation due to API access limitationsMethod: Cannot determine method due to API access limitations
Result: Cannot determine results due to API access limitations
Conclusion: Cannot determine conclusion due to API access limitations
Abstract: Failed to fetch summary for 2604.01153: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2604.01153&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[602] Information theory for dimensionality reduction in dynamical systems
Matthew S. Schmitt, Maciej Koch-Janusz, Michel Fruchart, Daniel S. Seara, Michael Rust, Vincenzo Vitelli
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2312.06608: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2312.06608&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[603] Causal K-Means Clustering
Kwangho Kim, Jisu Kim, Edward H. Kennedy
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access limitationsMethod: Unable to determine method due to access limitations
Result: Unable to determine results due to access limitations
Conclusion: Unable to draw conclusions due to access limitations
Abstract: Failed to fetch summary for 2405.03083: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2405.03083&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[604] Weighted quantization using MMD: From mean field to mean shift via gradient flows
Ayoub Belhadji, Daniel Sharp, Youssef Marzouk
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting). The paper ID 2502.10600 could not be retrieved from arXiv API.
Details
Motivation: Unable to determine motivation as the paper content could not be fetched due to API rate limiting.Method: Unable to determine method as the paper content could not be fetched.
Result: Unable to determine results as the paper content could not be fetched.
Conclusion: Unable to draw conclusions as the paper content could not be fetched.
Abstract: Failed to fetch summary for 2502.10600: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2502.10600&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[605] A Polynomial-Time Algorithm for Variational Inequalities under the Minty Condition
Ioannis Anagnostides, Gabriele Farina, Tuomas Sandholm, Brian Hu Zhang
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation as paper content could not be retrievedMethod: Unable to determine method as paper content could not be retrieved
Result: Unable to determine results as paper content could not be retrieved
Conclusion: Unable to draw conclusions as paper content could not be retrieved
Abstract: Failed to fetch summary for 2504.03432: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2504.03432&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[606] Multi-Timescale Primal Dual Hybrid Gradient with Application to Distributed Optimization
Junhui Zhang, Patrick Jaillet
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2506.15387: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.15387&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[607] AICO: Feature Significance Tests for Supervised Learning
Kay Giesecke, Enguerrand Horel, Chartsiri Jirachotkulthorn
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2506.23396: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2506.23396&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[608] Intervening to Learn and Compose Causally Disentangled Representations
Alex Markham, Isaac Hirsch, Jeri A. Chang, Liam Solus, Bryon Aragam
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation due to API access failureMethod: Unable to determine method due to API access failure
Result: Unable to determine results due to API access failure
Conclusion: Unable to determine conclusion due to API access failure
Abstract: Failed to fetch summary for 2507.04754: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.04754&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[609] Boosted Enhanced Quantile Regression Neural Networks with Spatiotemporal Permutation Entropy for Complex System Prognostics
David J Poland
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2507.14194: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.14194&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[610] Zeroth-order Logconcave Sampling
Yunbum Kook, Santosh S. Vempala
Main category: cs.LG
TL;DR: Paper ID 2507.18021 could not be fetched due to HTTP 429 error (rate limiting), so analysis cannot be performed
Details
Motivation: Unable to determine motivation due to failed paper fetchMethod: Unable to determine method due to failed paper fetch
Result: Unable to determine results due to failed paper fetch
Conclusion: Unable to draw conclusions due to failed paper fetch
Abstract: Failed to fetch summary for 2507.18021: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.18021&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[611] Computationally efficient Gauss-Newton reinforcement learning for model predictive control
Dean Brandner, Sebastien Gros, Sergio Lucia
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to draw conclusions due to fetch failure
Abstract: Failed to fetch summary for 2508.02441: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2508.02441&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[612] On the Adversarial Robustness of Learning-based Conformal Novelty Detection
Daofu Zhang, Mehrdad Pournaderi, Hanne M. Clifford, Yu Xiang, Pramod K. Varshney
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to draw conclusions due to fetch failure
Abstract: Failed to fetch summary for 2510.00463: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.00463&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[613] Adaptive Coverage Policies in Conformal Prediction
Etienne Gauthier, Francis Bach, Michael I. Jordan
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) - need to try again later or use alternative access methods
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2510.04318: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.04318&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[614] Learning with Incomplete Context: Linear Contextual Bandits with Pretrained Imputation
Hao Yan, Heyan Zhang, Yongyi Guo
Main category: cs.LG
TL;DR: Unable to analyze paper 2510.09908 due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusions about paper content due to retrieval failure
Abstract: Failed to fetch summary for 2510.09908: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.09908&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[615] Constraint-Aware Reinforcement Learning via Adaptive Action Scaling
Murad Dawood, Usama Ahmed Siddiquie, Shahram Khorshidi, Maren Bennewitz
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2510.11491: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2510.11491&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[616] Resource-Efficient Variational Quantum Classifier
Petr Ptáček, Paulina Lewandowska, Ryszard Kukulski
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to access errorMethod: Unable to determine method due to access error
Result: Unable to determine results due to access error
Conclusion: Unable to determine conclusion due to access error
Abstract: Failed to fetch summary for 2511.09204: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2511.09204&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[617] Olaf: Bringing an Animated Character to Life in the Physical World
David Müller, Espen Knoop, Dario Mylonopoulos, Agon Serifi, Michael A. Hopkins, Ruben Grandia, Moritz Bächer
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to analyze paper due to technical fetch error
Abstract: Failed to fetch summary for 2512.16705: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.16705&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[618] Linear Attention for Joint Power Optimization and User-Centric Clustering in Cell-Free Networks
Irched Chafaa, Giacomo Bacci, Luca Sanguinetti
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2512.17466: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.17466&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[619] Virasoro Symmetry in Neural Network Field Theories
Brandon Robinson
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Unable to determine motivation as paper content could not be retrievedMethod: No method information available due to API rate limiting error
Result: No results available - technical error prevented paper retrieval
Conclusion: Cannot analyze paper due to technical limitations in accessing content
Abstract: Failed to fetch summary for 2512.24420: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2512.24420&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[620] Coarsening Causal DAG Models
Francisco Madaleno, Pratik Misra, Alex Markham
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2601.10531: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2601.10531&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[621] Ratio Covers of Convex Sets and Optimal Mixture Density Estimation
Spencer Compton, Gábor Lugosi, Jaouad Mourtada, Jian Qian, Nikita Zhivotovskiy
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2602.16142: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.16142&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[622] Generative Recommendation for Large-Scale Advertising
Ben Xue, Dan Liu, Lixiang Wang, Mingjie Sun, Peng Wang, Pengfei Zhang, Shaoyun Shi, Tianyu Xu, Yunhao Sha, Zhiqiang Liu, Bo Kong, Bo Wang, Hang Yang, Jieting Xue, Junhao Wang, Shengyu Wang, Shuping Hui, Wencai Ye, Xiao Lin, Yongzhi Li, Yuhang Chen, Zhihui Yin, Quan Chen, Shiyang Wen, Wenjin Wu, Han Li, Guorui Zhou, Changcheng Li, Peng Jiang, Kun Gai
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailable due to API rate limitingMethod: Cannot determine method as paper content is unavailable due to API rate limiting
Result: Cannot determine results as paper content is unavailable due to API rate limiting
Conclusion: Cannot draw conclusions as paper content is unavailable due to API rate limiting
Abstract: Failed to fetch summary for 2602.22732: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2602.22732&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[623] Wireless Power Control Based on Large Language Models
Jiacheng Wang, Yucheng Sheng, Le Liang, Hao Ye, Shi Jin
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2603.00474: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.00474&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[624] SEAnet: A Deep Learning Architecture for Data Series Similarity Search
Qitong Wang, Themis Palpanas
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to fetch failureMethod: Unable to determine method due to fetch failure
Result: Unable to determine results due to fetch failure
Conclusion: Unable to determine conclusion due to fetch failure
Abstract: Failed to fetch summary for 2603.01448: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.01448&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[625] Hybrid Hidden Markov Model for Modeling Equity Excess Growth Rate Dynamics: A Discrete-State Approach with Jump-Diffusion
Abdulrahman Alswaidan, Jeffrey D. Varner
Main category: cs.LG
TL;DR: Unable to fetch paper details due to HTTP 429 error (rate limiting) from arXiv API
Details
Motivation: Cannot determine motivation as paper content is unavailableMethod: Cannot determine method as paper content is unavailable
Result: Cannot determine results as paper content is unavailable
Conclusion: Cannot determine conclusion as paper content is unavailable
Abstract: Failed to fetch summary for 2603.10202: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.10202&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[626] The Convergence Frontier: Integrating Machine Learning and High Performance Quantum Computing for Next-Generation Drug Discovery
Narjes Ansari, César Feniou, Nicolaï Gouraud, Daniele Loco, Siwar Badreddine, Baptiste Claudon, Félix Aviat, Marharyta Blazhynska, Kevin Gasperich, Guillaume Michel, Diata Traore, Corentin Villot, Thomas Plé, Olivier Adjoua, Louis Lagardère, Jean-Philip Piquemal
Main category: cs.LG
TL;DR: Unable to analyze paper 2603.17790 due to HTTP 429 error when fetching from arXiv API
Details
Motivation: Cannot determine motivation as paper content could not be retrievedMethod: Cannot determine method as paper content could not be retrieved
Result: Cannot determine results as paper content could not be retrieved
Conclusion: Cannot draw conclusions as paper content could not be retrieved
Abstract: Failed to fetch summary for 2603.17790: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.17790&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[627] Graph-Informed Adversarial Modeling: Infimal Subadditivity of Interpolative Divergences
Panagiota Birmpa, Eric Joseph Hall
Main category: cs.LG
TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Unable to determine motivation due to technical error in fetching paper contentMethod: Unable to determine method due to technical error in fetching paper content
Result: Unable to determine results due to technical error in fetching paper content
Conclusion: Unable to draw conclusions due to technical error in fetching paper content
Abstract: Failed to fetch summary for 2603.20025: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.20025&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
[628] Operator Learning for Smoothing and Forecasting
Edoardo Calvello, Elizabeth Carlson, Nikola Kovachki, Michael N. Manta, Andrew M. Stuart
Main category: cs.LG
TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)
Details
Motivation: Cannot determine motivation without access to paper contentMethod: Cannot determine method without access to paper content
Result: Cannot determine results without access to paper content
Conclusion: Cannot determine conclusion without access to paper content
Abstract: Failed to fetch summary for 2603.20359: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2603.20359&sortBy=relevance&sortOrder=descending&start=0&max_results=100)
cs.MA
[629] Computational Foundations for Strategic Coopetition: Formalizing Sequential Interaction and Reciprocity
Vik Pant, Eric Yu
Main category: cs.MA
TL;DR: Extends computational foundations for strategic coopetition to sequential interactions, developing bounded reciprocity functions, memory-windowed history tracking, structural reciprocity sensitivity, and trust-gated reciprocity for multi-stakeholder systems.
Details
Motivation: To understand how cooperation persists through time without binding contracts in multi-stakeholder systems, bridging conceptual modeling with game-theoretic reciprocity analysis for both human interactions and multi-agent computational systems.Method: Develops four key components: (1) bounded reciprocity response functions, (2) memory-windowed history tracking over k recent periods, (3) structural reciprocity sensitivity from interdependence matrices, and (4) trust-gated reciprocity where trust modulates responses. Uses uniaxial treatment where agents choose cooperation levels along a single continuum.
Result: Comprehensive validation across 15,625 parameter configurations shows robust reciprocity effects with all six behavioral targets exceeding thresholds (cooperation emergence 97.5%, defection punishment 100%, etc.). Empirical validation using Apple iOS App Store ecosystem (2008-2024) achieves 84.3% accuracy, reproducing documented cooperation patterns across five ecosystem phases with statistical significance (p < 0.001, Cohen’s d = 1.57).
Conclusion: The framework successfully extends computational foundations for strategic coopetition to sequential interaction dynamics, providing validated tools for analyzing cooperation in multi-stakeholder systems. This concludes the Foundations Series with uniaxial treatment, with extensions planned for biaxial treatment where cooperation and competition are independent dimensions.
Abstract: Strategic coopetition in multi-stakeholder systems requires understanding how cooperation persists through time without binding contracts. This technical report extends computational foundations for strategic coopetition to sequential interaction dynamics, bridging conceptual modeling (i* framework) with game-theoretic reciprocity analysis. We develop: (1) bounded reciprocity response functions mapping partner deviations to finite conditional responses, (2) memory-windowed history tracking capturing cognitive limitations over k recent periods, (3) structural reciprocity sensitivity derived from interdependence matrices where behavioral responses are amplified by structural dependencies, and (4) trust-gated reciprocity where trust modulates reciprocity responses. The framework applies to both human stakeholder interactions and multi-agent computational systems. Comprehensive validation across 15,625 parameter configurations demonstrates robust reciprocity effects, with all six behavioral targets exceeding thresholds: cooperation emergence (97.5%), defection punishment (100%), forgiveness dynamics (87.9%), asymmetric differentiation (100%), trust-reciprocity interaction (100%), and bounded responses (100%). Empirical validation using the Apple iOS App Store ecosystem (2008-2024) achieves 43/51 applicable points (84.3%), reproducing documented cooperation patterns across five ecosystem phases. Statistical significance confirmed at p < 0.001 with Cohen’s d = 1.57. This report concludes the Foundations Series (TR-1 through TR-4) adopting uniaxial treatment where agents choose cooperation levels along a single continuum. Companion work on interdependence (arXiv:2510.18802), trust (arXiv:2510.24909), and collective action (arXiv:2601.16237) has been prepublished. Extensions Series (TR-5 through TR-8) introduces biaxial treatment where cooperation and competition are independent dimensions.
[630] Free Information Disrupts Even Bayesian Crowds
Jonas Stein, Shannon Cruz, Davide Grossi, Martina Testori
Main category: cs.MA
TL;DR: Unconstrained information exchange in networks can harm group belief accuracy even among ideal truth-seeking agents, suggesting practical constraints may be needed for social media design.
Details
Motivation: The paper challenges the core assumption that unconstrained information exchange in networks like social media is always beneficial, investigating whether such freedom might actually harm collective belief accuracy.Method: Uses computational agent-based modeling to simulate groups of truth-seeking, cooperative agents with perfect information-processing abilities exchanging information without constraints.
Result: Even among these idealized agents, unconstrained information exchange leads to detrimental effects on the correctness of the group’s beliefs, suggesting the problem is fundamental rather than due to human cognitive limitations.
Conclusion: Constraints on information flow should be carefully considered in the design of communication networks with substantial societal impact, such as social media platforms.
Abstract: A core tenet underpinning the conception of contemporary information networks, such as social media platforms, is that users should not be constrained in the amount of information they can freely and willingly exchange with one another about a given topic. By means of a computational agent-based model, we show how even in groups of truth-seeking and cooperative agents with perfect information-processing abilities, unconstrained information exchange may lead to detrimental effects on the correctness of the group’s beliefs. If unconstrained information exchange can be detrimental even among such idealized agents, it is prudent to assume it can also be so in practice. We therefore argue that constraints on information flow should be carefully considered in the design of communication networks with substantial societal impact, such as social media platforms.
[631] Optimizing Interventions for Agent-Based Infectious Disease Simulations
Anja Wolpers, Johannes Ponge, Adelinde M. Uhrmacher
Main category: cs.MA
TL;DR: ADIOS is a system that uses Grammar-Guided Genetic Programming to automatically optimize non-pharmaceutical interventions for infectious disease control in agent-based simulations, with a case study using the German Epidemic Micro-Simulation System.
Details
Motivation: Non-pharmaceutical interventions (NPIs) are crucial for controlling infectious diseases when pharmaceutical options are unavailable, but identifying effective interventions that minimize societal disruption is challenging. Agent-based simulations can analyze intervention impacts, but automatically optimizing NPIs in these simulations is complex due to large search spaces from multiple attributes, hierarchical group structures, and arbitrary combinations.Method: ADIOS uses Grammar-Guided Genetic Programming (GGGP) with a domain-specific language for expressing NPIs in agent-based simulations. The system structures the intervention search space through a context-free grammar and can further reduce the search space by defining constraints to prevent semantically invalid intervention patterns. It includes an interface for coupling with agent-based simulations.
Result: The system demonstrates potential to generate optimal interventions for realistic epidemiological models, using the German Epidemic Micro-Simulation System (GEMS) as a case study.
Conclusion: ADIOS provides a framework for optimizing non-pharmaceutical interventions in agent-based epidemiological simulations using grammar-guided genetic programming, offering decision-makers a tool to identify effective interventions while minimizing societal disruption.
Abstract: Non-pharmaceutical interventions (NPIs) are commonly used tools for controlling infectious disease transmission when pharmaceutical options are unavailable. Yet, identifying effective interventions that minimize societal disruption remains challenging. Agent-based simulation is a popular tool for analyzing the impact of possible interventions in epidemiology. However, automatically optimizing NPIs using agent-based simulations poses a complex problem because, in agent-based epidemiological models, interventions can target individuals based on multiple attributes, affect hierarchical group structures (e.g., schools, workplaces, and families), and be combined arbitrarily, resulting in a very large or even infinite search space. We aim to support decision-makers with our Agent-based Infectious Disease Intervention Optimization System (ADIOS) that optimizes NPIs for infectious disease simulations using Grammar-Guided Genetic Programming (GGGP). The core of ADIOS is a domain-specific language for expressing NPIs in agent-based simulations that structures the intervention search space through a context-free grammar. To make optimization more efficient, the search space can be further reduced by defining constraints that prevent the generation of semantically invalid intervention patterns. Using this constrained language and an interface that enables coupling with agent-based simulations, ADIOS adopts the GGGP approach for simulation-based optimization. Using the German Epidemic Micro-Simulation System (GEMS) as a case study, we demonstrate the potential of our approach to generate optimal interventions for realistic epidemiological models
[632] HEAS: Hierarchical Evolutionary Agent-Based Simulation Framework for Multi-Objective Policy Search
Ruiyu Zhang, Lin Nie, Xin Zhao
Main category: cs.MA
TL;DR: HEAS framework eliminates metric aggregation divergence in agent-based model policy search through uniform metric contracts, reducing rank reversals by 50% and coupling code by 97%.
Details
Motivation: Metric aggregation divergence is a hidden confound in agent-based model policy search where different pipeline stages independently implement outcome metric extraction, causing champion selection to reflect aggregation artifacts rather than actual policy quality.Method: Proposes Hierarchical Evolutionary Agent Simulation (HEAS), a composable framework that eliminates this confound through a runtime-enforceable metric contract - a uniform metrics_episode() callable shared identically by all pipeline stages (optimization, tournament evaluation, and statistical validation).
Result: HEAS reduces rank reversals by 50% relative to ad-hoc aggregation; the HEAS champion wins all 32 held-out ecological scenarios; reduces coupling code by 97% (160 to 5 lines) relative to Mesa 3.3.1; validated across ecological, enterprise, and mean-field ODE dynamics case studies.
Conclusion: HEAS provides a robust solution to metric aggregation divergence in agent-based models, enabling reliable champion selection through uniform metric contracts while significantly reducing code complexity and improving interpretability of results.
Abstract: Metric aggregation divergence is a hidden confound in agent-based model policy search: when optimization, tournament evaluation, and statistical validation independently implement outcome metric extraction, champion selection reflects aggregation artifact rather than policy quality. We propose Hierarchical Evolutionary Agent Simulation (HEAS), a composable framework that eliminates this confound through a runtime-enforceable metric contract - a uniform metrics_episode() callable shared identically by all pipeline stages. Removing the confound yields robust champion selection: in a controlled experiment (n=30), HEAS reduces rank reversals by 50% relative to ad-hoc aggregation; the HEAS champion wins all 32 held-out ecological scenarios - a null-safety result that would be uninterpretable under aggregation divergence. The contract additionally reduces coupling code by 97% (160 to 5 lines) relative to Mesa 3.3.1. Three case studies validate composability across ecological, enterprise, and mean-field ordinary differential equation dynamics.
[633] Sci-Mind: Cognitively-Inspired Adversarial Debate for Autonomous Mathematical Modeling
Junhao Jia, Huangwei Chen, Ruiying Sun, Yanhui Song, Haishuai Wang, Jiajun Bu, Lei Wu
Main category: cs.MA
TL;DR: Sci-Mind is a framework for autonomous scientific modeling that mimics human scientific discovery by integrating experiential memory recall, adversarial cognitive dialectic, and self-validating execution to improve mathematical modeling rigor and code executability.
Details
Motivation: Current autonomous agents powered by LLMs rely on isolated reasoning paradigms, generating plausible but fundamentally flawed models due to lack of domain grounding and adversarial verification, unlike human experts who use historical analogies and peer scrutiny.Method: 1) Experiential Memory Recall retrieves executable code snippets and modeling paradigm descriptors; 2) Adversarial Cognitive Dialectic uses a Theorist (optimizing mathematical coherence) and Pragmatist (enforcing data feasibility) debating through competing objectives; 3) Self-Validating Execution Strategy ensures blueprint consistency through formal predicates before code generation.
Result: Extensive experiments on MM-Bench and EngiBench demonstrate Sci-Mind significantly outperforms leading autonomous agents in both modeling rigorousness and code executability.
Conclusion: Sci-Mind successfully mirrors human scientific discovery processes, addressing limitations of current LLM-powered agents by integrating experiential grounding, adversarial verification, and formal validation for improved autonomous mathematical modeling.
Abstract: Real-world mathematical modeling is inherently an experiential and collaborative endeavor. Domain experts rarely solve complex problems from scratch; instead, they draw upon analogies from historical cases and subject their hypotheses to rigorous peer scrutiny. However, autonomous agents powered by Large Language Models predominantly rely on isolated reasoning paradigms, frequently generating plausible but fundamentally flawed models due to a lack of domain grounding and adversarial verification. To address these limitations, we propose Sci-Mind, a novel framework that mirrors the human scientific discovery process. Sci-Mind integrates Experiential Memory Recall to retrieve executable code snippets and modeling paradigm descriptors, grounding abstract reasoning in historical solutions. Subsequently, it employs an Adversarial Cognitive Dialectic where a Theorist optimizing mathematical coherence and a Pragmatist enforcing data feasibility debate through competing objectives to prune elegant but infeasible formulations. A Self-Validating Execution Strategy further ensures blueprint consistency through formal predicates before code generation, achieving fully autonomous execution. Extensive experiments on the MM-Bench and EngiBench demonstrate that Sci-Mind significantly outperforms leading autonomous agents in both modeling rigorousness and code executability.
cs.MM
[634] Semantic Compensation via Adversarial Removal for Robust Zero-Shot ECG Diagnosis
Hongjun Liu, Rujun Han, Leyu Zhou, Chao Yao
Main category: cs.MM
TL;DR: SCAR is a robust ECG-language pretraining framework that improves semantic alignment between cardiac signals and clinical text under missing data conditions using adversarial removal and semantic compensation.
Details
Motivation: Existing ECG-language pretraining methods assume fully observed ECG data, but in practice, diagnostically critical leads or temporal segments may be missing due to electrode detachment, motion artifacts, or signal corruption, causing severe degradation of cross-modal semantic alignment.Method: SCAR uses a differentiable adversarial masker to remove the most alignment-critical spatio-temporal ECG tokens during training, forcing the ECG encoder to learn robust representations. It also includes a semantically supervised adaptive selector that reweights remaining visible tokens and compensates with secondary diagnostic cues.
Result: Experiments on 6 datasets show SCAR consistently improves semantic robustness under joint lead and temporal missingness, with clear advantages in harder cases where primary diagnostic evidence is unavailable, while also yielding stronger linear-probing transferability.
Conclusion: SCAR provides an effective framework for robust ECG-language pretraining that maintains semantic alignment under realistic missing data conditions, with the introduced Counterfactual Missingness Resolution Score (CMRS) offering a better evaluation metric for robustness beyond classification accuracy.
Abstract: Recent ECG–language pretraining methods enable zero-shot diagnosis by aligning cardiac signals with clinical text, but they do not explicitly model robustness to partial observation and are typically studied under fully observed ECG settings. In practice, diagnostically critical leads or temporal segments may be missing due to electrode detachment, motion artifacts, or signal corruption, causing severe degradation of cross-modal semantic alignment. In this paper, we propose \textbf{SCAR}, a robust ECG–language pretraining framework for \textbf{S}emantic \textbf{C}ompensation via \textbf{A}dversarial \textbf{R}emoval. SCAR improves robustness by explicitly training the model to remain semantically aligned with semantically critical missingness and to recover diagnostic meaning from the remaining visible evidence. Specifically, we introduce a differentiable adversarial masker to remove the most alignment-critical spatio-temporal ECG tokens during training, forcing the ECG encoder to learn representations that remain semantically aligned with clinical text even when primary diagnostic evidence is missing. Under such adversarial corruption, we equip the ECG encoder with a semantically supervised adaptive selector that learns to reweight the remaining visible tokens and compensate with secondary yet diagnostically informative morphological cues. To evaluate robustness beyond classification accuracy, we further introduce Counterfactual Missingness Resolution Score (CMRS), which quantifies how well feature preserve diagnostic semantics under missingness. Experiments on $6$ datasets show that SCAR consistently improves semantic robustness under joint lead and temporal missingness, with particularly clear advantages in harder cases where primary diagnostic evidence is unavailable, while also yielding stronger linear-probing transferability.
[635] HippoMM: Hippocampal-inspired Multimodal Memory for Long Audiovisual Event Understanding
Yueqian Lin, Jingyang Zhang, Qinsi Wang, Hancheng Ye, Yuzhe Fu, Yudong Liu, Hai “Helen” Li, Yiran Chen
Main category: cs.MM
TL;DR: HippoMM is a computational cognitive architecture inspired by hippocampal mechanisms for multimodal episodic memory, achieving state-of-the-art performance on audiovisual understanding tasks while being more efficient than retrieval-augmented baselines.
Details
Motivation: Current computational systems struggle with comprehending extended audiovisual experiences, particularly in temporal integration and cross-modal associations that are fundamental to human episodic memory. The authors aim to address these challenges by drawing inspiration from hippocampal mechanisms rather than relying on scaling or architectural sophistication.Method: HippoMM implements three integrated components: 1) Episodic Segmentation that detects audiovisual input changes to split videos into discrete episodes (mirroring dentate gyrus pattern separation), 2) Memory Consolidation that compresses episodes into summaries with key features preserved (analogous to hippocampal memory formation), and 3) Hierarchical Memory Retrieval that first searches semantic summaries, then escalates via temporal window expansion around seed segments for cross-modal queries (mimicking CA3 pattern completion).
Result: On the HippoVlog benchmark testing associative memory, HippoMM achieves state-of-the-art 78.2% accuracy while operating 5x faster than retrieval-augmented baselines. The system demonstrates that cognitive architectures provide blueprints for next-generation multimodal understanding.
Conclusion: The paper shows that biologically-inspired cognitive architectures can effectively address challenges in multimodal episodic memory and audiovisual understanding, offering an alternative approach to scaling-based methods. The integrated hippocampal-inspired components work synergistically to create a system that exceeds the sum of its parts.
Abstract: Comprehending extended audiovisual experiences remains challenging for computational systems, particularly temporal integration and cross-modal associations fundamental to human episodic memory. We introduce HippoMM, a computational cognitive architecture that maps hippocampal mechanisms to solve these challenges. Rather than relying on scaling or architectural sophistication, HippoMM implements three integrated components: (i) Episodic Segmentation detects audiovisual input changes to split videos into discrete episodes, mirroring dentate gyrus pattern separation; (ii) Memory Consolidation compresses episodes into summaries with key features preserved, analogous to hippocampal memory formation; and (iii) Hierarchical Memory Retrieval first searches semantic summaries, then escalates via temporal window expansion around seed segments for cross-modal queries, mimicking CA3 pattern completion. These components jointly create an integrated system exceeding the sum of its parts. On our HippoVlog benchmark testing associative memory, HippoMM achieves state-of-the-art 78.2% accuracy while operating 5x faster than retrieval-augmented baselines. Our results demonstrate that cognitive architectures provide blueprints for next-generation multimodal understanding. The code and benchmark dataset are publicly available at https://github.com/linyueqian/HippoMM.
[636] Catalogue Grounded Multimodal Attribution for Museum Video under Resource and Regulatory Constraints
Minsak Nanang, Adrian Hilton, Armin Mustafa
Main category: cs.MM
TL;DR: Automated pipeline using video-language models to generate catalogue-style metadata for museum audiovisual archives, improving discoverability while respecting data sovereignty constraints.
Details
Motivation: Museums and galleries have growing audiovisual archives that remain inaccessible due to lack of searchable metadata, with existing manual curation methods being too labor-intensive.Method: Multi-pass pipeline using open, locally deployable video language models: (1) summarizes artworks in videos, (2) generates catalogue-style descriptions and genre labels, (3) attributes titles and artists via conservative similarity matching to existing structured catalogues.
Result: Early deployments on painting catalogues suggest the framework can improve AV archive discoverability while respecting resource constraints, data sovereignty, and emerging regulations.
Conclusion: The approach offers a transferable template for application-driven machine learning in high-stakes domains like cultural heritage preservation.
Abstract: Audiovisual (AV) archives in museums and galleries are growing rapidly, but much of this material remains effectively locked away because it lacks consistent, searchable metadata. Existing method for archiving requires extensive manual effort. We address this by automating the most labour intensive part of the workflow: catalogue style metadata curation for in gallery video, grounded in an existing collection database. Concretely, we propose catalogue-grounded multimodal attribution for museum AV content using an open, locally deployable video language model. We design a multi pass pipeline that (i) summarises artworks in a video, (ii) generates catalogue style descriptions and genre labels, and (iii) attempts to attribute title and artist via conservative similarity matching to the structured catalogue. Early deployments on a painting catalogue suggest that this framework can improve AV archive discoverability while respecting resource constraints, data sovereignty, and emerging regulation, offering a transferable template for application-driven machine learning in other high-stakes domains.
[637] A Video Steganography for H.265/HEVC Based on Multiple CU Size and Block Structure Distortion
Xiang Zhang, Wen Jiang, Fei Peng, Wenbin Huang, Ziqiang Li, Zhangjie Fu
Main category: cs.MM
TL;DR: Video steganography algorithm using multiple CU sizes and block structure distortion to embed secret info in I-frames while minimizing disruption to original block structure for better anti-steganalysis.
Details
Motivation: Existing video steganography algorithms based on block structure modification suffer from poor anti-steganalysis performance due to significant disruption of original CU block structure after embedding secret information.Method: Proposes three innovations: 1) CU Block Structure Stability Metric (CBSSM) to analyze anti-steganalysis limitations, 2) novel mapping rule based on multiple CU sizes to reduce block structure changes, 3) three-level distortion function based on block structure to guide embedding.
Result: Comprehensive experiments show CBSSM effectively evaluates anti-steganalysis performance even at low embedding rates. The proposed algorithm outperforms state-of-the-art methods in anti-steganalysis, visual quality, bitrate increase ratio, and embedding capacity.
Conclusion: The triple strategy minimizes disruption to original CU block structure while concealing information in areas where block structure changes occur after recompression, significantly enhancing anti-steganalysis performance.
Abstract: Video steganography based on block structure, which embeds secret information by modifying Coding Unit (CU) block structure of I-frames, is currently a research hotspot. However, the existing algorithms still suffer from the limitation of poor anti-steganalysis, which results from significantly disrupting the original CU block structure after embedding secret information. To overcome this limitation, this paper proposes a video steganography algorithm based on multiple CU size and block structure distortion. Our algorithm introduces three key innovations: 1) a CU Block Structure Stability Metric (CBSSM) based on CU block structure restoration phenomenon to reveal the reasons for the insufficient anti-steganalysis performance of current algorithms. 2) a novel mapping rule based on multiple CU size to reduce block structure change and enhance embedding capacity. 3) a three-level distortion function based on block structure to better guide the secret information embedding. This triple strategy ensures that the secret information embedding minimizes disruption to the original CU block structure while concealing it primarily in areas where block structure changes occur after recompression, ultimately enhancing the algorithm’s anti-steganalysis. Comprehensive experimental results highlight the crucial role of the proposed CBSSM in evaluating anti-steganalysis performance even at a low embedding rate. Meanwhile, compared to State-of-the-Art video steganography algorithms based on block structure, our proposed steganography algorithm exhibits greater anti-steganalysis, as well as further improving visual quality, bitrate increase ratio and embedding capacity.
eess.AS
[638] Reverberation-Robust Localization of Speakers Using Distinct Speech Onsets and Multi-channel Cross-Correlations
Shoufeng Lin
Main category: eess.AS
TL;DR: Two novel speaker localization algorithms using microphone arrays that address reverberation challenges through auditory filterbank decomposition and multi-channel cross-correlation techniques.
Details
Motivation: Speaker localization under strong reverberation remains a major challenge in real-world applications, as existing methods struggle with concurrent speakers and reverberant environments.Method: Two algorithms: 1) Uses auditory filterbank decomposition into subbands, speech onset detection from signal models, and multi-channel cross-correlation coefficient (MCCC) of encoded speech onsets; 2) Extends GCC-PHAT using redundant microphone information to suppress reverberation.
Result: Methods reliably locate static and moving speakers under adverse conditions (reverberation time up to 1s in simulations, 0.65s in real recordings), outperforming state-of-the-art localization methods.
Conclusion: The proposed algorithms effectively address reverberation challenges in speaker localization, demonstrating robust performance in both simulated and real reverberant environments.
Abstract: Many speaker localization methods can be found in the literature. However, speaker localization under strong reverberation still remains a major challenge in the real-world applications. This paper proposes two algorithms for localizing speakers using microphone array recordings of reverberated sounds. To separate concurrent speakers, the first algorithm decomposes microphone signals spectrotemporally into subbands via an auditory filterbank. To suppress reverberation, we propose a novel speech onset detection approach derived from the speech signal and impulse response models, and further propose to formulate the multi-channel cross-correlation coefficient (MCCC) of encoded speech onsets in each subband. The subband results are combined to estimate the directions-of-arrival (DOAs) of speakers. The second algorithm extends the generalized cross-correlation - phase transform (GCC-PHAT) method by using redundant information of multiple microphones to address the reverberation problem. The proposed methods have been evaluated under adverse conditions using not only simulated signals (reverberation time $T_{60}$ of up to $1$s) but also recordings in a real reverberant room ($T_{60} \approx 0.65$s). Comparing with some state-of-the-art localization methods, experimental results confirm that the proposed methods can reliably locate static and moving speakers, in presence of reverberation.
[639] Validating Computational Markers of Depressive Behavior: Cross-Linguistic Speech-Based Depression Detection with Neurophysiological Validation
Fuxiang Tao, Dongwei Li, Shuning Tang, Xuri Ge, Wei Ma, Anna Esposito, Alessandro Vinciarelli
Main category: eess.AS
TL;DR: Extends CDMA framework for depression detection from speech to Chinese Mandarin with EEG validation, showing cross-linguistic robustness and emotional arousal importance
Details
Motivation: To investigate cross-linguistic robustness of acoustic markers for depression detection and establish neurobiological validation through EEG correlationsMethod: Extends Cross-Data Multilevel Attention (CDMA) framework to Chinese Mandarin dataset with EEG recordings, fusing read and spontaneous speech across emotional valences
Result: Achieves state-of-the-art performance (F1-score up to 89.6%) on Chinese dataset, comparable to Italian validation; emotionally valenced speech outperforms neutral; EEG shows correlations between speech-derived depression estimates and neural oscillatory patterns
Conclusion: CDMA framework is universally applicable and neurobiologically validated, establishing novel paradigm for neurophysiological validation of computational mental health models
Abstract: Speech-based depression detection has shown promise as an objective diagnostic tool, yet the cross-linguistic robustness of acoustic markers and their neurobiological underpinnings remain underexplored. This study extends Cross-Data Multilevel Attention (CDMA) framework, initially validated on Italian, to investigate these dimensions using a Chinese Mandarin dataset with Electroencephalography (EEG) recordings. We systematically fuse read speech with spontaneous speech across different emotional valences (positive, neutral, negative) to investigate whether emotional arousal is a more critical factor than valence polarity in enhancing detection performance in speech. Additionally, we establish the first neurophysiological validation for a speech-based depression model by correlating its predictions with neural oscillatory patterns during emotional face processing. Our results demonstrate strong cross-linguistic generalizability of the CDMA framework, achieving state-of-the-art performance (F1-score up to 89.6%) on the Chinese dataset, which is comparable to the previous Italian validation. Critically, emotionally valenced speech (both positive and negative) significantly outperformed neutral speech. This comparable performance between positive and negative tasks supports the emotional arousal hypothesis. Most importantly, EEG analysis revealed significant correlations between the model’s speech-derived depression estimates and neural oscillatory patterns (theta and alpha bands), demonstrating alignment with established neural markers of emotional dysregulation in depression. This alignment, combined with the model’s cross-linguistic robustness, not only supports that the CDMA framework’s approach is a universally applicable and neurobiologically validated strategy but also establishes a novel paradigm for the neurophysiological validation of computational mental health models.
[640] Robust Pitch Estimation and Tracking for Speakers Based on Subband Encoding and the Generalized Labeled Multi-Bernoulli Filter
Shoufeng Lin
Main category: eess.AS
TL;DR: Novel pitch estimation and tracking method using auditory filterbank decomposition, CASA-inspired encoding, and GLMB filtering with Ornstein-Uhlenbeck process for temporal continuity
Details
Motivation: To develop more accurate and robust pitch estimation and tracking methods for speakers that can handle various noise conditions and reverberation, improving upon existing state-of-the-art approachesMethod: 1) Decompose sound signal into subbands using auditory filterbank with novel frequency coverage metric; 2) Encode subband signals using CASA-inspired approach; 3) Calculate normalized autocorrelations for pitch estimation; 4) Use GLMB filter for pitch tracking with Ornstein-Uhlenbeck process state transition model and measurement-driven birth model
Result: Achieved better accuracy than several state-of-the-art pitch estimation methods in most scenarios with various additive noises, and demonstrated robustness against reverberation in real recordings
Conclusion: The proposed pitch estimation and tracking method effectively handles noise and reverberation through auditory-inspired processing and advanced filtering techniques
Abstract: This paper proposes a new pitch estimator and a novel pitch tracker for speakers. We first decompose the sound signal into subbands using an auditory filterbank, assuming time-frequency sparsity of human speech. Instead of directly selecting the number of subbands according to experience, we propose a novel frequency coverage metric to derive the number of subbands and the center frequencies of the filterbank. The subband signals are then encoded inspired by the computational auditory scene analysis (CASA) approach, and the normalized autocorrelations are calculated for pitch estimation. To suppress spurious errors and track the speaker identity, the temporal continuity constraint is exploited and a Generalized Labeled Multi-Bernoulli (GLMB) filter is adapted for pitch tracking, where we use a novel pitch state transition model based on the Ornstein-Uhlenbeck process, and the measurement driven birth model for adaptive new births of pitch targets. Experimental evaluations with various additive noises demonstrate that the proposed methods have achieved better accuracy compared with several state-of-the-art pitch estimation methods in most studied scenarios. Tests using real recordings in a reverberant room also show that the proposed method is robust against reverberation.
[641] PhiNet: Speaker Verification with Phonetic Interpretability
Yi Ma, Shuai Wang, Tianchi Liu, Haizhou Li
Main category: eess.AS
TL;DR: PhiNet is an interpretable speaker verification network that provides phonetic-level explanations for its decisions, bridging the gap between black-box ASV systems and forensic speaker comparison.
Details
Motivation: Current ASV systems lack transparency needed for high-accountability applications. Human forensic experts use phonetic evidence for speaker comparison, so the authors aim to create a speaker verification system with similar interpretability.Method: Proposed PhiNet, a speaker verification network with phonetic interpretability that leverages phonetic evidence in decision-making. Provides both local (phonetic-level) and global interpretability through detailed comparisons of speaker-specific features.
Result: PhiNet achieves performance comparable to traditional black-box ASV models while offering meaningful interpretable explanations. Evaluated on VoxCeleb, SITW, and LibriSpeech datasets with both qualitative and quantitative assessments of interpretability.
Conclusion: PhiNet successfully bridges the gap between ASV and forensic analysis by providing transparent, phonetic-based explanations for speaker verification decisions, enabling both user inspection and developer debugging.
Abstract: Despite remarkable progress, automatic speaker verification (ASV) systems typically lack the transparency required for high-accountability applications. Motivated by how human experts perform forensic speaker comparison (FSC), we propose a speaker verification network with phonetic interpretability, PhiNet, designed to enhance both local and global interpretability by leveraging phonetic evidence in decision-making. For users, PhiNet provides detailed phonetic-level comparisons that enable manual inspection of speaker-specific features and facilitate a more critical evaluation of verification outcomes. For developers, it offers explicit reasoning behind verification decisions, simplifying error tracing and informing hyperparameter selection. In our experiments, we demonstrate PhiNet’s interpretability with practical examples, including its application in analyzing the impact of different hyperparameters. We conduct both qualitative and quantitative evaluations of the proposed interpretability methods and assess speaker verification performance across multiple benchmark datasets, including VoxCeleb, SITW, and LibriSpeech. Results show that PhiNet achieves performance comparable to traditional black-box ASV models while offering meaningful, interpretable explanations for its decisions, bridging the gap between ASV and forensic analysis.
[642] T5Gemma-TTS Technical Report
Chihiro Arata, Kiyoshi Kurihara
Main category: eess.AS
TL;DR: T5Gemma-TTS is an encoder-decoder codec language model for multilingual text-to-speech that maintains persistent text conditioning through cross-attention and introduces progress-monitoring position embeddings for better duration control.
Details
Motivation: Decoder-only architectures in neural codec language models treat input text as a prefix that competes with audio sequences for positional capacity, weakening text conditioning over long utterances. The authors aim to create a model that maintains strong text conditioning throughout synthesis.Method: Built on T5Gemma pretrained encoder-decoder backbone (4B parameters) with bidirectional text representations routed through cross-attention at every decoder layer. Introduces Progress-Monitoring Rotary Position Embedding (PM-RoPE) in all 26 cross-attention layers to inject normalized progress signals for duration control. Trained on 170,000 hours of multilingual speech in English, Chinese, and Japanese.
Result: Achieves statistically significant speaker-similarity gain on Japanese over XTTSv2 (0.677 vs 0.622) and highest numerical Korean speaker similarity (0.747) despite no Korean training data. Attains lowest numerical Japanese character error rate among baselines (0.126). PM-RoPE ablation shows critical importance: disabling it causes near-complete synthesis failure (CER degrades from 0.129 to 0.982).
Conclusion: Encoder-decoder architecture with persistent cross-attention and progress-monitoring position embeddings enables strong text conditioning and duration control for multilingual TTS, demonstrating zero-shot voice cloning capabilities across languages.
Abstract: Autoregressive neural codec language models have shown strong zero-shot voice cloning ability, but decoder-only architectures treat input text as a prefix that competes with the growing audio sequence for positional capacity, weakening text conditioning over long utterances. We present T5Gemma-TTS, an encoder-decoder codec language model that maintains persistent text conditioning by routing bidirectional text representations through cross-attention at every decoder layer. Built on the T5Gemma pretrained encoder-decoder backbone (2B encoder + 2B decoder; 4B parameters), it inherits rich linguistic knowledge without phoneme conversion and processes text directly at the subword level. To improve duration control, we introduce Progress-Monitoring Rotary Position Embedding (PM-RoPE) in all 26 cross-attention layers, injecting normalized progress signals that help the decoder track target speech length. Trained on 170,000 hours of multilingual speech in English, Chinese, and Japanese, T5Gemma-TTS achieves a statistically significant speaker-similarity gain on Japanese over XTTSv2 (0.677 vs. 0.622; non-overlapping 95% confidence intervals) and the highest numerical Korean speaker similarity (0.747) despite Korean not being included in training, although this margin over XTTSv2 (0.741) is not statistically conclusive. It also attains the lowest numerical Japanese character error rate among five baselines (0.126), though this ranking should be interpreted cautiously because of partial confidence-interval overlap with Kokoro. English results on LibriSpeech should be viewed as an upper-bound estimate because LibriHeavy is a superset of LibriSpeech. Using the same checkpoint, disabling PM-RoPE at inference causes near-complete synthesis failure: CER degrades from 0.129 to 0.982 and duration accuracy drops from 79% to 46%. Code and weights are available at https://github.com/Aratako/T5Gemma-TTS.
[643] GAP-URGENet: A Generative-Predictive Fusion Framework for Universal Speech Enhancement
Xiaobin Rong, Yushi Wang, Zheng Wang, Jing Lu
Main category: eess.AS
TL;DR: GAP-URGENet is a generative-predictive fusion framework for speech enhancement that combines self-supervised representation restoration with spectrogram enhancement, achieving top performance in the ICASSP 2026 URGENT Challenge.
Details
Motivation: To develop a robust speech enhancement system for the ICASSP 2026 URGENT Challenge that combines complementary approaches - generative and predictive methods - to improve perceptual quality and robustness in speech restoration tasks.Method: The framework integrates two branches: 1) a generative branch performing full-stack speech restoration in self-supervised representation domain with waveform reconstruction via neural vocoder, and 2) a predictive branch performing spectrogram-domain enhancement. Outputs are fused by a post-processing module with bandwidth extension to generate enhanced waveform at 48kHz.
Result: Achieved top performance in the blind-test phase and ranked 1st in objective evaluation of the ICASSP 2026 URGENT Challenge Track 1, demonstrating improved robustness and perceptual quality.
Conclusion: The generative-predictive fusion framework effectively combines complementary speech enhancement approaches, resulting in state-of-the-art performance for speech restoration tasks in challenging conditions.
Abstract: We introduce GAP-URGENet, a generative-predictive fusion framework developed for Track 1 of the ICASSP 2026 URGENT Challenge. The system integrates a generative branch, which performs full-stack speech restoration in a self-supervised representation domain and reconstructs the waveform via a neural vocoder, along with a predictive branch that performs spectrogram-domain enhancement, providing complementary cues. Outputs from both branches are fused by a post-processing module, which also performs bandwidth extension to generate the enhanced waveform at 48 kHz, later downsampled to the original sampling rate. This generative-predictive fusion improves robustness and perceptual quality, achieving top performance in the blind-test phase and ranking 1st in the objective evaluation. Audio examples are available at https://xiaobin-rong.github.io/gap-urgenet_demo.
[644] RIFT: Entropy-Optimised Fractional Wavelet Constellations for Ideal Time-Frequency Estimation
James M. Cozens, Simon J. Godsill
Main category: eess.AS
TL;DR: RIFT method for high-resolution time-frequency analysis using fractional wavelet transforms and deconvolution to achieve cross-term-free representations comparable to Wigner-Ville Distribution.
Details
Motivation: Need for high-resolution time-frequency analysis of complex nonstationary signals without cross-term interference that plagues traditional methods like WVD.Method: Uses constellation of Continuous Fractional Wavelet Transforms aligned to local curvatures, combined via entropy-based sparsity, then applies positivity-constrained Lucy-Richardson deconvolution with total-variation regularization.
Result: Achieves auto-term resolution comparable to WVD while effectively suppressing cross-terms, with superior time-frequency precision on synthetic and real-world signals.
Conclusion: RIFT enables high-resolution cross-term-free time-frequency analysis with potential applications in speech, music, and other signal processing domains.
Abstract: We introduce a new method for estimating the Ideal Time-Frequency Representation (ITFR) of complex nonstationary signals. The Reconstructive Ideal Fractional Transform (RIFT) computes a constellation of Continuous Fractional Wavelet Transforms (CFWTs) aligned to different local time-frequency curvatures. This constellation is combined into a single optimised time-frequency energy representation via a localised entropy-based sparsity measure, designed to resolve auto-terms and attenuate cross-terms. Finally, a positivity-constrained Lucy-Richardson deconvolution with total-variation regularisation is applied to estimate the ITFR, achieving auto-term resolution comparable to that of the Wigner-Ville Distribution (WVD), yielding the high-resolution RIFT representation. The required Cohen’s class convolutional kernels are fully derived in the paper for the chosen CFWT constellations. Additionally, the optimisation yields an Instantaneous Phase Direction (IPD) field, which allows the localised curvature in speech or music extracts to be visualised and utilised within a Kalman tracking scheme, enabling the extraction of signal component trajectories and the construction of the Spline-RIFT variant. Evaluation on synthetic and real-world signals demonstrates the algorithm’s ability to effectively suppress cross-terms and achieve superior time-frequency precision relative to competing methods. This advance holds significant potential for a wide range of applications requiring high-resolution cross-term-free time-frequency analysis.
[645] Inter-Speaker Relative Cues for Two-Stage Text-Guided Target Speech Extraction
Wang Dai, Archontis Politis, Tuomas Virtanen
Main category: eess.AS
TL;DR: Relative cues outperform independent cues for text-based target speech extraction, with a two-stage framework (separation then classification) beating single-stage methods and even matching enrollment-audio systems.
Details
Motivation: Current text-based target speech extraction methods often use absolute categorical representations that lose fine-grained distinctions in continuous-valued speaker attributes. The paper aims to show that relative cues preserve these distinctions better and can improve extraction performance.Method: Two-stage framework: 1) speech separation model generates candidate sources, 2) text-guided classifier selects target speaker based on embedding similarity. Two classification models trained to compare relative vs independent cues for continuous-valued attributes.
Result: Relative cues achieve higher classification accuracy and better TSE performance than independent cues. Two-stage framework outperforms single-stage text-conditioned methods on signal-level and perceptual metrics. Several relative cues (language, loudness, distance, etc.) can surpass enrollment-audio-based TSE systems.
Conclusion: Relative cues are superior to independent cues for text-based target speech extraction, preserving fine-grained distinctions in continuous attributes. The two-stage framework is effective, and different relative cues have varying discriminative power.
Abstract: This paper investigates the use of relative cues for text-based target speech extraction (TSE). We first provide a theoretical justification for relative cues from the perspectives of human perception and label quantization, showing that relative cues preserve fine-grained distinctions that are often lost in absolute categorical representations for continuous-valued attributes. Building on this analysis, we propose a two-stage TSE framework in which a speech separation model first generates candidate sources, followed by a text-guided classifier that selects the target speaker based on embedding similarity. Within this framework, we train two separate classification models to evaluate the advantages of relative cues over independent cues in case of continuous-valued attributes, considering both classification accuracy and TSE performance. Experimental results demonstrate that (i) relative cues achieve higher overall classification accuracy and improved TSE performance compared with independent cues; (ii) the proposed two-stage framework substantially outperforms single-stage text-conditioned extraction methods on both signal-level and objective perceptual metrics; and (iii) several relative cues, including language, loudness, distance, temporal order, speaking duration, random cues, and all cues, can even surpass the performance of an enrollment-audio-based TSE system. Further analysis reveals notable differences in discriminative power across cue types, providing insights into the effectiveness of different relative cues for TSE.
eess.IV
[646] OkanNet: A Lightweight Deep Learning Architecture for Classification of Brain Tumor from MRI Images
Okan Uçar, Murat Kurt
Main category: eess.IV
TL;DR: Comparative study of two deep learning approaches for brain tumor classification from MRI images: a custom lightweight CNN (OkanNet) vs. transfer learning with ResNet-50.
Details
Motivation: Manual MRI analysis is time-consuming and error-prone; automated deep learning approaches can improve efficiency and accuracy in brain tumor diagnosis.Method: Two approaches: 1) Custom CNN architecture “OkanNet” designed for low computational cost; 2) Transfer learning using ResNet-50 pre-trained on ImageNet. Both tested on 7,023 MRI images for classifying Glioma, Meningioma, Pituitary, and No Tumor.
Result: ResNet-50 achieved superior performance (96.49% accuracy, 0.963 precision) while OkanNet reached 88.10% accuracy but was 3.2× faster (311 seconds training time).
Conclusion: Demonstrates trade-off between model depth and computational efficiency; ResNet-50 better for accuracy, OkanNet suitable for resource-constrained mobile/embedded systems.
Abstract: Medical imaging techniques, especially Magnetic Resonance Imaging (MRI), are accepted as the gold standard in the diagnosis and treatment planning of neurological diseases. However, the manual analysis of MRI images is a time-consuming process for radiologists and is prone to human error due to fatigue. In this study, two different Deep Learning approaches were developed and analyzed comparatively for the automatic detection and classification of brain tumors (Glioma, Meningioma, Pituitary, and No Tumor). In the first approach, a custom Convolutional Neural Network (CNN) architecture named “OkanNet”, which has a low computational cost and fast training time, was designed from scratch. In the second approach, the Transfer Learning method was applied using the 50-layer ResNet-50 [1] architecture, pre-trained on the ImageNet dataset. In experiments conducted on an extended dataset compiled by Masoud Nickparvar containing a total of $7,023$ MRI images, the Transfer Learning-based ResNet-50 model exhibited superior classification performance, achieving $96.49%$ Accuracy and $0.963$ Precision. In contrast, the custom OkanNet architecture reached an accuracy rate of $88.10%$; however, it proved to be a strong alternative for mobile and embedded systems with limited computational power by yielding results approximately $3.2$ times faster ($311$ seconds) than ResNet-50 in terms of training time. This study demonstrates the trade-off between model depth and computational efficiency in medical image analysis through experimental data.
[647] DenOiS: Dual-Domain Denoising of Observation and Solution in Ultrasound Image Reconstruction
Can Deniz Bezek, Orcun Goksel
Main category: eess.IV
TL;DR: DenOiS is a framework for medical image reconstruction that denoises both input observations and solutions, using observation refinement to correct measurements and diffusion-based PnP reconstruction robust to missing data.
Details
Motivation: Medical imaging faces challenges with inexact imaging models, inaccurate/incomplete measurements, and noise. Existing methods (analytical and deep learning) are sensitive to measurement quality and model accuracy. There's a need for approaches that can generalize from simulations to real data while handling noisy observations and inexact models.Method: DenOiS framework with two components: 1) Observation refinement strategy that corrects degraded measurements while compensating for imaging model simplifications, 2) Diffusion-based plug-and-play (PnP) reconstruction approach robust to missing measurements. Enables training only on simulations but generalizing to real data.
Result: Demonstrated for speed-of-sound imaging in quantitative ultrasound image reconstruction. Achieves high-fidelity image reconstruction with noisy observations and inexact imaging models. Shows generalization capability from simulation training to real data.
Conclusion: DenOiS provides a robust framework for medical image reconstruction that handles both observation noise and solution refinement, enabling effective generalization from simulated training to real-world applications with imperfect measurements and models.
Abstract: Medical imaging aims to recover underlying tissue properties, using inexact (simplified/linearized) imaging models and often from inaccurate and incomplete measurements. Analytical reconstruction methods rely on hand-crafted regularization, sensitive to noise assumptions and parameter tuning. Among deep learning alternatives, plug-and-play (PnP) approaches learn regularization while incorporating imaging physics during inference, outperforming purely data-driven methods. The performance of all these approaches, however, still strongly depends on measurement quality and imaging model accuracy. In this work, we propose DenOiS, a framework that denoises both input observations and resulting solution in their respective domains. It consists of an observation refinement strategy that corrects degraded measurements while compensating for imaging model simplifications, and a diffusion-based PnP reconstruction approach that remains robust under missing measurements. DenOiS enables generalization to real data from training only in simulations, resulting in high-fidelity image reconstruction with noisy observations and inexact imaging models. We demonstrate this for speed-of-sound imaging as a challenging setting of quantitative ultrasound image reconstruction.
[648] Learning to Translate Noise for Robust Image Denoising
Inju Ha, Donghun Ryou, Seonguk Seo, Bohyung Han
Main category: eess.IV
TL;DR: A noise translation framework that converts real-world noise to Gaussian noise for better denoising generalization, using a translation network and pretrained Gaussian denoiser.
Details
Motivation: Deep learning image denoising methods often fail to generalize to out-of-distribution real-world noise, requiring more robust approaches that can handle unknown noise patterns.Method: Proposes a noise translation framework with two components: 1) a noise translation network that converts complex real-world noise into Gaussian noise, and 2) a pretrained image denoising network that effectively removes Gaussian noise. Uses mathematically motivated loss functions and architectures based on Gaussian noise properties.
Result: The method substantially improves robustness and generalizability, outperforming state-of-the-art methods across diverse benchmarks. Visual results and source code are available.
Conclusion: Translating real-world noise to Gaussian noise enables more robust denoising by leveraging well-understood Gaussian noise properties and pretrained denoisers, addressing generalization challenges in real-world applications.
Abstract: Deep learning-based image denoising techniques often struggle with poor generalization performance to out-of-distribution real-world noise. To tackle this challenge, we propose a novel noise translation framework that performs denoising on an image with translated noise rather than directly denoising an original noisy image. Specifically, our approach translates complex, unknown real-world noise into Gaussian noise, which is spatially uncorrelated and independent of image content, through a noise translation network. The translated noisy images are then processed by an image denoising network pretrained to effectively remove Gaussian noise, enabling robust and consistent denoising performance. We also design well-motivated loss functions and architectures for the noise translation network by leveraging the mathematical properties of Gaussian noise. Experimental results demonstrate that the proposed method substantially improves robustness and generalizability, outperforming state-of-the-art methods across diverse benchmarks. Visualized denoising results and the source code are available on our project page.
[649] DeDelayed: Deleting Remote Inference Delay via On-Device Correction
Dan Jacobellis, Mateen Ulhaq, Fabien Racapé, Hyomin Choi, Neeraja J. Yadwadkar
Main category: eess.IV
TL;DR: Dedelayed is a real-time video inference system that splits computation between local and remote models to overcome latency limitations, using remote predictions on delayed frames and local processing of current frames with joint optimization for bandwidth constraints.
Details
Motivation: Current video understanding models are too resource-intensive for edge devices, while cloud offloading introduces latency issues. There's a need for real-time video analysis systems that can balance computational constraints with accuracy requirements for applications like robotics and autonomous driving.Method: Proposes a system with local model processing current frames and remote model processing delayed frames. Remote model predicts future frames which local model incorporates. Joint optimization with autoencoder limits transmission bitrate for available downlink channels.
Result: On BDD100k driving dataset with 100ms round trip delay, Dedelayed improves performance by 6.4 mIoU over fully local inference and 9.8 mIoU over remote inference - equivalent to using a model ten times larger.
Conclusion: Dedelayed effectively addresses the latency-accuracy tradeoff in real-time video analysis by intelligently splitting computation between edge and cloud, achieving significant performance gains while respecting communication constraints.
Abstract: Video comprises the vast majority of bits that are generated daily, and is the primary signal driving current innovations in robotics, remote sensing, and wearable technology. Yet, the most powerful video understanding models are too expensive for the resource-constrained platforms used in these applications. One approach is to offload inference to the cloud; this gives access to GPUs capable of processing high-resolution videos in real time. But even with reliable, high-bandwidth communication channels, the combined latency of video encoding, model inference, and round-trip communication prohibits use for certain real-time applications. The alternative is to use fully local inference; but this places extreme constraints on computational and power costs, requiring smaller models and lower resolution, leading to degraded accuracy. To address these challenges, we propose Dedelayed, a real-time inference system that divides computation between a remote model operating on delayed video frames and a local model with access to the current frame. The remote model is trained to make predictions on anticipated future frames, which the local model incorporates into its prediction for the current frame. The local and remote models are jointly optimized with an autoencoder that limits the transmission bitrate required by the available downlink communication channel. We evaluate Dedelayed on the task of real-time streaming video segmentation using the BDD100k driving dataset. For a round trip delay of 100 ms, Dedelayed improves performance by 6.4 mIoU compared to fully local inference and 9.8 mIoU compared to remote inference – an equivalent improvement to using a model ten times larger. We release our training code, pretrained models, and python library at https://github.com/InterDigitalInc/dedelayed .
[650] SkinGenBench: Generative Model and Preprocessing Effects for Synthetic Dermoscopic Augmentation in Melanoma Diagnosis
N. A. Adarsh Pritam, Jeba Shiney O, Sanyam Jain
Main category: eess.IV
TL;DR: SkinGenBench evaluates generative models (StyleGAN2-ADA vs DDPMs) for synthetic dermoscopic image augmentation, finding StyleGAN2-ADA produces better quality images and that generative architecture choice matters more than preprocessing complexity for melanoma diagnosis.
Details
Motivation: To systematically investigate how preprocessing complexity interacts with generative model choice for synthetic dermoscopic image augmentation and downstream melanoma diagnosis, addressing the need for effective data augmentation in medical imaging.Method: Uses 14,116 dermoscopic images across 5 lesion classes, evaluates StyleGAN2-ADA and DDPMs under basic geometric augmentation vs advanced artifact removal pipelines, assesses synthetic images using perceptual/distributional metrics (FID, KID, IS), feature space analysis, and impact on diagnostic performance across 5 downstream classifiers.
Result: StyleGAN2-ADA consistently produced better synthetic images (FID ≈65.5, KID ≈0.05) than diffusion models. Generative architecture choice had stronger influence than preprocessing complexity. Synthetic data augmentation improved melanoma detection by 8-15% absolute gains in F1-score, with ViT-B/16 achieving F1 ≈0.88 and ROC-AUC ≈0.98.
Conclusion: Generative architecture choice is more important than preprocessing complexity for synthetic dermoscopic image quality and diagnostic utility. StyleGAN2-ADA outperforms diffusion models, and synthetic augmentation substantially improves melanoma detection despite limited benefits from advanced preprocessing.
Abstract: This work introduces SkinGenBench, a systematic biomedical imaging benchmark that investigates how preprocessing complexity interacts with generative model choice for synthetic dermoscopic image augmentation and downstream melanoma diagnosis. Using a curated dataset of $14,116$ dermoscopic images from HAM10000 and MILK10K across five lesion classes, we evaluate the two representative generative paradigms: StyleGAN2-ADA and Denoising Diffusion Probabilistic Models (DDPMs) under basic geometric augmentation and advanced artifact removal pipelines. Synthetic melanoma images are assessed using established perceptual and distributional metrics (FID, KID, IS), feature space analysis, and their impact on diagnostic performance across five downstream classifiers. Experimental results demonstrate that generative architecture choice has a stronger influence on both image fidelity and diagnostic utility than preprocessing complexity. StyleGAN2-ADA consistently produced synthetic images more closely aligned with real data distributions, achieving the lowest FID ($\approx 65.5$) and KID ($\approx 0.05$), while diffusion models generated higher variance samples at the cost of reduced perceptual fidelity and class anchoring. Advanced artifact removal yielded only marginal improvements in generative metrics and provided limited downstream diagnostic gains, suggesting possible suppression of clinically relevant texture cues. In contrast, synthetic data augmentation substantially improved melanoma detection with $8$-$15$% absolute gains in melanoma F1-score, and ViT-B/16 achieving F1 $\approx 0.88$ and ROC-AUC $\approx 0.98$, representing an improvement of approximately $14%$ over non-augmented baselines. Our code can be found at https://github.com/adarsh-crafts/SkinGenBench
[651] FluoCLIP: Stain-Aware Focus Quality Assessment in Fluorescence Microscopy
Hyejin Park, Jiwon Yoon, Sumin Park, Suree Kim, Sinae Jang, Eunsoo Lee, Dongmin Kang, Dongbo Min
Main category: eess.IV
TL;DR: FluoCLIP: A vision-language framework for stain-aware focus quality assessment in fluorescence microscopy that models stain-dependent focus behavior using stain-conditioned ordinal reasoning.
Details
Motivation: Existing focus quality assessment methods treat focus as stain-agnostic, assuming shared global ordering, but fluorescence microscopy exhibits heterogeneous focus behavior across different stains due to stain-dependent optical variations.Method: Two-stage vision-language framework (FluoCLIP) that grounds stain semantics and enables stain-conditioned ordinal reasoning for focus prediction, decoupling stain representation from ordinal structure. Uses FluoMix dataset spanning multiple tissues, stains, and focus levels.
Result: FluoCLIP consistently outperforms conventional FQA methods and recent vision-language baselines, demonstrating strong generalization across diverse fluorescence microscopy conditions.
Conclusion: Stain-aware formulation of focus quality assessment is necessary for fluorescence microscopy, and the proposed FluoCLIP framework effectively models stain-dependent focus behavior through vision-language grounding and stain-conditioned reasoning.
Abstract: Accurate focus quality assessment (FQA) in fluorescence microscopy is challenging due to stain-dependent optical variations that induce heterogeneous focus behavior across images. Existing methods, however, treat focus quality as a stain-agnostic problem, assuming a shared global ordering. We formulate stain-aware FQA for fluorescence microscopy, showing that focus-rank relationships vary substantially across stains due to stain-dependent imaging characteristics and invalidate this assumption. To support this formulation, we introduce FluoMix, the first dataset for stain-aware FQA spanning multiple tissues, fluorescent stains, and focus levels. We further propose FluoCLIP, a two-stage vision-language framework that grounds stain semantics and enables stain-conditioned ordinal reasoning for focus prediction, effectively decoupling stain representation from ordinal structure. By explicitly modeling stain-dependent focus behavior, FluoCLIP consistently outperforms both conventional FQA methods and recent vision-language baselines, demonstrating strong generalization across diverse fluorescence microscopy conditions. Code and dataset are publicly available at https://fluoclip.github.io/.