Daily arXiv Papers - 2026-02-11

AI-enhanced summaries of research papers from arXiv

Editor’s Picks

Top papers matching your research interests in multimodal LLMs, audio and vision understanding/generation.

[1] DSFlow: Dual Supervision and Step-Aware Architecture for One-Step Flow Matching Speech Synthesis

Bin Lin, Peng Yang, Chao Yan, Xiaochen Liu, Wei Wang, Boyong Wu, Pengfei Tan, Xuerui Yang

Main category: cs.SD

TL;DR: DSFlow: A modular distillation framework for efficient few-step and one-step text-to-speech synthesis that addresses process variance and parameter inefficiency in flow-matching models.

Motivation: Flow-matching models produce high-quality TTS but have high computational costs due to iterative sampling. Existing distillation methods suffer from process variance from endpoint error accumulation and inefficient parameter usage when adapting continuous-time architectures to discrete, fixed-step generation.

Method: DSFlow reformulates generation as discrete prediction, uses dual supervision (endpoint matching + deterministic mean-velocity alignment) for training stability, and replaces continuous-time timestep conditioning with lightweight step-aware tokens for parameter efficiency.
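
To make the two supervision signals concrete, a distillation loss of the kind described might look as follows; this is our illustrative reading of the summary, not the paper's exact formulation, and the weighting $\lambda$ and teacher targets are assumptions:

```latex
\mathcal{L}_{\mathrm{DSFlow}}
  = \underbrace{\bigl\lVert \hat{x}_1^{\theta}(x_{t_k}, k) - x_1^{\mathrm{teacher}} \bigr\rVert^2}_{\text{endpoint matching}}
  + \lambda \underbrace{\bigl\lVert \bar{v}_{\theta}(x_{t_k}, k) - \bar{v}^{\mathrm{teacher}}(x_{t_k}, k) \bigr\rVert^2}_{\text{mean-velocity alignment}}
```

Here $x_{t_k}$ is the noisy state at discrete step $k$, the step index $k$ enters through a lightweight step-aware token rather than continuous-time conditioning, $\hat{x}_1^{\theta} = x_{t_k} + (1 - t_k)\,\bar{v}_{\theta}$ is the student's predicted endpoint, and $\bar{v}^{\mathrm{teacher}}$ is the teacher's deterministically integrated mean velocity over the step.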

Result: Extensive experiments show DSFlow outperforms standard distillation approaches, achieving strong few-step and one-step synthesis quality while reducing model parameters and inference cost across diverse flow-based TTS architectures.

Conclusion: DSFlow provides an effective modular distillation framework that addresses key challenges in efficient TTS synthesis, enabling high-quality generation with reduced computational requirements.

Abstract: Flow-matching models have enabled high-quality text-to-speech synthesis, but their iterative sampling process during inference incurs substantial computational cost. Although distillation is widely used to reduce the number of inference steps, existing methods often suffer from process variance due to endpoint error accumulation. Moreover, directly reusing continuous-time architectures for discrete, fixed-step generation introduces structural parameter inefficiencies. To address these challenges, we introduce DSFlow, a modular distillation framework for few-step and one-step synthesis. DSFlow reformulates generation as a discrete prediction task and explicitly adapts the student model to the target inference regime. It improves training stability through a dual supervision strategy that combines endpoint matching with deterministic mean-velocity alignment, enforcing consistent generation trajectories across inference steps. In addition, DSFlow improves parameter efficiency by replacing continuous-time timestep conditioning with lightweight step-aware tokens, aligning model capacity with the significantly reduced timestep space of the discrete task. Extensive experiments across diverse flow-based text-to-speech architectures demonstrate that DSFlow consistently outperforms standard distillation approaches, achieving strong few-step and one-step synthesis quality while reducing model parameters and inference cost.

Relevance: 9/10

[2] NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control

Yufan Wen, Zhaocheng Liu, YeGuo Hua, Ziyi Guo, Lihua Zhang, Chun Yuan, Jian Wu

Main category: cs.SD

TL;DR: NarraScore: A hierarchical framework for synthesizing coherent soundtracks for long-form videos using emotion as narrative logic compression, leveraging frozen VLMs as affective sensors and dual-branch injection for global-local balance.

Motivation: Current long-form video soundtrack synthesis faces three critical challenges: computational scalability, temporal coherence, and semantic blindness to evolving narrative logic. Existing methods struggle to maintain narrative alignment throughout extended video sequences.

Method: Proposes NarraScore framework that uses emotion as high-density compression of narrative logic. Repurposes frozen Vision-Language Models (VLMs) as continuous affective sensors to distill visual streams into Valence-Arousal trajectories. Employs Dual-Branch Injection: Global Semantic Anchor for stylistic stability and Token-Level Affective Adapter for local tension modulation via element-wise residual injection.
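
As a minimal sketch of the "element-wise residual injection" idea (our illustration; module names, shapes, and the MLP design are assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class AffectiveAdapter(nn.Module):
    """Token-level adapter: maps a Valence-Arousal (VA) trajectory to a
    residual that is added element-wise to the generator's hidden states."""
    def __init__(self, hidden_dim: int, va_dim: int = 2):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(va_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, hidden: torch.Tensor, va: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, D) music-generator states; va: (B, T, 2) VA trajectory
        return hidden + self.proj(va)  # element-wise residual injection
```

The appeal of this design, as the abstract notes, is that it avoids dense attention and architectural cloning, keeping the trainable surface small when data is scarce.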

Result: Achieves state-of-the-art consistency and narrative alignment with negligible computational overhead. Demonstrates effective mitigation of overfitting risks associated with data scarcity. Establishes fully autonomous paradigm for long-video soundtrack generation.

Conclusion: NarraScore successfully bridges critical gaps in long-form video soundtrack synthesis by leveraging emotion as narrative logic proxy, using frozen VLMs as affective sensors, and implementing efficient dual-branch injection strategy for scalable, coherent audio generation.

Abstract: Synthesizing coherent soundtracks for long-form videos remains a formidable challenge, currently stalled by three critical impediments: computational scalability, temporal coherence, and, most critically, a pervasive semantic blindness to evolving narrative logic. To bridge these gaps, we propose NarraScore, a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. Uniquely, we repurpose frozen Vision-Language Models (VLMs) as continuous affective sensors, distilling high-dimensional visual streams into dense, narrative-aware Valence-Arousal trajectories. Mechanistically, NarraScore employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism: a Global Semantic Anchor ensures stylistic stability, while a surgical Token-Level Affective Adapter modulates local tension via direct element-wise residual injection. This minimalist design bypasses the bottlenecks of dense attention and architectural cloning, effectively mitigating the overfitting risks associated with data scarcity. Experiments demonstrate that NarraScore achieves state-of-the-art consistency and narrative alignment with negligible computational overhead, establishing a fully autonomous paradigm for long-video soundtrack generation.

Relevance: 9/10

[3] Covo-Audio Technical Report

Wenfu Wang, Chenxing Li, Liqiang Zhang, Yiyang Zhao, Yuxiang Zou, Hanzhao Li, Mingyu Cui, Hao Zhang, Kun Wei, Le Xu, Zikang Huang, Jiajun Xu, Jiliang Hu, Xiang He, Zeyu Xie, Jiawen Kang, Youjun Chen, Meng Yu, Dong Yu, Rilin Chen, Linlin Di, Shulin Feng, Na Hu, Yang Liu, Bang Wang, Shan Yang

Main category: cs.SD

TL;DR: Covo-Audio is a 7B-parameter end-to-end LALM that processes continuous audio inputs and generates audio outputs in a unified architecture, achieving SOTA performance across speech-text modeling, spoken dialogue, audio understanding, and full-duplex voice interaction tasks.

Motivation: To develop a unified audio-language model that can directly process and generate audio in end-to-end fashion, addressing the need for models that integrate sophisticated audio intelligence with high-level semantic reasoning at a practical scale.

Method: Uses large-scale curated pretraining and targeted post-training on a 7B-parameter architecture; introduces intelligence-speaker decoupling strategy to separate dialogue intelligence from voice rendering for flexible voice customization with minimal TTS data.

Result: Achieves state-of-the-art or competitive performance across multiple benchmarks for speech-text comprehension, semantic reasoning, spoken dialogue, and audio understanding; demonstrates strong conversational abilities and full-duplex interaction capabilities.

Conclusion: 7B-scale models show strong potential for integrating audio intelligence with semantic reasoning, suggesting a scalable path toward more capable and versatile LALMs for real-world conversational applications.

Abstract: In this work, we present Covo-Audio, a 7B-parameter end-to-end LALM that directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Through large-scale curated pretraining and targeted post-training, Covo-Audio achieves state-of-the-art or competitive performance among models of comparable scale across a broad spectrum of tasks, including speech-text modeling, spoken dialogue, speech understanding, audio understanding, and full-duplex voice interaction. Extensive evaluations demonstrate that the pretrained foundation model exhibits strong speech-text comprehension and semantic reasoning capabilities on multiple benchmarks, outperforming representative open-source models of comparable scale. Furthermore, Covo-Audio-Chat, the dialogue-oriented variant, demonstrates strong spoken conversational abilities, including understanding, contextual reasoning, instruction following, and generating contextually appropriate and empathetic responses, validating its applicability to real-world conversational assistant scenarios. Covo-Audio-Chat-FD, the evolved full-duplex model, achieves substantially superior performance on both spoken dialogue capabilities and full-duplex interaction behaviors, demonstrating its competence in practical robustness. To mitigate the high cost of deploying end-to-end LALMs for natural conversational systems, we propose an intelligence-speaker decoupling strategy that separates dialogue intelligence from voice rendering, enabling flexible voice customization with minimal text-to-speech (TTS) data while preserving dialogue performance. Overall, our results highlight the strong potential of 7B-scale models to integrate sophisticated audio intelligence with high-level semantic reasoning, and suggest a scalable path toward more capable and versatile LALMs.

Relevance: 9/10


Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

Table of Contents

cs.CL

[1] Overview of PAN 2026: Voight-Kampff Generative AI Detection, Text Watermarking, Multi-Author Writing Style Analysis, Generative Plagiarism Detection, and Reasoning Trajectory Detection

Janek Bevendorff, Maik Fröbe, André Greiner-Petter, Andreas Jakoby, Maximilian Mayerl, Preslav Nakov, Henry Plutz, Martin Potthast, Benno Stein, Minh Ngoc Ta, Yuxia Wang, Eva Zangerle

Main category: cs.CL

TL;DR: PAN 2026 workshop focuses on computational stylometry and text forensics with five tasks: AI detection, text watermarking, multi-author analysis, plagiarism detection, and reasoning trajectory detection.

Motivation: The workshop aims to advance computational stylometry and text forensics through objective, reproducible evaluation of various text analysis tasks, particularly focusing on challenges posed by generative AI and authorship analysis.

Method: PAN runs five specific tasks with software submissions as Docker containers via the TIRA experimentation platform, enabling reproducible evaluation of computational methods for text analysis and forensics.

Result: The workshop has received over 1,100 submissions since 2012 through this reproducible evaluation framework, demonstrating sustained community engagement in text forensics research.

Conclusion: PAN continues to provide a standardized evaluation platform for computational stylometry and text forensics, with increasing focus on challenges related to generative AI and authorship analysis.

Abstract: The goal of the PAN workshop is to advance computational stylometry and text forensics via objective and reproducible evaluation. In 2026, we run the following five tasks: (1) Voight-Kampff Generative AI Detection, particularly in mixed and obfuscated authorship scenarios, (2) Text Watermarking, a new task that aims to find new and benchmark the robustness of existing text watermarking schemes, (3) Multi-author Writing Style Analysis, a continued task that aims to find positions of authorship change, (4) Generative Plagiarism Detection, a continued task that targets source retrieval and text alignment between generated text and source documents, and (5) Reasoning Trajectory Detection, a new task that deals with source detection and safety detection of LLM-generated or human-written reasoning trajectories. As in previous years, PAN invites software submissions as easy-to-reproduce Docker containers for most of the tasks. Since PAN 2012, more than 1,100 submissions have been made this way via the TIRA experimentation platform.

[2] Measuring Inclusion in Interaction: Inclusion Analytics for Human-AI Collaborative Learning

Jaeyoon Choi, Nia Nixon

Main category: cs.CL

TL;DR: A framework called “inclusion analytics” for measuring inclusion as a dynamic, interactional process in collaborative problem solving using discourse analysis across three dimensions: participation equity, affective climate, and epistemic equity.

Motivation: Current methods for assessing inclusion in AI and education are limited to coarse sample descriptors or post-hoc self-reports that fail to capture how inclusion is shaped moment by moment in collaborative problem solving.

Method: Introduces inclusion analytics framework with three dimensions: participation equity (who participates), affective climate (relational dynamics), and epistemic equity (idea uptake). Uses scalable, interaction-level measures applied to both simulated conversations and empirical data from human-AI teaming experiments.
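
The summary does not spell out the underlying formulas, but one plausible interaction-level measure of participation equity (purely our assumption, for illustration) is the normalized entropy of speaking turns per team member:

```python
import numpy as np

def participation_equity(turn_counts) -> float:
    """Normalized entropy of turns per member (illustrative, not the
    paper's measure): 1.0 = perfectly even participation, 0.0 = one
    member does all the talking. Assumes at least two members."""
    p = np.asarray(turn_counts, dtype=np.float64)
    p = p / p.sum()
    p = p[p > 0]  # ignore silent members in the entropy sum
    return float(-(p * np.log(p)).sum() / np.log(len(turn_counts)))
```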

Result: The framework can surface patterns of participation, relational dynamics, and idea uptake that remain invisible to aggregate or post-hoc evaluations, demonstrating how inclusion can be made analytically visible at the interaction level.

Conclusion: This work represents an initial step toward process-oriented approaches to measuring inclusion in human-AI collaborative learning environments, moving beyond static assessments to dynamic, interactional analysis.

Abstract: Inclusion, equity, and access are widely valued in AI and education, yet are often assessed through coarse sample descriptors or post-hoc self-reports that miss how inclusion is shaped moment by moment in collaborative problem solving (CPS). In this proof-of-concept paper, we introduce inclusion analytics, a discourse-based framework for examining inclusion as a dynamic, interactional process in CPS. We conceptualize inclusion along three complementary dimensions – participation equity, affective climate, and epistemic equity – and demonstrate how these constructs can be made analytically visible using scalable, interaction-level measures. Using both simulated conversations and empirical data from human-AI teaming experiments, we illustrate how inclusion analytics can surface patterns of participation, relational dynamics, and idea uptake that remain invisible to aggregate or post-hoc evaluations. This work represents an initial step toward process-oriented approaches to measuring inclusion in human-AI collaborative learning environments.

[3] Effective Reasoning Chains Reduce Intrinsic Dimensionality

Archiki Prasad, Mandar Joshi, Kenton Lee, Mohit Bansal, Peter Shaw

Main category: cs.CL

TL;DR: Chain-of-thought reasoning effectiveness can be measured by intrinsic dimensionality - effective reasoning strategies reduce the minimum dimensions needed to achieve target accuracy, indicating better task compression.

Motivation: While CoT reasoning improves language model performance on complex tasks, the mechanisms behind why different strategies facilitate generalization remain poorly understood. Current explanations about test-time computation or structural guidance lack consistent, quantifiable links to generalization.

Method: Use intrinsic dimensionality as a quantitative measure for reasoning chain effectiveness. Keep model architecture fixed while varying task formulation through different reasoning strategies. Intrinsic dimensionality quantifies the minimum number of model dimensions needed to reach a given accuracy threshold on a task.
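
Measuring intrinsic dimensionality is commonly done with the random-subspace recipe of Li et al. (2018): train with parameters restricted to a random d-dimensional subspace and find the smallest d that clears the accuracy threshold. A minimal sketch in that spirit, assuming a hypothetical `train_in_subspace(d)` harness that trains in a random d-dim subspace and returns validation accuracy:

```python
def intrinsic_dimension(train_in_subspace, target_acc,
                        dims=(16, 32, 64, 128, 256, 512, 1024, 2048)):
    """Return the smallest searched subspace dimension d whose
    subspace-trained model reaches `target_acc` on the task."""
    for d in dims:  # dims assumed sorted ascending
        if train_in_subspace(d) >= target_acc:
            return d
    return None  # threshold not reached within the searched range
```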

Result: On GSM8K with Gemma-3 1B and 4B models, effective reasoning strategies consistently reduce intrinsic dimensionality. Strong inverse correlation observed between intrinsic dimensionality of a reasoning strategy and its generalization performance on both in-distribution and out-of-distribution data.

Conclusion: Effective reasoning chains facilitate learning by better compressing tasks using fewer parameters. Intrinsic dimensionality offers a new quantitative metric for analyzing reasoning processes, providing a measurable link between reasoning strategies and generalization performance.

Abstract: Chain-of-thought (CoT) reasoning and its variants have substantially improved the performance of language models on complex reasoning tasks, yet the precise mechanisms by which different strategies facilitate generalization remain poorly understood. While current explanations often point to increased test-time computation or structural guidance, establishing a consistent, quantifiable link between these factors and generalization remains challenging. In this work, we identify intrinsic dimensionality as a quantitative measure for characterizing the effectiveness of reasoning chains. Intrinsic dimensionality quantifies the minimum number of model dimensions needed to reach a given accuracy threshold on a given task. By keeping the model architecture fixed and varying the task formulation through different reasoning strategies, we demonstrate that effective reasoning strategies consistently reduce the intrinsic dimensionality of the task. Validating this on GSM8K with Gemma-3 1B and 4B, we observe a strong inverse correlation between the intrinsic dimensionality of a reasoning strategy and its generalization performance on both in-distribution and out-of-distribution data. Our findings suggest that effective reasoning chains facilitate learning by better compressing the task using fewer parameters, offering a new quantitative metric for analyzing reasoning processes.

[4] Don’t Shoot The Breeze: Topic Continuity Model Using Nonlinear Naive Bayes With Attention

Shu-Ting Pi, Pradeep Bagavan, Yejia Li, Disha, Qun Liu

Main category: cs.CL

TL;DR: A topic continuity model for LLM chatbots that assesses whether responses align with initial conversation topics using Naive Bayes with attention mechanisms and logarithmic nonlinearity.

Motivation: Maintaining topic continuity in LLM chatbots is crucial for good user experience and efficient computational resource utilization, as abrupt topic shifts can degrade both.

Method: Expands NLU models into quantifiable terms using Naive Bayes approach, then introduces attention mechanism and logarithmic nonlinearity to enhance topic continuity detection, creating an interpretable analytical formula.
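
A toy version of the resulting analytical formula, as we read the description (the exact attention parameterization and placement of the logarithmic nonlinearity are assumptions):

```python
import numpy as np

def topic_continuity_score(token_loglikes: np.ndarray,
                           attention: np.ndarray) -> float:
    """Attention-weighted Naive Bayes score, linear in conversation length.
    token_loglikes[i] = log P(token_i | initial topic) - log P(token_i);
    attention[i] = nonnegative weight for token_i (normalized to sum to 1)."""
    return float(np.dot(attention, token_loglikes))
```

A positive score suggests the response stays on the initial topic; because the sum runs over tokens once, there is no context-window limit.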

Result: Model outperforms traditional methods, especially for lengthy and complex conversations, with linear time complexity and no token limits.

Conclusion: The approach enables responsible and interpretable use of LLMs by ensuring topic continuity in chatbot conversations of any length.

Abstract: Utilizing Large Language Models (LLM) as chatbots in diverse business scenarios often presents the challenge of maintaining topic continuity. Abrupt shifts in topics can lead to poor user experiences and inefficient utilization of computational resources. In this paper, we present a topic continuity model aimed at assessing whether a response aligns with the initial conversation topic. Our model is built upon the expansion of the corresponding natural language understanding (NLU) model into quantifiable terms using a Naive Bayes approach. Subsequently, we have introduced an attention mechanism and logarithmic nonlinearity to enhance its capability to capture topic continuity. This approach allows us to convert the NLU model into an interpretable analytical formula. In contrast to many NLU models constrained by token limits, our proposed model can seamlessly handle conversations of any length with linear time complexity. Furthermore, the attention mechanism significantly improves the model’s ability to identify topic continuity in complex conversations. According to our experiments, our model consistently outperforms traditional methods, particularly in handling lengthy and intricate conversations. This unique capability offers us an opportunity to ensure the responsible and interpretable use of LLMs.

[5] Beyond Uniform Credit: Causal Credit Assignment for Policy Optimization

Mykola Khandoga, Rui Yuan, Vinay Kumar Sankarapu

Main category: cs.CL

TL;DR: Counterfactual importance weighting improves policy gradient methods for language model reasoning by assigning higher weight to critical calculation tokens rather than uniform credit to all tokens.

Motivation: Current policy gradient methods for language model reasoning (like GRPO and DAPO) assign uniform credit to all generated tokens, treating filler phrases like "Let me think" the same as critical calculations like "23 + 45 = 68." This is inefficient and doesn't capture the causal importance of different reasoning steps.

Method: Proposes counterfactual importance weighting: mask reasoning spans, measure the drop in answer probability, and upweight tokens accordingly during policy gradient updates. The method requires no auxiliary models or external annotation - importance is estimated directly from the policy model’s own probability shifts.
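
A minimal sketch of the counterfactual scoring step, assuming a Hugging Face-style causal LM; since such models have no mask token, substituting a neutral token id for the span (here `mask_id`, e.g. the pad token) is our assumption:

```python
import torch

@torch.no_grad()
def span_importance(model, prompt_ids, answer_ids, spans, mask_id):
    """For each (start, end) reasoning span in prompt_ids (shape (1, P)),
    mask it and measure the drop in log-probability of the final answer."""
    def answer_logprob(ids):
        logits = model(input_ids=torch.cat([ids, answer_ids], dim=-1)).logits
        # positions P-1 .. P+A-2 predict the A answer tokens
        logp = torch.log_softmax(logits[0, ids.shape[-1] - 1 : -1], dim=-1)
        return logp.gather(-1, answer_ids[0].unsqueeze(-1)).sum().item()

    base = answer_logprob(prompt_ids)
    scores = []
    for start, end in spans:
        masked = prompt_ids.clone()
        masked[0, start:end] = mask_id
        scores.append(max(base - answer_logprob(masked), 0.0))  # importance = prob drop
    return scores
```

These scores would then be normalized and used to reweight per-token terms in the GRPO/DAPO-style policy gradient update.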

Result: Experiments on GSM8K across three models (Qwen and Llama families) show consistent improvements over uniform baselines and faster convergence to equivalent accuracy. Inverting the importance signal hurts performance, confirming genuine causal structure capture. Analysis shows the method correctly prioritizes calculation steps over scaffolding text.

Conclusion: The findings establish counterfactual importance weighting as a foundation for further research in language model reasoning, though not a complete solution. The method effectively identifies and weights critical reasoning steps without external resources.

Abstract: Policy gradient methods for language model reasoning, such as GRPO and DAPO, assign uniform credit to all generated tokens - the filler phrase “Let me think” receives the same gradient update as the critical calculation “23 + 45 = 68.” We propose counterfactual importance weighting: mask reasoning spans, measure the drop in answer probability, and upweight tokens accordingly during policy gradient updates. Our method requires no auxiliary models or external annotation, instead importance is estimated directly from the policy model’s own probability shifts. Experiments on GSM8K across three models spanning the Qwen and Llama families demonstrate consistent improvements over uniform baselines and faster convergence to equivalent accuracy. Inverting the importance signal hurts performance, confirming we capture genuine causal structure rather than noise. Analysis shows the method correctly prioritizes calculation steps over scaffolding text. We view these findings as establishing counterfactual importance weighting as a foundation for further research rather than a complete solution.

[6] FM SO.P: A Progressive Task Mixture Framework with Automatic Evaluation for Cross-Domain SOP Understanding

Siyuan Huang, Ziyu Wang, Chao Pan, Han Zhao

Main category: cs.CL

TL;DR: FM SO.P is a framework for improving language models’ understanding of Standard Operating Procedures (SOPs) through progressive task mixtures and multi-agent evaluation, achieving strong performance with smaller models.

Motivation: Existing language models struggle with SOP understanding and cross-domain generalization because they fail to differentiate between the specific reasoning capabilities SOPs require: terminology precision, sequential ordering, and constraint reasoning.

Method: Two key innovations: 1) Progressive task mixtures that build capabilities in stages across three task types: concept disambiguation for terminology precision, action sequence understanding for procedural correctness, and scenario-aware graph reasoning for conditional logic. 2) An automatic multi-agent evaluation system with three agents that adaptively generate rubrics, stratified test sets, and rubric scoring tailored to different domains.

Result: Evaluated on SOPBench across seven domains (Bank, DMV, Healthcare, Market, University, Library, Hotel), FM SO.P achieves 48.3% pass rate with a 32B model and 34.3% with a 7B model, matching the Qwen-2.5-72B-Instruct baseline (34.4%) with 10x fewer parameters.

Conclusion: The proposed approach effectively addresses SOP understanding challenges through structured progressive training and domain-adaptive evaluation, enabling smaller models to achieve performance comparable to much larger baselines.

Abstract: Standard Operating Procedures (SOPs) are critical for enterprise operations, yet existing language models struggle with SOP understanding and cross-domain generalization. Current methods fail because joint training cannot differentiate between reasoning capabilities that SOPs require: terminology precision, sequential ordering, and constraint reasoning. We propose FM SO.P, solving these challenges through two novelties. First, we introduce progressive task mixtures that build capabilities by stages across three task types with cumulative data: concept disambiguation for terminology precision, action sequence understanding for procedural correctness, and scenario-aware graph reasoning for conditional logic. Second, we propose an automatic multi-agent evaluation system consisting of three agents that adaptively generate rubrics, stratified test sets, and rubric scoring, adapting to domains (e.g., temporal constraints for DMV, regulatory compliance for banking). Evaluated on SOPBench across seven domains (Bank, DMV, Healthcare, Market, University, Library, Hotel), FM SO.P achieves 48.3% pass rate with our 32B model and 34.3% with our open-source 7B model, matching the Qwen-2.5-72B-Instruct baseline (34.4%) with 10x fewer parameters.

[7] Understanding Risk and Dependency in AI Chatbot Use from User Discourse

Jianfeng Zhu, Karin G. Coifman, Ruoming Jin

Main category: cs.CL

TL;DR: Computational thematic analysis of Reddit posts from AI harm communities reveals 14 themes and 5 experiential dimensions of AI-related psychological risk, with self-regulation difficulties being most prevalent and fear concentrated in autonomy/control concerns.

Motivation: Despite the increasing integration of generative AI into daily life, there is limited empirical understanding of how psychological risks emerge, are experienced, and are regulated by users. The study aims to provide evidence from real-world user discourse rather than laboratory or speculative contexts.

Method: Large-scale computational thematic analysis of posts (2023-2025) from two Reddit communities (r/AIDangers and r/ChatbotAddiction) using multi-agent, LLM-assisted thematic analysis grounded in Braun and Clarke’s reflexive framework. Applied emotion labeling using BERT-based classifier and visualized emotional profiles across dimensions.
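
For the emotion-labeling step, a sketch using the `transformers` pipeline API; the checkpoint below is a common public emotion classifier used as a stand-in, not necessarily the one the authors used:

```python
from transformers import pipeline

# Illustrative emotion labeling for Reddit posts (BERT-family classifier).
emotion = pipeline("text-classification",
                   model="j-hartmann/emotion-english-distilroberta-base",
                   top_k=1)
print(emotion("I can't stop talking to the chatbot and it scares me."))
# -> e.g. [[{'label': 'fear', 'score': ...}]]
```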

Result: Identified 14 recurring thematic categories synthesized into 5 higher-order experiential dimensions of AI-related psychological risk. Self-regulation difficulties emerged as most prevalent, with fear concentrated in concerns related to autonomy, control, and technical risk.

Conclusion: Provides early empirical evidence from lived user experience of how AI safety is perceived and emotionally experienced, offering foundation for future AI safety research, evaluation, and responsible governance.

Abstract: Generative AI systems are increasingly embedded in everyday life, yet empirical understanding of how psychological risk associated with AI use emerges, is experienced, and is regulated by users remains limited. We present a large-scale computational thematic analysis of posts collected between 2023 and 2025 from two Reddit communities, r/AIDangers and r/ChatbotAddiction, explicitly focused on AI-related harm and distress. Using a multi-agent, LLM-assisted thematic analysis grounded in Braun and Clarke’s reflexive framework, we identify 14 recurring thematic categories and synthesize them into five higher-order experiential dimensions. To further characterize affective patterns, we apply emotion labeling using a BERT-based classifier and visualize emotional profiles across dimensions. Our findings reveal five empirically derived experiential dimensions of AI-related psychological risk grounded in real-world user discourse, with self-regulation difficulties emerging as the most prevalent and fear concentrated in concerns related to autonomy, control, and technical risk. These results provide early empirical evidence from lived user experience of how AI safety is perceived and emotionally experienced outside laboratory or speculative contexts, offering a foundation for future AI safety research, evaluation, and responsible governance.

[8] Digital Linguistic Bias in Spanish: Evidence from Lexical Variation in LLMs

Yoshifumi Kawasaki

Main category: cs.CL

TL;DR: LLMs show systematic biases in capturing Spanish geographic lexical variation, with better recognition of certain varieties (Spain, Equatorial Guinea, Mexico/Central America, La Plata River) and poor performance on Chilean Spanish, unrelated to digital resource availability.

Motivation: To understand how well LLMs capture geographic lexical variation in Spanish, a language with substantial regional differences, and to investigate potential digital linguistic biases in these models.

Method: Treating LLMs as virtual informants, using two survey-style formats (Yes-No and multiple-choice questions) based on expert-curated database of Spanish lexical variation covering 900+ lexical items across 21 Spanish-speaking countries at country and dialect area levels.

Result: LLMs show systematic differences in representing Spanish varieties: better accuracy for Spain, Equatorial Guinea, Mexico & Central America, and La Plata River varieties; particularly poor performance on Chilean Spanish. Performance patterns not explained by country-level digital resource volume.

Conclusion: LLMs exhibit systematic biases in dialectal knowledge that go beyond data quantity, revealing digital linguistic bias in Spanish that requires attention in model development and evaluation.

Abstract: This study examines the extent to which Large Language Models (LLMs) capture geographic lexical variation in Spanish, a language that exhibits substantial regional variation. Treating LLMs as virtual informants, we probe their dialectal knowledge using two survey-style question formats: Yes-No questions and multiple-choice questions. To this end, we exploited a large-scale, expert-curated database of Spanish lexical variation. Our evaluation covers more than 900 lexical items across 21 Spanish-speaking countries and is conducted at both the country and dialectal area levels. Across both evaluation formats, the results reveal systematic differences in how LLMs represent Spanish language varieties. Lexical variation associated with Spain, Equatorial Guinea, Mexico & Central America, and the La Plata River is recognized more accurately by the models, while the Chilean variety proves particularly difficult for the models to distinguish. Importantly, differences in the volume of country-level digital resources do not account for these performance patterns, suggesting that factors beyond data quantity shape dialectal representation in LLMs. By providing a fine-grained, large-scale evaluation of geographic lexical variation, this work advances empirical understanding of dialectal knowledge in LLMs and contributes new evidence to discussions of Digital Linguistic Bias in Spanish.

[9] Unsupervised Cross-Lingual Part-of-Speech Tagging with Monolingual Corpora Only

Jianyu Zheng

Main category: cs.CL

TL;DR: Unsupervised cross-lingual POS tagging framework using UNMT to create pseudo-parallel data from monolingual corpora, with multi-source projection for improved accuracy.

Motivation: Low-resource languages lack POS-annotated data and parallel corpora needed for traditional cross-lingual POS tagging methods, requiring a fully unsupervised approach.

Method: Uses unsupervised neural machine translation (UNMT) to translate high-resource language sentences to low-resource language, creating pseudo-parallel pairs. Then applies standard POS tag projection via word alignment, enhanced with multi-source projection technique.
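
The projection step itself is simple once pseudo-parallel pairs and word alignments exist; below is a sketch of single-source projection plus multi-source calibration by majority vote (our reading of "multi-source projection"; tie-breaking and unaligned-token handling are assumptions):

```python
from collections import Counter

def project_pos_tags(src_tags, alignments, tgt_len, default="X"):
    """Copy each aligned source token's POS tag onto its target token.
    `alignments` is a list of (src_idx, tgt_idx) pairs from a word aligner."""
    tgt_tags = [default] * tgt_len
    for s, t in alignments:
        tgt_tags[t] = src_tags[s]
    return tgt_tags

def multi_source_vote(projections):
    """Calibrate projections from several source languages by majority vote
    at each target position; `projections` is a list of tag sequences."""
    return [Counter(tags).most_common(1)[0][0] for tags in zip(*projections)]
```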

Result: Achieves performance comparable to baseline cross-lingual POS tagger using parallel corpora, with multi-source projection providing 1.3% average improvement across 28 language pairs.

Conclusion: Proposed framework enables effective POS tagging for low-resource languages without parallel corpora, with multi-source projection further enhancing performance.

Abstract: Due to the scarcity of part-of-speech annotated data, existing studies on low-resource languages typically adopt unsupervised approaches for POS tagging. Among these, the POS tag projection with word alignment method transfers POS tags from a high-resource source language to a low-resource target language based on parallel corpora, making it particularly suitable for low-resource language settings. However, this approach relies heavily on parallel corpora, which are often unavailable for many low-resource languages. To overcome this limitation, we propose a fully unsupervised cross-lingual part-of-speech (POS) tagging framework that relies solely on monolingual corpora by leveraging an unsupervised neural machine translation (UNMT) system. This UNMT system first translates sentences from a high-resource language into a low-resource one, thereby constructing pseudo-parallel sentence pairs. Then, we train a POS tagger for the target language following the standard projection procedure based on word alignments. Moreover, we propose a multi-source projection technique to calibrate the projected POS tags on the target side, helping to train a more effective POS tagger. We evaluate our framework on 28 language pairs, covering four source languages (English, German, Spanish and French) and seven target languages (Afrikaans, Basque, Finnish, Indonesian, Lithuanian, Portuguese and Turkish). Experimental results show that our method can achieve performance comparable to the baseline cross-lingual POS tagger with parallel sentence pairs, and even exceeds it for certain target languages. Furthermore, our proposed multi-source projection technique further boosts performance, yielding an average improvement of 1.3% over previous methods.

[10] AgentSkiller: Scaling Generalist Agent Intelligence through Semantically Integrated Cross-Domain Data Synthesis

Zexu Sun, Bokai Ji, Hengyi Cai, Shuaiqiang Wang, Lei Wang, Guangxia Li, Xu Chen

Main category: cs.CL

TL;DR: AgentSkiller: Automated framework for synthesizing multi-turn interaction data across realistic domains to train LLM agents, improving function calling capabilities.

Motivation: Current methods for collecting training data for LLM agents are limited by privacy constraints (API logs) or lack diversity (scripted interactions), creating a bottleneck for scaling generalist intelligence capabilities.

Method: Uses DAG-based architecture with explicit state transitions for determinism. Builds domain ontology and Person-Centric Entity Graph, defines tool interfaces via Service Blueprints for MCP servers, populates environments with consistent databases and Domain Policies. Cross-domain fusion links services for complex tasks. Creates user tasks by verifying solution paths, filtering via execution-based validation, and generating queries using Persona-based Simulator.

Result: Synthesized ≈11K interaction samples; models trained on this dataset achieve significant improvements on function calling over baselines, especially in larger parameter regimes.

Conclusion: AgentSkiller provides an automated framework for generating high-quality, diverse interaction data that effectively improves LLM agent capabilities in function calling tasks.

Abstract: Large Language Model agents demonstrate potential in solving real-world problems via tools, yet generalist intelligence is bottlenecked by scarce high-quality, long-horizon data. Existing methods collect privacy-constrained API logs or generate scripted interactions lacking diversity, which struggle to produce data requisite for scaling capabilities. We propose AgentSkiller, a fully automated framework synthesizing multi-turn interaction data across realistic, semantically linked domains. It employs a DAG-based architecture with explicit state transitions to ensure determinism and recoverability. The pipeline builds a domain ontology and Person-Centric Entity Graph, defines tool interfaces via Service Blueprints for Model Context Protocol servers, and populates environments with consistent databases and strict Domain Policies. A cross-domain fusion mechanism links services to simulate complex tasks. Finally, the pipeline creates user tasks by verifying solution paths, filtering via execution-based validation, and generating queries using a Persona-based Simulator for automated rollout. This produces reliable environments with clear state changes. To demonstrate effectiveness, we synthesized ≈11K interaction samples; experimental results indicate that models trained on this dataset achieve significant improvements on function calling over baselines, particularly in larger parameter regimes.

[11] AfriNLLB: Efficient Translation Models for African Languages

Yasmin Moslem, Aman Kassahun Wassie, Amanuel Gizachew Abebe

Main category: cs.CL

TL;DR: AfriNLLB: Lightweight multilingual translation models for African languages using compression techniques (pruning & quantization) and knowledge distillation for efficient deployment in resource-constrained settings.

Motivation: To enable efficient deployment of translation models for African languages in resource-constrained settings by creating lightweight models that maintain performance while being significantly faster.

Method: Based on NLLB-200 600M model, compressed using iterative layer pruning and quantization, fine-tuned on curated parallel corpora for African languages with knowledge distillation from a larger teacher model.
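
One pruning iteration of the kind described might look like this for a Hugging Face NLLB checkpoint (illustrative only; the M2M100-style attribute layout and the choice of which layers to keep are assumptions, and fine-tuning with distillation would follow each iteration):

```python
import torch.nn as nn
from transformers import AutoModelForSeq2SeqLM

def prune_decoder_layers(model, keep):
    """Keep only the decoder layers whose indices are in `keep`."""
    layers = model.model.decoder.layers
    model.model.decoder.layers = nn.ModuleList(layers[i] for i in keep)
    model.config.decoder_layers = len(keep)
    return model

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
model = prune_decoder_layers(model, keep=[0, 2, 4, 6, 8, 10])  # drop every other layer
```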

Result: AfriNLLB models achieve performance comparable to baseline while being significantly faster, supporting 15 language pairs (30 translation directions) including major African languages and official African Union languages.

Conclusion: Successfully developed efficient translation models for African languages that balance performance and speed, with released models (Transformers and CTranslate2 versions) and training data to facilitate further research.

Abstract: In this work, we present AfriNLLB, a series of lightweight models for efficient translation from and into African languages. AfriNLLB supports 15 language pairs (30 translation directions), including Swahili, Hausa, Yoruba, Amharic, Somali, Zulu, Lingala, Afrikaans, Wolof, and Egyptian Arabic, as well as other African Union official languages such as Arabic (MSA), French, Portuguese, and Spanish. Our training data covers bidirectional translation between English and 13 languages, and between French and two languages (Lingala and Wolof). AfriNLLB models are based on NLLB-200 600M, which we compress using iterative layer pruning and quantization. We fine-tune the pruned models on parallel corpora we curated for African languages, employing knowledge distillation from a larger teacher model. Our work aims at enabling efficient deployment of translation models for African languages in resource-constrained settings. Our evaluation results demonstrate that AfriNLLB models achieve performance comparable to the baseline while being significantly faster. We release two versions of the AfriNLLB models, a Transformers version that allows further fine-tuning and a CTranslate2 version for efficient inference. Moreover, we release all the training data that we used for fine-tuning the baseline and pruned models to facilitate further research.

[12] MEGConformer: Conformer-Based MEG Decoder for Robust Speech and Phoneme Classification

Xabier de Zuazo, Ibon Saratxaga, Eva Navas

Main category: cs.CL

TL;DR: Compact Conformer-based decoders for MEG-based speech detection and phoneme classification achieve state-of-the-art results on LibriBrain 2025 benchmark

Motivation: To develop scalable brain-computer interfaces by decoding speech-related information from non-invasive MEG signals, addressing the need for efficient speech detection and phoneme classification from neural data

Method: Adapts compact Conformer architecture to raw 306-channel MEG signals with lightweight convolutional projection layer and task-specific heads. Uses MEG-oriented SpecAugment for speech detection, inverse-square-root class weighting and dynamic grouping loader for phoneme classification, and instance-level normalization to mitigate distribution shifts
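
Two of the tricks called out above are easy to state precisely; both snippets are our sketches of the described ideas (the per-channel axis for instance normalization is an assumption):

```python
import numpy as np

def inv_sqrt_class_weights(label_counts):
    """Inverse-square-root class weighting for imbalanced phoneme classes:
    rarer phonemes get larger, but not extreme, loss weights."""
    w = 1.0 / np.sqrt(np.asarray(label_counts, dtype=np.float64))
    return w / w.mean()  # normalize to mean 1

def instance_normalize(meg):
    """Per-example, per-channel normalization of raw MEG (channels x time),
    used to mitigate distribution shift on the holdout split."""
    mu = meg.mean(axis=-1, keepdims=True)
    sd = meg.std(axis=-1, keepdims=True) + 1e-8
    return (meg - mu) / sd
```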

Result: Achieved 88.9% accuracy for Speech Detection and 65.8% accuracy for Phoneme Classification on the LibriBrain 2025 benchmark, winning the Phoneme Classification Standard track

Conclusion: The compact Conformer-based approach effectively decodes speech information from MEG signals, demonstrating potential for brain-computer interfaces with state-of-the-art performance on benchmark tasks

Abstract: Decoding speech-related information from non-invasive MEG is a key step toward scalable brain-computer interfaces. We present compact Conformer-based decoders on the LibriBrain 2025 PNPL benchmark for two core tasks: Speech Detection and Phoneme Classification. Our approach adapts a compact Conformer to raw 306-channel MEG signals, with a lightweight convolutional projection layer and task-specific heads. For Speech Detection, a MEG-oriented SpecAugment provided a first exploration of MEG-specific augmentation. For Phoneme Classification, we used inverse-square-root class weighting and a dynamic grouping loader to handle 100-sample averaged examples. In addition, a simple instance-level normalization proved critical to mitigate distribution shifts on the holdout split. Using the official Standard track splits and F1-macro for model selection, our best systems achieved 88.9% (Speech) and 65.8% (Phoneme) on the leaderboard, winning the Phoneme Classification Standard track. For further implementation details, the technical documentation, source code, and checkpoints are available at https://github.com/neural2speech/libribrain-experiments.

[13] BiasScope: Towards Automated Detection of Bias in LLM-as-a-Judge Evaluation

Peng Lai, Zhihao Ou, Yong Wang, Longyue Wang, Jian Yang, Yun Chen, Guanhua Chen

Main category: cs.CL

TL;DR: BiasScope is an LLM-driven framework that automatically discovers potential biases in LLM-as-a-Judge evaluations, transforming bias discovery from manual to automated exploration and revealing significant robustness issues.

Motivation: Current LLM-as-a-Judge evaluations suffer from bias issues, but existing approaches only address known biases through manual effort and predefined lists. There's a lack of automated, systematic exploration for unknown biases, which is crucial for improving evaluation robustness and reliability.

Method: Proposes BiasScope, an LLM-driven framework for automatically discovering potential biases at scale. It can uncover biases across different model families and scales, validated on JudgeBench dataset. Also introduces JudgeBench-Pro, an extended, more challenging benchmark for evaluating LLM-as-a-judge robustness.

Result: BiasScope successfully discovers potential biases that existing approaches miss. On JudgeBench-Pro, even powerful LLMs as evaluators show error rates above 50%, highlighting severe robustness issues in current LLM-as-a-Judge evaluations.

Conclusion: BiasScope enables active, comprehensive automated bias discovery, overcoming limitations of manual approaches. The high error rates on JudgeBench-Pro underscore the urgent need to strengthen evaluation robustness and mitigate biases in LLM-as-a-Judge systems.

Abstract: LLM-as-a-Judge has been widely adopted across various research and practical applications, yet the robustness and reliability of its evaluation remain a critical issue. A core challenge it faces is bias, which has primarily been studied in terms of known biases and their impact on evaluation outcomes, while automated and systematic exploration of potential unknown biases is still lacking. Nevertheless, such exploration is crucial for enhancing the robustness and reliability of evaluations. To bridge this gap, we propose BiasScope, an LLM-driven framework for automatically and at scale discovering potential biases that may arise during model evaluation. BiasScope can uncover potential biases across different model families and scales, with its generality and effectiveness validated on the JudgeBench dataset. It overcomes the limitations of existing approaches, transforming bias discovery from a passive process relying on manual effort and predefined bias lists into an active and comprehensive automated exploration. Moreover, based on BiasScope, we propose JudgeBench-Pro, an extended version of JudgeBench and a more challenging benchmark for evaluating the robustness of LLM-as-a-judge. Strikingly, even powerful LLMs as evaluators show error rates above 50% on JudgeBench-Pro, underscoring the urgent need to strengthen evaluation robustness and to mitigate potential biases further.

[14] Contractual Deepfakes: Can Large Language Models Generate Contracts?

Eliza Mik

Main category: cs.CL

TL;DR: LLMs cannot understand meaning, context, or reason, making them unsuitable for legal contract drafting despite superficial plausibility

Motivation: To challenge the misconception that LLMs can effectively assist with legal contract drafting, arguing that statistical word prediction differs fundamentally from legal reasoning and contextual understanding

Method: Conceptual analysis comparing LLM capabilities (statistical word prediction) with requirements for legal contract drafting (context understanding, reasoning, legal knowledge)

Result: LLMs generate generic, superficially plausible contracts that may be useless assemblages of inconsistent provisions or enforceable but unsuitable for specific transactions

Conclusion: LLMs do not threaten the legal industry’s viability; they lack the understanding, context awareness, and reasoning abilities needed for meaningful legal work

Abstract: Notwithstanding their unprecedented ability to generate text, LLMs do not understand the meaning of words, have no sense of context and cannot reason. Their output constitutes an approximation of statistically dominant word patterns. And yet, the drafting of contracts is often presented as a typical legal task that could be facilitated by this technology. This paper seeks to put an end to such unreasonable ideas. Predicting words differs from using language in the circumstances of specific transactions and reconstituting common contractual phrases differs from reasoning about the law. LLMs seem to be able to generate generic and superficially plausible contractual documents. In the cold light of day, such documents may turn out to be useless assemblages of inconsistent provisions or contracts that are enforceable but unsuitable for a given transaction. This paper casts a shadow on the simplistic assumption that LLMs threaten the continued viability of the legal industry.

[15] Effective vocabulary expanding of multilingual language models for extremely low-resource languages

Jianyu Zheng

Main category: cs.CL

TL;DR: Extending multilingual pre-trained language models to unsupported low-resource languages through vocabulary expansion with bilingual dictionary initialization

Motivation: Existing mPLMs support many languages but not all low-resource ones; need methods to extend them to previously unsupported languages without degrading source language performance

Method: Expand vocabulary using target language corpus, screen out source-language-biased subset, initialize new vocabulary via bilingual dictionaries, then continue pre-training on target language
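
The dictionary-based initialization can be sketched as follows, assuming the new target-language tokens have already been added to the tokenizer and the embedding matrix resized (the mean-of-subwords rule and fallback behavior are our assumptions; the source-bias screening step is omitted):

```python
import torch

@torch.no_grad()
def init_from_dictionary(emb, tokenizer, new_tokens, bilingual_dict):
    """Initialize each new token's embedding as the mean embedding of the
    subword pieces of its dictionary translations (e.g., into English)."""
    for tok in new_tokens:
        piece_ids = [i for word in bilingual_dict.get(tok, [])
                     for i in tokenizer(word, add_special_tokens=False)["input_ids"]]
        if piece_ids:  # leave the random init in place if no translation exists
            emb.weight[tokenizer.convert_tokens_to_ids(tok)] = \
                emb.weight[piece_ids].mean(dim=0)
```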

Result: Outperforms random initialization baseline by 0.54% on POS tagging and 2.60% on NER; robust to training corpus selection; source language performance preserved

Conclusion: Effective method for extending mPLMs to unsupported low-resource languages using bilingual dictionary initialization maintains source language capabilities

Abstract: Multilingual pre-trained language models (mPLMs) offer significant benefits for many low-resource languages. To further expand the range of languages these models can support, many works focus on continued pre-training of these models. However, few works address how to extend mPLMs to low-resource languages that were previously unsupported. To tackle this issue, we expand the model’s vocabulary using a target language corpus. We then screen out a subset from the model’s original vocabulary, which is biased towards representing the source language (e.g., English), and utilize bilingual dictionaries to initialize the representations of the expanded vocabulary. Subsequently, we continue to pre-train the mPLMs using the target language corpus, based on the representations of this expanded vocabulary. Experimental results show that our proposed method outperforms the baseline, which uses a randomly initialized expanded vocabulary for continued pre-training, in POS tagging and NER tasks, achieving improvements of 0.54% and 2.60%, respectively. Furthermore, our method demonstrates high robustness in selecting the training corpora, and the models’ performance on the source language does not degrade after continued pre-training.

[16] Are Language Models Sensitive to Morally Irrelevant Distractors?

Andrew Shaw, Christina Hahn, Catherine Rasgaitis, Yash Mishra, Alisa Liu, Natasha Jaques, Yulia Tsvetkov, Amy X. Zhang

Main category: cs.CL

TL;DR: LLMs show human-like moral judgment instability when exposed to morally irrelevant situational factors (distractors), shifting responses by over 30% even in clear scenarios.

Motivation: To investigate whether LLMs exhibit cognitive moral biases similar to those of humans, inspired by moral psychology research showing that human moral judgments are influenced by morally irrelevant situational factors.

Method: Created a multimodal dataset of 60 “moral distractors” from existing psychological datasets of emotionally-valenced images and narratives. Injected these distractors into existing moral benchmarks to measure their effects on LLM responses.

Result: Moral distractors can shift the moral judgements of LLMs by over 30% even in low-ambiguity scenarios, demonstrating significant sensitivity to morally irrelevant contextual factors.

Conclusion: LLMs exhibit human-like moral judgment instability, highlighting the need for more contextual moral evaluations and nuanced cognitive moral modeling of LLMs beyond current benchmarks.

Abstract: With the rapid development and uptake of large language models (LLMs) across high-stakes settings, it is increasingly important to ensure that LLMs behave in ways that align with human values. Existing moral benchmarks prompt LLMs with value statements, moral scenarios, or psychological questionnaires, with the implicit underlying assumption that LLMs report somewhat stable moral preferences. However, moral psychology research has shown that human moral judgements are sensitive to morally irrelevant situational factors, such as smelling cinnamon rolls or the level of ambient noise, thereby challenging moral theories that assume the stability of human moral judgements. Here, we draw inspiration from this “situationist” view of moral psychology to evaluate whether LLMs exhibit similar cognitive moral biases to humans. We curate a novel multimodal dataset of 60 “moral distractors” from existing psychological datasets of emotionally-valenced images and narratives which have no moral relevance to the situation presented. After injecting these distractors into existing moral benchmarks to measure their effects on LLM responses, we find that moral distractors can shift the moral judgements of LLMs by over 30% even in low-ambiguity scenarios, highlighting the need for more contextual moral evaluations and more nuanced cognitive moral modeling of LLMs.

[17] Breaking the Pre-Sampling Barrier: Activation-Informed Difficulty-Aware Self-Consistency

Taewoong Yoon, Geunyeong Jeong, Geon Park, Sihyeong Yeom, Harksoo Kim

Main category: cs.CL

TL;DR: ACTSC reduces inference costs in Self-Consistency decoding by using neural activations to estimate problem difficulty and dynamically adjust sampling, eliminating need for pre-sampling or extra model calls.

Motivation: Self-Consistency improves LLM reasoning but has high inference costs from large sampling. Difficulty-Adaptive SC reduces costs but requires pre-sampling and repeated difficulty estimation for each dataset, creating computational overhead.

Method: ACTSC uses feed-forward network neuron activations as internal difficulty signals to train a lightweight difficulty estimation probe. This probe dynamically adjusts SC sampling without additional token generation or model calls, and generalizes to new datasets.
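
A lightweight probe of the kind described could be as simple as logistic regression on pooled FFN activations; the feature pooling, labeling scheme, and budget mapping below are our assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_difficulty_probe(activations: np.ndarray, is_hard: np.ndarray):
    """activations: (n, d) pooled FFN neuron activations per question;
    is_hard: (n,) binary labels (e.g., a single greedy sample was wrong)."""
    return LogisticRegression(max_iter=1000).fit(activations, is_hard)

def sample_budget(probe, activation: np.ndarray, n_min=1, n_max=40) -> int:
    """Map the probe's difficulty estimate to a Self-Consistency sample count."""
    p_hard = probe.predict_proba(activation.reshape(1, -1))[0, 1]
    return int(round(n_min + p_hard * (n_max - n_min)))
```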

Result: Experiments on five benchmarks show ACTSC effectively reduces inference costs while maintaining accuracy compared to existing methods, demonstrating efficient difficulty-aware sampling.

Conclusion: ACTSC provides a computationally efficient approach to difficulty-aware Self-Consistency that eliminates pre-sampling overhead while maintaining reasoning performance.

Abstract: Self-Consistency (SC) is an effective decoding strategy that improves the reasoning performance of Large Language Models (LLMs) by generating multiple chain-of-thought reasoning paths and selecting the final answer via majority voting. However, it suffers from substantial inference costs because it requires a large number of samples. To mitigate this issue, Difficulty-Adaptive Self-Consistency (DSC) was proposed to reduce unnecessary token usage for easy problems by adjusting the number of samples according to problem difficulty. However, DSC requires additional model calls and pre-sampling to estimate difficulty, and this process is repeated when applying to each dataset, leading to significant computational overhead. In this work, we propose Activation-Informed Difficulty-Aware Self-Consistency (ACTSC) to address these limitations. ACTSC leverages internal difficulty signals reflected in the feed-forward network neuron activations to construct a lightweight difficulty estimation probe, without any additional token generation or model calls. The probe dynamically adjusts the number of samples for SC and can be applied to new datasets without requiring pre-sampling for difficulty estimation. To validate its effectiveness, we conduct experiments on five benchmarks. Experimental results show that ACTSC effectively reduces inference costs while maintaining accuracy relative to existing methods.

[18] Evaluating Social Bias in RAG Systems: When External Context Helps and Reasoning Hurts

Shweta Parihar, Lu Cheng

Main category: cs.CL

TL;DR: RAG reduces social bias in LLMs by incorporating external context, while Chain-of-Thought increases bias despite improving accuracy.

Motivation: To evaluate and understand social bias implications in Retrieval-Augmented Generation (RAG) architectures, which remain susceptible to bias despite using external knowledge sources.

Method: Extensive experiments across various retrieval corpora, LLMs, and bias evaluation datasets covering 13+ bias types, plus integration of Chain-of-Thought prompting to analyze reasoning processes.

Result: RAG reduces bias by countering stereotype-driven predictions with external context, while CoT increases bias despite improving accuracy, revealing a trade-off between accuracy and fairness.

Conclusion: External context in RAG can improve fairness, but bias-aware reasoning frameworks are needed to mitigate the bias-accuracy trade-off revealed by CoT prompting.

Abstract: Social biases inherent in large language models (LLMs) raise significant fairness concerns. Retrieval-Augmented Generation (RAG) architectures, which retrieve external knowledge sources to enhance the generative capabilities of LLMs, remain susceptible to the same bias-related challenges. This work focuses on evaluating and understanding the social bias implications of RAG. Through extensive experiments across various retrieval corpora, LLMs, and bias evaluation datasets, encompassing more than 13 different bias types, we surprisingly observe a reduction in bias in RAG. This suggests that the inclusion of external context can help counteract stereotype-driven predictions, potentially improving fairness by diversifying the contextual grounding of the model’s outputs. To better understand this phenomenon, we then explore the model’s reasoning process by integrating Chain-of-Thought (CoT) prompting into RAG while assessing the faithfulness of the model’s CoT. Our experiments reveal that the model’s bias inclinations shift between stereotype and anti-stereotype responses as more contextual information is incorporated from the retrieved documents. Interestingly, we find that while CoT enhances accuracy, contrary to the bias reduction observed with RAG, it increases overall bias across datasets, highlighting the need for bias-aware reasoning frameworks that can mitigate this trade-off.

[19] Conceptual Cultural Index: A Metric for Cultural Specificity via Relative Generality

Takumi Ohashi, Hitoshi Iyatomi

Main category: cs.CL

TL;DR: CCI is a method to measure cultural specificity at sentence level by comparing generality estimates within target culture vs. other cultures, validated on 400 sentences with better performance than direct LLM scoring.

Motivation: LLMs are increasingly used in multicultural settings but lack systematic evaluation of cultural specificity at sentence level. Current methods don't provide interpretable, operational measures of cultural specificity.

Method: Propose Conceptual Cultural Index (CCI) defined as difference between generality estimate within target culture and average generality estimate across other cultures. Users control cultural scope via comparison settings. Validated on 400 sentences (200 culture-specific, 200 general).

Result: CCI score distribution shows anticipated pattern: higher for culture-specific sentences, lower for general ones. For binary separability, CCI outperforms direct LLM scoring with >10-point AUC improvement for culture-specialized models.

Conclusion: CCI provides interpretable, operational measure of cultural specificity at sentence level, enabling better evaluation of LLMs in multicultural contexts and outperforming existing methods.

Abstract: Large language models (LLMs) are increasingly deployed in multicultural settings; however, systematic evaluation of cultural specificity at the sentence level remains underexplored. We propose the Conceptual Cultural Index (CCI), which estimates cultural specificity at the sentence level. CCI is defined as the difference between the generality estimate within the target culture and the average generality estimate across other cultures. This formulation enables users to operationally control the scope of culture via comparison settings and provides interpretability, since the score derives from the underlying generality estimates. We validate CCI on 400 sentences (200 culture-specific and 200 general), and the resulting score distribution exhibits the anticipated pattern: higher for culture-specific sentences and lower for general ones. For binary separability, CCI outperforms direct LLM scoring, yielding more than a 10-point improvement in AUC for models specialized to the target culture. Our code is available at https://github.com/IyatomiLab/CCI .
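
A minimal sketch of the stated formulation, where `estimate_generality` is a hypothetical stand-in for however generality is scored (e.g., by prompting an LLM):

```python
def conceptual_cultural_index(sentence: str, target_culture: str,
                              comparison_cultures: list[str],
                              estimate_generality) -> float:
    """CCI = generality within the target culture minus the mean generality
    across the user-chosen comparison cultures."""
    g_target = estimate_generality(sentence, culture=target_culture)
    g_others = sum(estimate_generality(sentence, culture=c)
                   for c in comparison_cultures) / len(comparison_cultures)
    return g_target - g_others  # high => culture-specific, low => general
```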

[20] NOWJ @BioCreative IX ToxHabits: An Ensemble Deep Learning Approach for Detecting Substance Use and Contextual Information in Clinical Texts

Huu-Huy-Hoang Tran, Gia-Bao Duong, Quoc-Viet-Anh Tran, Thi-Hai-Yen Vuong, Hoang-Quynh Le

Main category: cs.CL

TL;DR: A multi-output ensemble system for detecting toxic substance use in Spanish clinical texts using BETO with CRF layer and sentence filtering, achieving high F1 scores for trigger and argument detection.

Motivation: Extracting drug use information from unstructured EHRs is challenging, and while LLMs show promise, their clinical use is limited by trust, control, and efficiency concerns, especially in low-resource Spanish clinical settings.

Method: Multi-output ensemble system integrating BETO with CRF layer for sequence labeling, employing diverse training strategies and sentence filtering to boost precision for both ToxNER and ToxUse subtasks.

Result: Top run achieved 0.94 F1 and 0.97 precision for Trigger Detection, and 0.91 F1 for Argument Detection in the ToxHabits Shared Task.

Conclusion: The proposed system effectively addresses toxic substance use detection in Spanish clinical texts, demonstrating strong performance in a low-resource, domain-specific setting.

Abstract: Extracting drug use information from unstructured Electronic Health Records remains a major challenge in clinical Natural Language Processing. While Large Language Models demonstrate advances, their use in clinical NLP is limited by concerns over trust, control, and efficiency. To address this, we present the NOWJ submission to the ToxHabits Shared Task at BioCreative IX. This task targets the detection of toxic substance use and contextual attributes in Spanish clinical texts, a domain-specific, low-resource setting. We propose a multi-output ensemble system tackling both Subtask 1 - ToxNER and Subtask 2 - ToxUse. Our system integrates BETO with a CRF layer for sequence labeling, employs diverse training strategies, and uses sentence filtering to boost precision. Our top run achieved 0.94 F1 and 0.97 precision for Trigger Detection, and 0.91 F1 for Argument Detection.

[21] Listen to the Layers: Mitigating Hallucinations with Inter-Layer Disagreement

Koduvayur Subbalakshmi, Sabbir Hossain Ujjal, Venkata Krishna Teja Mangichetty, Nastaran Jamalipour Soofi

Main category: cs.CL

TL;DR: CoCoA is a training-free decoding algorithm that reduces LLM hallucinations by analyzing representational instability in middle layers and penalizing internally inconsistent outputs.

Motivation: LLMs often generate fluent but factually incorrect text (hallucinations), which undermines their reliability. The authors hypothesize that factual correctness correlates with representational stability across model layers.

Method: Proposes CoCoA decoder that monitors confusion and consistency signals in middle layers during inference. Uses two metrics to quantify representational instability and penalizes outputs with high internal confusion. Also introduces CoCoA-SIG variant with self-information gating to dynamically modulate penalty for high-surprise generations.

Result: Extensive experiments on question-answering, summarization, and code generation tasks show CoCoA significantly improves factual correctness across multiple model families (Llama-3, Qwen-2.5, Mistral) without requiring retraining.

Conclusion: CoCoA offers an effective, broadly applicable method for enhancing LLM trustworthiness at inference time by leveraging model-intrinsic signals from middle layers to reduce hallucinations.

Abstract: Pretrained Large Language Models (LLMs) are prone to generating fluent yet factually incorrect text-a phenomenon known as hallucinations, undermining their reliability and utility in downstream tasks. We hypothesize that a generated text span’s factuality is correlated with its representational instability across the model’s internal layers. Based on this, we propose the CoCoA (Confusion and Consistency Aware) decoder, a novel, training-free decoding algorithm that mitigates hallucinations at inference time by listening to these signals in the middle layers. We propose two metrics to quantify this instability in the middle layers, and use it to penalize outputs that exhibit high internal confusion, thereby steering the model towards more internally consistent and factually grounded outputs. We further propose a self-information gated variant, CoCoA-SIG, that dynamically modulates this penalty to selectively target high-surprise, unstable generations. Extensive experiments on diverse tasks, including question-answering, summarization and code generation demonstrate that CoCoA significantly improves factual correctness across multiple model families (e.g., Llama-3, Qwen-2.5, Mistral). By leveraging model-intrinsic signals, CoCoA offers an effective and broadly applicable method for enhancing the trustworthiness of LLMs at inference time, without requiring any model retraining.
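
The paper defines its own instability metrics; as a rough illustration of the inter-layer-disagreement idea, the sketch below uses a simple stand-in: average KL divergence between next-token distributions read off consecutive middle layers through the unembedding matrix (a "logit lens"). This is an assumption for illustration, not CoCoA's actual metric.

```python
import torch
import torch.nn.functional as F

def interlayer_disagreement(hidden_states: list[torch.Tensor],
                            unembed: torch.Tensor,
                            middle: slice = slice(8, 24)) -> torch.Tensor:
    # hidden_states: per-layer hidden states at the current position,
    # each of shape [batch, hidden_dim]; unembed: [hidden_dim, vocab].
    layers = hidden_states[middle]
    kls = []
    for h_prev, h_next in zip(layers[:-1], layers[1:]):
        p = F.log_softmax(h_prev @ unembed, dim=-1)
        q = F.log_softmax(h_next @ unembed, dim=-1)
        kls.append(F.kl_div(q, p, log_target=True, reduction="batchmean"))
    return torch.stack(kls).mean()

# Candidate continuations with high internal disagreement would then be
# down-weighted, e.g. score(c) = logprob(c) - alpha * disagreement(c).
```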

[22] Where-to-Unmask: Ground-Truth-Guided Unmasking Order Learning for Masked Diffusion Language Models

Hikaru Asano, Tadashi Kozuno, Kuniaki Saito, Yukino Baba

Main category: cs.CL

TL;DR: Gt-Margin improves masked diffusion language models by using ground-truth token probability margins to create an oracle unmasking order, then training a planner to imitate this ordering for better text generation, especially on reasoning tasks.

Motivation: Current masked diffusion language models rely on heuristic confidence measures or expensive reinforcement learning for determining unmasking order during inference. The paper aims to develop a more principled approach to decide "where-to-unmask" that improves generation quality without modifying the core token prediction model.

Method: Introduces Gt-Margin, a position-wise score based on ground-truth token probability margins (difference between correct token probability and strongest alternative). This creates an oracle unmasking order that prioritizes easier positions first. Then trains a supervised unmasking planner via learning-to-rank to imitate this oracle ordering from masked contexts, which integrates into standard MDLM sampling.

Result: The oracle unmasking order significantly enhances final generation quality, particularly on logical reasoning benchmarks. The trained planner improves reasoning accuracy without modifying the underlying token prediction model.

Conclusion: Gt-Margin provides an effective approach for learning unmasking order in masked diffusion language models, offering improvements in generation quality and reasoning performance through better planning of where-to-unmask decisions.

Abstract: Masked Diffusion Language Models (MDLMs) generate text by iteratively filling masked tokens, requiring two coupled decisions at each step: which positions to unmask (where-to-unmask) and which tokens to place (what-to-unmask). While standard MDLM training directly optimizes token prediction (what-to-unmask), inference-time unmasking orders (where-to-unmask) are typically determined by heuristic confidence measures or trained through reinforcement learning with costly on-policy rollouts. To address this, we introduce Gt-Margin, a position-wise score derived from ground-truth tokens, defined as the probability margin between the correct token and its strongest alternative. Gt-Margin yields an oracle unmasking order that prioritizes easier positions first under each partially masked state. We demonstrate that leveraging this oracle unmasking order significantly enhances final generation quality, particularly on logical reasoning benchmarks. Building on this insight, we train a supervised unmasking planner via learning-to-rank to imitate the oracle ordering from masked contexts. The resulting planner integrates into standard MDLM sampling to select where-to-unmask, improving reasoning accuracy without modifying the token prediction model.
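
Since the score is defined explicitly (the probability margin between the ground-truth token and its strongest alternative), a small sketch is straightforward; the tensor shapes are assumptions for illustration.

```python
import torch

def gt_margin_order(probs: torch.Tensor, gt_tokens: torch.Tensor,
                    masked_positions: torch.Tensor) -> torch.Tensor:
    """Return masked positions sorted easiest-first by Gt-Margin.

    probs: [seq_len, vocab] model distributions under the current mask;
    gt_tokens: [seq_len] ground-truth token ids; masked_positions: [n] indices.
    """
    gt_at_masked = gt_tokens[masked_positions]
    p_gt = probs[masked_positions, gt_at_masked]
    alternatives = probs[masked_positions].clone()
    alternatives[torch.arange(len(masked_positions)), gt_at_masked] = 0.0
    margin = p_gt - alternatives.max(dim=-1).values
    return masked_positions[margin.argsort(descending=True)]  # easiest first
```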

[23] EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies

Xavier Hu, Jinxiang Xia, Shengze Xu, Kangqi Song, Yishuo Yuan, Guibin Zhang, Jincheng Ren, Boyu Feng, Li Lu, Tieyong Zeng, Jiaheng Liu, Minghao Liu, Yuchen Elenor Jiang, Wei Wang, He Zhu, Wangchunshu Zhou

Main category: cs.CL

TL;DR: EcoGym is a benchmark for evaluating LLM-based agents in continuous plan-and-execute decision making within interactive economic environments, focusing on long-term strategic coherence and business outcomes.

Motivation: Current evaluation frameworks for LLM-based agents are limited by being episodic, domain-specific, or insufficiently grounded in persistent economic dynamics, lacking proper assessment of long-horizon planning capabilities in realistic economic settings.

Method: Developed EcoGym with three diverse economic environments (Vending, Freelance, Operation) featuring unified decision-making interfaces, standardized processes, and budgeted actions over effectively unbounded horizons (1000+ steps). Evaluation focuses on business-relevant outcomes like net worth, income, and DAU.

Result: Experiments across eleven leading LLMs reveal systematic tensions: no single model dominates across all three scenarios, with models showing significant suboptimality in either high-level strategies or efficient action execution.

Conclusion: EcoGym serves as an open, extensible testbed for transparent long-horizon agent evaluation and studying controllability-utility trade-offs in realistic economic settings, highlighting the need for improved planning capabilities in LLM-based agents.

Abstract: Long-horizon planning is widely recognized as a core capability of autonomous LLM-based agents; however, current evaluation frameworks suffer from being largely episodic, domain-specific, or insufficiently grounded in persistent economic dynamics. We introduce EcoGym, a generalizable benchmark for continuous plan-and-execute decision making in interactive economies. EcoGym comprises three diverse environments: Vending, Freelance, and Operation, implemented in a unified decision-making process with standardized interfaces, and budgeted actions over an effectively unbounded horizon (1000+ steps when evaluated over 365 day-loops). The evaluation of EcoGym is based on business-relevant outcomes (e.g., net worth, income, and DAU), targeting long-term strategic coherence and robustness under partial observability and stochasticity. Experiments across eleven leading LLMs expose a systematic tension: no single model dominates across all three scenarios. Critically, we find that models exhibit significant suboptimality in either high-level strategies or efficient action execution. EcoGym is released as an open, extensible testbed for transparent long-horizon agent evaluation and for studying controllability-utility trade-offs in realistic economic settings.

[24] The CLEF-2026 CheckThat! Lab: Advancing Multilingual Fact-Checking

Julia Maria Struß, Sebastian Schellhammer, Stefan Dietze, Venktesh V, Vinay Setty, Tanmoy Chakraborty, Preslav Nakov, Avishek Anand, Primakov Chungkham, Salim Hafid, Dhruv Sahnan, Konstantin Todorov

Main category: cs.CL

TL;DR: CheckThat! lab focuses on developing technologies to combat disinformation through verification pipeline tasks including source retrieval for scientific claims, fact-checking numerical/temporal claims, and generating full fact-checking articles.

Motivation: To advance technologies for combating disinformation and manipulation in online communication across multiple languages and platforms, building on previous editions that focused on core verification tasks.

Method: The lab organizes tasks around a verification pipeline: Task 1 focuses on source retrieval for scientific web claims, Task 2 adds reasoning for fact-checking numerical and temporal claims, and Task 3 expands to generating full fact-checking articles.

Result: The paper presents the design and scope of the CheckThat! lab’s 2026 edition tasks, which represent challenging classification, retrieval, and generation problems at document and span levels in multilingual settings.

Conclusion: The CheckThat! lab continues to advance disinformation combat technologies through structured verification pipeline tasks that address increasingly complex challenges in source retrieval, reasoning-based fact-checking, and article generation.

Abstract: The CheckThat! lab aims to advance the development of innovative technologies combating disinformation and manipulation efforts in online communication across a multitude of languages and platforms. While early editions focused on core tasks of the verification pipeline (check-worthiness, evidence retrieval, and verification), the past three editions added tasks linked to the verification process. In this year’s edition, the verification pipeline is at the center again with the following tasks: Task 1 on source retrieval for scientific web claims (a follow-up of the 2025 edition), Task 2 on fact-checking numerical and temporal claims, which adds a reasoning component to the 2025 edition, and Task 3, which expands the verification pipeline with the generation of full fact-checking articles. These tasks represent challenging classification and retrieval problems as well as generation challenges at the document and span level, including multilingual settings.

[25] Knowledge Integration Decay in Search-Augmented Reasoning of Large Language Models

Sangwon Yu, Ik-hwan Kim, Donghun Kang, Bongkyu Hwang, Junhwa Choi, Suk-hoon Jung, Seungki Hong, Taehee Lee, Sungroh Yoon

Main category: cs.CL

TL;DR: SAKE addresses Knowledge Integration Decay in LLMs by anchoring retrieved knowledge at both ends of reasoning chains to improve knowledge utilization during search-augmented reasoning.

Motivation: The paper identifies Knowledge Integration Decay (KID) - a bottleneck where LLMs fail to effectively integrate retrieved external knowledge into long reasoning chains, limiting performance even when relevant information is available.

Method: Proposes Self-Anchored Knowledge Encoding (SAKE), a training-free inference-time strategy that anchors retrieved knowledge at both the beginning and end of the reasoning process to prevent it from being overshadowed by prior context.

Result: Extensive experiments on multi-hop QA and complex reasoning benchmarks show SAKE significantly mitigates KID and improves performance, offering a lightweight yet effective solution for knowledge integration.

Conclusion: SAKE provides an effective, training-free approach to stabilize knowledge utilization in agentic LLMs, addressing the critical bottleneck of knowledge integration decay in search-augmented reasoning.

Abstract: Modern Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks by employing search-augmented reasoning to incorporate external knowledge into long chains of thought. However, we identify a critical yet underexplored bottleneck in this paradigm, termed Knowledge Integration Decay (KID). Specifically, we observe that as the length of reasoning generated before search grows, models increasingly fail to integrate retrieved evidence into subsequent reasoning steps, limiting performance even when relevant information is available. To address this, we propose Self-Anchored Knowledge Encoding (SAKE), a training-free inference-time strategy designed to stabilize knowledge utilization. By anchoring retrieved knowledge at both the beginning and end of the reasoning process, SAKE prevents it from being overshadowed by prior context, thereby preserving its semantic integrity. Extensive experiments on multi-hop QA and complex reasoning benchmarks demonstrate that SAKE significantly mitigates KID and improves performance, offering a lightweight yet effective solution for knowledge integration in agentic LLMs.
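
A minimal sketch of the anchoring idea: retrieved evidence is injected both before and after the reasoning generated so far, so a long prior context cannot bury it. The prompt template below is an illustrative assumption, not the paper's exact format.

```python
def sake_prompt(question: str, reasoning_so_far: str, retrieved: list[str]) -> str:
    evidence = "\n".join(f"- {doc}" for doc in retrieved)
    return (
        f"Question: {question}\n\n"
        f"Retrieved evidence:\n{evidence}\n\n"             # anchor at the beginning
        f"Reasoning so far:\n{reasoning_so_far}\n\n"
        f"Retrieved evidence (recalled):\n{evidence}\n\n"  # anchor at the end
        "Continue the reasoning using the evidence above:"
    )
```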

[26] UniARM: Towards a Unified Autoregressive Reward Model for Multi-Objective Test-Time Alignment

Hongyan Xie, Yikun Ban, Ruiyu Fang, Zixuan Huang, Deqing Wang, Jianxin Li, Yitong Yao, Chao Wang, Shuangyong Song

Main category: cs.CL

TL;DR: UniARM framework uses MoSLoRA for multi-objective alignment via shared feature extraction and preference modulation, enabling precise control over preference trade-offs in frozen LLMs.

Motivation: Existing multi-objective alignment methods using autoregressive reward models (ARMs) have limitations: independent training neglects preference interactions, while single ARMs with separate modules cause feature entanglement, leading to misalignment with user preferences.

Method: Proposes MoSLoRA (Preference-Modulated & Shared Low-Rank Adaptation) that extracts shared features via preference-agnostic module, then applies affine transformations via preference modulation module conditioned on mixed preference vectors. Builds UniARM framework that jointly models all preference dimensions in single parameter space.

Result: Mitigates feature entanglement, enables precise control over preference trade-offs during inference, eliminates need for independent parameters per preference objective, and scales to larger LLMs.

Conclusion: UniARM with MoSLoRA provides effective solution for multi-objective test-time alignment by addressing feature entanglement and enabling precise preference control in frozen LLMs.

Abstract: Multi-objective alignment aims to align LLM responses with multiple human preference objectives. Among existing methods, guiding the generation of frozen LLMs through autoregressive reward models (ARMs) to accomplish multi-objective test-time alignment is a low-cost solution. However, these methods typically rely on independent parameters for each preference objective, either by training ARMs independently across preference dimensions, which neglects interactions among preference features, or by training a single ARM with separate feature extraction modules for each preference, which can cause feature entanglement. Both strategies can result in misalignment between generated outputs and user preferences. To address this limitation, we propose Preference-Modulated & Shared Low-Rank Adaptation (MoSLoRA) for ARM training, which first extracts shared features via a preference-agnostic module and then applies affine transformations to shared features via a preference modulation module conditioned on mixed preference vectors. This design mitigates feature entanglement and enables precise control over preference trade-offs during inference. Building on this, we introduce the Unified Autoregressive Reward Model (UniARM), a novel framework for multi-objective test-time alignment. UniARM jointly models all preference dimensions in a single parameter space, eliminating the need for independent parameters for each preference objective, and scales to larger LLMs, enhancing its practical usability.
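
To illustrate the shared-extraction-plus-modulation design, here is a minimal torch sketch; the placement of the modulation (on the low-rank features), the modulator architecture, and all dimensions are assumptions, not the paper's exact architecture.

```python
import torch

class MoSLoRASketch(torch.nn.Module):
    def __init__(self, d_model: int, rank: int, n_prefs: int):
        super().__init__()
        self.A = torch.nn.Linear(d_model, rank, bias=False)  # shared down-projection
        self.B = torch.nn.Linear(rank, d_model, bias=False)  # shared up-projection
        self.modulator = torch.nn.Linear(n_prefs, 2 * rank)  # -> (scale, shift)

    def forward(self, x: torch.Tensor, pref: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq, d_model]; pref: [batch, n_prefs] mixed preference vector.
        z = self.A(x)                                          # preference-agnostic features
        scale, shift = self.modulator(pref).chunk(2, dim=-1)
        z = z * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)  # affine modulation
        return self.B(z)  # delta added to the frozen layer's output
```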

[27] Comprehensive Comparison of RAG Methods Across Multi-Domain Conversational QA

Klejda Alushi, Jan Strich, Chris Biemann, Martin Semmann

Main category: cs.CL

TL;DR: Comprehensive empirical study comparing RAG methods for multi-turn conversational QA across 8 datasets, finding simple methods like reranking and hybrid BM25 outperform complex techniques.

Motivation: Most existing RAG studies evaluate methods in isolation and focus on single-turn settings, lacking systematic comparison for multi-turn conversational QA where dialogue history, coreference, and shifting user intent complicate retrieval.

Method: Comprehensive empirical study using unified experimental setup across 8 diverse conversational QA datasets, evaluating retrieval quality and answer generation using generator and retrieval metrics, analyzing performance evolution across conversation turns.

Result: Robust yet straightforward methods (reranking, hybrid BM25, HyDE) consistently outperform vanilla RAG, while several advanced techniques fail to yield gains and can degrade performance below No-RAG baseline. Dataset characteristics and dialogue length strongly influence retrieval effectiveness.

Conclusion: Effective conversational RAG depends less on method complexity than on alignment between retrieval strategy and dataset structure. No single RAG strategy dominates across all settings.

Abstract: Conversational question answering increasingly relies on retrieval-augmented generation (RAG) to ground large language models (LLMs) in external knowledge. Yet, most existing studies evaluate RAG methods in isolation and primarily focus on single-turn settings. This paper addresses the lack of a systematic comparison of RAG methods for multi-turn conversational QA, where dialogue history, coreference, and shifting user intent substantially complicate retrieval. We present a comprehensive empirical study of vanilla and advanced RAG methods across eight diverse conversational QA datasets spanning multiple domains. Using a unified experimental setup, we evaluate retrieval quality and answer generation using generator and retrieval metrics, and analyze how performance evolves across conversation turns. Our results show that robust yet straightforward methods, such as reranking, hybrid BM25, and HyDE, consistently outperform vanilla RAG. In contrast, several advanced techniques fail to yield gains and can even degrade performance below the No-RAG baseline. We further demonstrate that dataset characteristics and dialogue length strongly influence retrieval effectiveness, explaining why no single RAG strategy dominates across settings. Overall, our findings indicate that effective conversational RAG depends less on method complexity than on alignment between the retrieval strategy and the dataset structure. We publish the code used: https://github.com/Klejda-A/exp-rag.git
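
As a concrete example of one of the simple methods the study favors, here is a minimal hybrid-BM25 sketch: min-max-normalize sparse and dense scores, then mix them. The normalization and equal default weighting are illustrative assumptions.

```python
def minmax(scores: dict[str, float]) -> dict[str, float]:
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all scores tie
    return {doc: (s - lo) / span for doc, s in scores.items()}

def hybrid_scores(bm25: dict[str, float], dense: dict[str, float],
                  alpha: float = 0.5) -> dict[str, float]:
    """Mix normalized BM25 and dense-retriever scores per document."""
    bm25_n, dense_n = minmax(bm25), minmax(dense)
    docs = set(bm25_n) | set(dense_n)
    return {d: alpha * bm25_n.get(d, 0.0) + (1 - alpha) * dense_n.get(d, 0.0)
            for d in docs}
```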

[28] Advancing Block Diffusion Language Models for Test-Time Scaling

Yi Lu, Deyang Kong, Jianing Wang, Linsen Guo, Xue Wang, Qi Guo, Tao Gui, Xuanjing Huang, Wei Ye, Shikun Zhang, Wei Wang

Main category: cs.CL

TL;DR: A unified framework for test-time scaling in Block Diffusion Language Models (BDLMs) with adaptive decoding and block-wise generation strategies to improve reasoning efficiency and effectiveness.

Motivation: Existing BDLMs have limited exploration under test-time scaling and face severe decoding challenges in long Chain-of-Thought reasoning, particularly in balancing decoding speed and effectiveness.

Method: Proposes Bounded Adaptive Confidence Decoding (BACD) for difficulty-aware sampling that dynamically adjusts denoising based on model confidence, and Think Coarse, Critic Fine (TCCF) paradigm that allocates large block sizes for exploratory reasoning and smaller blocks for refinement, with Progressive Block Size Extension to mitigate performance degradation when scaling block sizes.

Result: Applying BACD and TCCF to TDAR-8B yields significant improvements: 2.26x speedup over TraDo-8B and +11.2 points on AIME24 benchmark.

Conclusion: The framework marks an important step toward unlocking the potential of BDLMs for test-time scaling in complex reasoning tasks through adaptive decoding and block-wise generation strategies.

Abstract: Recent advances in block diffusion language models have demonstrated competitive performance and strong scalability on reasoning tasks. However, existing BDLMs have limited exploration under the test-time scaling setting and face more severe decoding challenges in long Chain-of-Thought reasoning, particularly in balancing the decoding speed and effectiveness. In this work, we propose a unified framework for test-time scaling in BDLMs that introduces adaptivity in both decoding and block-wise generation. At the decoding level, we propose Bounded Adaptive Confidence Decoding (BACD), a difficulty-aware sampling strategy that dynamically adjusts denoising based on model confidence, accelerating inference while controlling error accumulation. Beyond step-wise adaptivity, we introduce Think Coarse, Critic Fine (TCCF), a test-time scaling paradigm that allocates large block sizes to exploratory reasoning and smaller block sizes to refinement, achieving an effective efficiency-effectiveness balance. To enable efficient and effective decoding with a large block size, we adopt Progressive Block Size Extension, which mitigates performance degradation when scaling block sizes. Extensive experiments show that applying BACD and TCCF to TDAR-8B yields significant improvements over strong baselines such as TraDo-8B (2.26x speedup, +11.2 points on AIME24). These results mark an important step toward unlocking the potential of BDLMs for test-time scaling in complex reasoning tasks.
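
A minimal sketch of bounded, confidence-adaptive unmasking in the spirit of BACD: commit every token whose confidence clears a threshold, but bound the per-step count. The threshold and bounds are illustrative assumptions, not the paper's settings.

```python
import torch

def bacd_select(confidences: torch.Tensor, threshold: float = 0.9,
                k_min: int = 1, k_max: int = 8) -> torch.Tensor:
    """confidences: [n_masked] max token probability at each masked position.
    Returns indices (into the masked set) to unmask at this denoising step."""
    order = confidences.argsort(descending=True)
    n_confident = int((confidences >= threshold).sum())
    k = max(k_min, min(k_max, n_confident))  # bounded adaptivity
    return order[:k]
```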

[29] LEMUR: A Corpus for Robust Fine-Tuning of Multilingual Law Embedding Models for Retrieval

Narges Baba Ahmadi, Jan Strich, Martin Semmann, Chris Biemann

Main category: cs.CL

TL;DR: LEMUR: A large-scale multilingual corpus of EU environmental legislation for improving legal document retrieval, with domain-adapted embedding models showing significant gains, especially for low-resource languages.

Motivation: LLMs are increasingly used for legal information access, but face challenges in multilingual legal settings due to unreliable retrieval, lack of domain-adapted embedding models, noisy PDF extraction, and multilingual corpora not designed for semantic retrieval.

Method: Created LEMUR corpus from 24,953 EUR-Lex PDF documents across 25 languages, measured PDF-to-text fidelity using Lexical Content Score, fine-tuned three multilingual embedding models with contrastive objectives in monolingual and bilingual settings.

Result: Legal-domain fine-tuning consistently improves Top-k retrieval accuracy across languages, with particularly pronounced gains for low-resource languages. Cross-lingual evaluations show improvements transfer to unseen languages, indicating enhanced language-independent legal representations.

Conclusion: LEMUR addresses multilingual legal retrieval challenges, demonstrating that domain adaptation improves embedding quality, especially for low-resource languages, and that legal representations become more language-independent through fine-tuning.

Abstract: Large language models (LLMs) are increasingly used to access legal information. Yet, their deployment in multilingual legal settings is constrained by unreliable retrieval and the lack of domain-adapted, open-embedding models. In particular, existing multilingual legal corpora are not designed for semantic retrieval, and PDF-based legislative sources introduce substantial noise due to imperfect text extraction. To address these challenges, we introduce LEMUR, a large-scale multilingual corpus of EU environmental legislation constructed from 24,953 official EUR-Lex PDF documents covering 25 languages. We quantify the fidelity of PDF-to-text conversion by measuring lexical consistency against authoritative HTML versions using the Lexical Content Score (LCS). Building on LEMUR, we fine-tune three state-of-the-art multilingual embedding models using contrastive objectives in both monolingual and bilingual settings, reflecting realistic legal-retrieval scenarios. Experiments across low- and high-resource languages demonstrate that legal-domain fine-tuning consistently improves Top-k retrieval accuracy relative to strong baselines, with particularly pronounced gains for low-resource languages. Cross-lingual evaluations show that these improvements transfer to unseen languages, indicating that fine-tuning primarily enhances language-independent, content-level legal representations rather than language-specific cues. We publish our code (https://github.com/nargesbh/eur_lex) and data (https://huggingface.co/datasets/G4KMU/LEMUR).
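
As a sketch of the kind of contrastive objective used for such embedding fine-tuning (the in-batch-negatives formulation and temperature below are common defaults, assumed here rather than taken from the paper):

```python
import torch
import torch.nn.functional as F

def info_nce(query_emb: torch.Tensor, doc_emb: torch.Tensor,
             temperature: float = 0.05) -> torch.Tensor:
    """In-batch-negatives contrastive loss: the i-th query should match the
    i-th legal passage and repel all other passages in the batch."""
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature          # [batch, batch] similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```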

[30] Aligning Tree-Search Policies with Fixed Token Budgets in Test-Time Scaling of LLMs

Sora Miyamoto, Daisuke Oba, Naoaki Okazaki

Main category: cs.CL

TL;DR: Budget-Guided MCTS improves tree-search decoding for LLMs by dynamically adapting search strategy to remaining token budget, preventing late-stage over-branching and premature termination.

Motivation: Real-world LLM deployments have fixed per-query token budgets that vary across settings, but existing tree-search policies treat budget only as a termination condition, leading to inefficient search patterns like late-stage over-branching or premature termination.

Method: Proposes Budget-Guided MCTS (BG-MCTS) that aligns search policy with remaining token budget: starts with broad exploration, then prioritizes refinement and answer completion as budget depletes while reducing late-stage branching from shallow nodes.

Result: BG-MCTS consistently outperforms budget-agnostic tree-search baselines across different budgets on MATH500 and AIME24/25 benchmarks with open-weight LLMs.

Conclusion: Budget-aware tree-search decoding significantly improves LLM performance under constrained token budgets, demonstrating the importance of aligning search strategy with resource constraints.

Abstract: Tree-search decoding is an effective form of test-time scaling for large language models (LLMs), but real-world deployment imposes a fixed per-query token budget that varies across settings. Existing tree-search policies are largely budget-agnostic, treating the budget as a termination condition, which can lead to late-stage over-branching or premature termination. We propose Budget-Guided MCTS (BG-MCTS), a tree-search decoding algorithm that aligns its search policy with the remaining token budget: it starts with broad exploration, then prioritizes refinement and answer completion as the budget depletes while reducing late-stage branching from shallow nodes. BG-MCTS consistently outperforms budget-agnostic tree-search baselines across different budgets on MATH500 and AIME24/25 with open-weight LLMs.
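
One way to picture budget-aligned selection is to scale UCT's exploration bonus by the fraction of the token budget remaining, so the search branches broadly early and converges on completing strong answers late. The scaling rule below is an illustrative assumption, not BG-MCTS's actual policy.

```python
import math

def budget_guided_uct(value: float, visits: int, parent_visits: int,
                      tokens_used: int, token_budget: int, c: float = 1.4) -> float:
    remaining = max(0.0, 1.0 - tokens_used / token_budget)
    exploration = c * remaining * math.sqrt(math.log(parent_visits + 1) / (visits + 1))
    return value + exploration  # the exploration bonus shrinks as the budget depletes
```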

[31] Context-Aware Counterfactual Data Augmentation for Gender Bias Mitigation in Language Models

Shweta Parihar, Liu Guangliang, Natalie Parde, Lu Cheng

Main category: cs.CL

TL;DR: Context-CDA: A context-augmented counterfactual data augmentation method that uses large LMs to generate diverse, contextually relevant debiasing data while preserving language modeling capability through uncertainty-based filtering.

Motivation: Traditional counterfactual data augmentation (CDA) for debiasing language models often creates synthetic data that misaligns with real-world distributions or produces overly simplistic counterfactuals that ignore social context, potentially harming language modeling capability and downstream performance.

Method: Proposes Context-CDA which uses large language models to enhance diversity and contextual relevance of debiasing corpus by augmenting context, then applies uncertainty-based filtering to exclude low-quality counterfactuals as judged by target smaller LMs to be debiased.

Result: Experimental results on gender bias benchmarks show Context-CDA effectively mitigates bias without sacrificing language modeling performance, and provides insights into social biases through analysis of distribution shifts in next-token generation probabilities.

Conclusion: Context-CDA offers a simple yet effective approach to debiasing that maintains language modeling capability by ensuring better alignment between debiasing corpus and pretraining data through context augmentation and quality filtering.

Abstract: A challenge in mitigating social bias in fine-tuned language models (LMs) is the potential reduction in language modeling capability, which can harm downstream performance. Counterfactual data augmentation (CDA), a widely used method for fine-tuning, highlights this issue by generating synthetic data that may align poorly with real-world distributions or creating overly simplistic counterfactuals that ignore the social context of altered sensitive attributes (e.g., gender) in the pretraining corpus. To address these limitations, we propose a simple yet effective context-augmented CDA method, Context-CDA, which uses large LMs to enhance the diversity and contextual relevance of the debiasing corpus. By minimizing discrepancies between the debiasing corpus and pretraining data through augmented context, this approach ensures better alignment, enhancing language modeling capability. We then employ uncertainty-based filtering to exclude generated counterfactuals considered low-quality by the target smaller LMs (i.e., LMs to be debiased), further improving the fine-tuning corpus quality. Experimental results on gender bias benchmarks demonstrate that Context-CDA effectively mitigates bias without sacrificing language modeling performance while offering insights into social biases by analyzing distribution shifts in next-token generation probabilities.
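
A minimal sketch of uncertainty-based filtering, assuming a Hugging Face-style causal LM interface and using perplexity under the target model as the uncertainty proxy; both the proxy and the threshold are illustrative assumptions.

```python
import torch

@torch.no_grad()
def perplexity(model, tokenizer, text: str) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # mean next-token negative log-likelihood
    return float(torch.exp(loss))

def filter_counterfactuals(model, tokenizer, candidates: list[str],
                           max_ppl: float = 50.0) -> list[str]:
    """Keep only counterfactuals the target (to-be-debiased) LM finds plausible."""
    return [c for c in candidates if perplexity(model, tokenizer, c) <= max_ppl]
```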

[32] On the Optimal Reasoning Length for RL-Trained Language Models

Daisuke Nohara, Taishi Nakamura, Rio Yokota

Main category: cs.CL

TL;DR: Comparing length control methods for RL-trained LLMs shows length penalties can hinder reasoning, while proper tuning improves efficiency for models with strong prior reasoning capabilities.

Motivation: While RL improves reasoning in LLMs, it lengthens chain-of-thought outputs and increases computational costs. Existing length control methods lack clarity on optimal output length for balancing efficiency and performance.

Method: Compare several length control methods on Qwen3-1.7B Base and DeepSeek-R1-Distill-Qwen-1.5B models, extending prior work to RL-trained policies to identify failure modes.

Result: Length penalties may hinder reasoning acquisition, but properly tuned length control can improve efficiency for models with strong prior reasoning. Identified two failure modes: long outputs increase dispersion, short outputs lead to under-thinking.

Conclusion: Optimal length control is crucial for RL-trained LLMs, with different approaches needed based on model capabilities. Proper tuning can balance efficiency and performance without sacrificing reasoning quality.

Abstract: Reinforcement learning substantially improves reasoning in large language models, but it also tends to lengthen chain-of-thought outputs and increase computational cost during both training and inference. Though length control methods have been proposed, it remains unclear what the optimal output length is for balancing efficiency and performance. In this work, we compare several length control methods on two models, Qwen3-1.7B Base and DeepSeek-R1-Distill-Qwen-1.5B. Our results indicate that length penalties may hinder reasoning acquisition, while properly tuned length control can improve efficiency for models with strong prior reasoning. By extending prior work to RL-trained policies, we identify two failure modes: 1) long outputs increase dispersion, and 2) short outputs lead to under-thinking.
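
For reference, a typical length-penalized RL reward of the kind compared here subtracts a penalty on tokens beyond a target length from the correctness reward; the penalty form and coefficients below are illustrative assumptions.

```python
def length_penalized_reward(is_correct: bool, n_tokens: int,
                            target_len: int = 2048, lam: float = 1e-4) -> float:
    reward = 1.0 if is_correct else 0.0
    overage = max(0, n_tokens - target_len)  # only output past the target is penalized
    return reward - lam * overage
```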

[33] Learning from the Irrecoverable: Error-Localized Policy Optimization for Tool-Integrated LLM Reasoning

Qiao Liang, Yuke Zhu, Chao Ge, Lei Yang, Ying Shen, Bo Zheng, Sheng Guo

Main category: cs.CL

TL;DR: ELPO improves LLM agent training by localizing first irrecoverable errors in tool-integrated reasoning trajectories and applying targeted policy optimization.

Motivation: Current outcome-only reinforcement learning for LLM agents in tool-integrated reasoning suffers from sparse rewards and weak credit assignment, especially when early irrecoverable mistakes determine overall success or failure.

Method: Proposes Error-Localized Policy Optimization (ELPO) which: 1) localizes first irrecoverable step via binary-search rollout trees, 2) converts tree into learning signals through hierarchical advantage attribution, and 3) applies error-localized adaptive clipping to strengthen corrective updates.

Result: ELPO outperforms strong Agentic RL baselines across TIR benchmarks in math, science QA, and code execution, with additional gains in Pass@K and Major@K scaling, rollout ranking quality, and tool-call efficiency.

Conclusion: ELPO effectively addresses credit assignment challenges in long-horizon tool-integrated reasoning by localizing critical errors and applying targeted policy optimization, leading to improved agent performance.

Abstract: Tool-integrated reasoning (TIR) enables LLM agents to solve tasks through planning, tool use, and iterative revision, but outcome-only reinforcement learning in this setting suffers from sparse, delayed rewards and weak step-level credit assignment. In long-horizon TIR trajectories, an early irrecoverable mistake can determine success or failure, making it crucial to localize the first irrecoverable step and leverage it for fine-grained credit assignment. We propose Error-Localized Policy Optimization (ELPO), which localizes the first irrecoverable step via binary-search rollout trees under a fixed rollout budget, converts the resulting tree into stable learning signals through hierarchical advantage attribution, and applies error-localized adaptive clipping to strengthen corrective updates on the critical step and its suffix. Across TIR benchmarks in math, science QA, and code execution, ELPO consistently outperforms strong Agentic RL baselines under comparable sampling budgets, with additional gains in Pass@K and Major@K scaling, rollout ranking quality, and tool-call efficiency. Our code will be publicly released soon.
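
The localization step can be pictured as a binary search over prefix lengths of a failed trajectory, assuming recoverability is monotone (once irrecoverable, always irrecoverable). The `is_recoverable` test below is a hypothetical stand-in for ELPO's rollout-tree procedure under a fixed rollout budget.

```python
def first_irrecoverable_step(steps: list[str], is_recoverable) -> int:
    """Binary-search a failed trajectory for its first irrecoverable step.

    is_recoverable(prefix) should roll out completions from the prefix and
    return True if any rollout succeeds. Assumes the empty prefix is
    recoverable and recoverability is monotone along the trajectory.
    """
    lo, hi = 0, len(steps) - 1  # largest recoverable prefix length lies in [0, n-1]
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if is_recoverable(steps[:mid]):
            lo = mid         # still recoverable after `mid` steps
        else:
            hi = mid - 1
    return lo + 1            # 1-based index of the first irrecoverable step
```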

[34] AlignTune: Modular Toolkit for Post-Training Alignment of Large Language Models

R E Zera Marveen Lyngkhoi, Chirag Chawla, Pratinav Seth, Utsav Avaiya, Soham Bhattacharjee, Mykola Khandoga, Rui Yuan, Vinay Kumar Sankarapu

Main category: cs.CL

TL;DR: AlignTune is a modular toolkit that provides a unified interface for LLM alignment (SFT and RLHF) with interchangeable backends, addressing reproducibility issues in alignment research.

Motivation: Current LLM alignment workflows are fragmented across backend-specific tools and ad-hoc glue code, making experiments hard to reproduce. Key obstacles include backend interference, reward fragmentation, and irreproducible pipelines.

Method: AlignTune exposes a unified interface for supervised fine-tuning (SFT) and RLHF-style optimization with interchangeable TRL and Unsloth backends. It standardizes configuration, provides an extensible reward layer (rule-based and learned), and integrates evaluation over standard benchmarks and custom tasks.

Result: By isolating backend-specific logic behind a single factory boundary, AlignTune enables controlled comparisons and reproducible alignment experiments.

Conclusion: AlignTune addresses key reproducibility challenges in alignment research by providing a modular, unified toolkit that standardizes workflows and enables controlled comparisons across different backends.

Abstract: Post-training alignment is central to deploying large language models (LLMs), yet practical workflows remain split across backend-specific tools and ad-hoc glue code, making experiments hard to reproduce. We identify backend interference, reward fragmentation, and irreproducible pipelines as key obstacles in alignment research. We introduce AlignTune, a modular toolkit exposing a unified interface for supervised fine-tuning (SFT) and RLHF-style optimization with interchangeable TRL and Unsloth backends. AlignTune standardizes configuration, provides an extensible reward layer (rule-based and learned), and integrates evaluation over standard benchmarks and custom tasks. By isolating backend-specific logic behind a single factory boundary, AlignTune enables controlled comparisons and reproducible alignment experiments.

[35] MILE-RefHumEval: A Reference-Free, Multi-Independent LLM Framework for Human-Aligned Evaluation

Nalin Srun, Parisa Rastin, Guénaël Cabanes, Lydia Boudjeloud Assala

Main category: cs.CL

TL;DR: MILE-RefHumEval is a reference-free framework for evaluating LLMs without ground-truth annotations, using an ensemble of independently prompted evaluators guided by human-aligned schemas for flexible and scalable assessment.

Motivation: Current LLM evaluation methods often require ground-truth annotations or evaluator coordination, which can be resource-intensive and inflexible. There's a need for reference-free evaluation frameworks that can provide robust assessments without these constraints.

Method: The framework uses an ensemble of independently prompted evaluators guided by a human-aligned schema. It supports both discrete and continuous scoring judgments and employs task-specific prompts across various domains including candidate selection, summarization, image captioning, and dialogue.

Result: Experiments show that MILE-RefHumEval aligns closely with human judgments, outperforms prior evaluation methods, and reduces computational overhead while providing flexible, interpretable, and scalable assessments.

Conclusion: MILE-RefHumEval offers an efficient, robust, and human-aligned solution for real-world LLM evaluation that doesn’t require ground-truth annotations or evaluator coordination, making it practical for various applications.

Abstract: We introduce MILE-RefHumEval, a reference-free framework for evaluating Large Language Models (LLMs) without ground-truth annotations or evaluator coordination. It leverages an ensemble of independently prompted evaluators guided by a human-aligned schema, supporting both discrete and continuous scoring judgments. With task-specific prompts spanning best-candidate selection, summarization, image captioning, and dialogue, MILE-RefHumEval provides flexible, interpretable, and scalable assessments. Experiments show it aligns closely with human judgments, outperforms prior methods, and reduces computational overhead, offering an efficient, robust, and human-aligned solution for real-world LLM evaluation.

[36] MATA: Multi-Agent Framework for Reliable and Flexible Table Question Answering

Sieun Hyeon, Jusang Oh, Sunghwan Steve Cho, Jaeyoung Do

Main category: cs.CL

TL;DR: MATA is a multi-agent TableQA framework that uses multiple reasoning paths and small language model tools to improve table question answering reliability and efficiency while reducing expensive LLM calls.

Motivation: Current LLMs face challenges in table understanding tasks regarding reliability, scalability, and efficiency, especially in resource-constrained or privacy-sensitive environments where excessive LLM inference is problematic.

Method: MATA employs a multi-agent framework with complementary reasoning paths and tools built with small language models. It generates diverse candidate answers through different reasoning styles, then refines/selects optimal answers using these tools while minimizing expensive LLM agent calls.

Result: Extensive experiments on two benchmarks with ten different LLMs show MATA achieves state-of-the-art accuracy and highly efficient reasoning while avoiding excessive LLM inference, maintaining strong performance with small open-source models.

Conclusion: Careful orchestration of multiple reasoning pathways yields scalable and reliable TableQA, demonstrating that multi-agent approaches with small language model tools can achieve efficient and accurate table understanding.

Abstract: Recent advances in Large Language Models (LLMs) have significantly improved table understanding tasks such as Table Question Answering (TableQA), yet challenges remain in ensuring reliability, scalability, and efficiency, especially in resource-constrained or privacy-sensitive environments. In this paper, we introduce MATA, a multi-agent TableQA framework that leverages multiple complementary reasoning paths and a set of tools built with small language models. MATA generates candidate answers through diverse reasoning styles for a given table and question, then refines or selects the optimal answer with the help of these tools. Furthermore, it incorporates an algorithm designed to minimize expensive LLM agent calls, enhancing overall efficiency. MATA maintains strong performance with small, open-source models and adapts easily across various LLM types. Extensive experiments on two benchmarks of varying difficulty with ten different LLMs demonstrate that MATA achieves state-of-the-art accuracy and highly efficient reasoning while avoiding excessive LLM inference. Our results highlight that careful orchestration of multiple reasoning pathways yields scalable and reliable TableQA. The code is available at https://github.com/AIDAS-Lab/MATA.

[37] Life Cycle-Aware Evaluation of Knowledge Distillation for Machine Translation: Environmental Impact and Translation Quality Trade-offs

Joseph Attieh, Timothee Mickus, Anne-Laure Ligozat, Aurélie Névéol, Jörg Tiedemann

Main category: cs.CL

TL;DR: Knowledge distillation methods for machine translation are evaluated using both translation quality and computational cost measured as carbon footprint, revealing trade-offs between distillation overhead and inference costs at different deployment scales.

Motivation: Current knowledge distillation studies in machine translation typically only report translation quality without considering computational complexity, making it difficult to select appropriate KD methods under compute-induced constraints.

Method: Evaluate representative KD methods by measuring both translation quality and computational cost expressed as carbon footprint using machine learning life cycle assessment (MLCA) tool, accounting for runtime operational emissions and amortized hardware production costs throughout the KD model life cycle (teacher training, distillation, and inference).

Result: (i) Distillation overhead dominates total footprint at small deployment volumes, (ii) inference dominates at scale, making KD beneficial only beyond a task-dependent usage threshold, and (iii) word-level distillation typically offers more favorable footprint-quality trade-offs than sequence-level distillation.

Conclusion: The protocol provides reproducible guidance for selecting KD methods under explicit quality and compute-induced constraints, emphasizing the importance of considering both computational cost and quality in KD method selection.

Abstract: Knowledge distillation (KD) is a tool to compress a larger system (teacher) into a smaller one (student). In machine translation, studies typically report only the translation quality of the student and omit the computational complexity of performing KD, making it difficult to select among the many available KD choices under compute-induced constraints. In this study, we evaluate representative KD methods by considering both translation quality and computational cost. We express computational cost as a carbon footprint using the machine learning life cycle assessment (MLCA) tool. This assessment accounts for runtime operational emissions and amortized hardware production costs throughout the KD model life cycle (teacher training, distillation, and inference). We find that (i) distillation overhead dominates the total footprint at small deployment volumes, (ii) inference dominates at scale, making KD beneficial only beyond a task-dependent usage threshold, and (iii) word-level distillation typically offers more favorable footprint-quality trade-offs than sequence-level distillation. Our protocol provides reproducible guidance for selecting KD methods under explicit quality and compute-induced constraints.
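
For concreteness, here is a standard word-level distillation loss (per-token KL between teacher and student next-token distributions); the temperature and reduction are common defaults, assumed rather than taken from the paper.

```python
import torch.nn.functional as F
from torch import Tensor

def word_level_kd_loss(student_logits: Tensor, teacher_logits: Tensor,
                       temperature: float = 2.0) -> Tensor:
    # logits: [batch, seq_len, vocab]; loss is KL(teacher || student) per token.
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.log_softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, log_target=True, reduction="batchmean") * temperature ** 2
```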

[38] Maastricht University at AMIYA: Adapting LLMs for Dialectal Arabic using Fine-tuning and MBR Decoding

Abdulhai Alali, Abderrahmane Issam

Main category: cs.CL

TL;DR: Fine-tuning LLMs with LoRA on dialect data, adapter merging, and dialect-aware MBR decoding improves Arabic dialect generation and translation while preserving semantic accuracy.

Motivation: LLMs support many languages but dialects are underrepresented due to limited data and linguistic variation. The paper aims to improve dialectal performance for Arabic variants.

Method: Uses Low Rank Adaptation (LoRA) fine-tuning on monolingual and English-dialect parallel data, adapter merging, and dialect-aware Minimum Bayes Risk (MBR) decoding to enhance dialectal fidelity.

Result: Experiments on Syrian, Moroccan, and Saudi Arabic show that merging and MBR improve dialectal fidelity while preserving semantic accuracy.

Conclusion: The combination provides a compact and effective framework for robust dialectal Arabic generation.

Abstract: Large Language Models (LLMs) are becoming increasingly multilingual, supporting hundreds of languages, especially high-resource ones. Unfortunately, dialect variations are still underrepresented due to limited data and linguistic variation. In this work, we adapt a pre-trained LLM to improve dialectal performance. Specifically, we use Low-Rank Adaptation (LoRA) fine-tuning on monolingual and English-dialect parallel data, adapter merging, and dialect-aware MBR decoding to improve dialectal fidelity in generation and translation. Experiments on Syrian, Moroccan, and Saudi Arabic show that merging and MBR improve dialectal fidelity while preserving semantic accuracy. This combination provides a compact and effective framework for robust dialectal Arabic generation.
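
MBR decoding itself is standard: sample candidates, then pick the one with the highest expected utility against the others. A dialect-aware variant would use a utility that also rewards dialect fidelity; the utility is left abstract in the sketch below, which is an illustration rather than the authors' implementation.

```python
def mbr_decode(candidates: list[str], utility) -> str:
    """utility(hypothesis, pseudo_reference) -> float, e.g. chrF, optionally
    weighted by a dialect-identification score for a dialect-aware variant."""
    def expected_utility(i: int) -> float:
        refs = candidates[:i] + candidates[i + 1:]
        return sum(utility(candidates[i], r) for r in refs) / max(1, len(refs))
    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]
```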

[39] TraceMem: Weaving Narrative Memory Schemata from User Conversational Traces

Yiming Shu, Pei Liu, Tiange Zhang, Ruiyang Gao, Jun Ma, Chen Sun

Main category: cs.CL

TL;DR: TraceMem is a cognitively-inspired memory framework for LLMs that creates structured narrative memory schemata from conversational traces to enable long-term interaction capabilities.

Motivation: LLMs struggle with long-term interactions due to limited context windows, and existing memory systems fail to capture narrative coherence in dialogue streams.

Method: Three-stage pipeline: (1) Short-term Memory Processing for topic segmentation, (2) Synaptic Memory Consolidation for episode summarization into user traces, (3) Systems Memory Consolidation using hierarchical clustering to organize traces into narrative threads.

Result: Achieves SOTA on LoCoMo benchmark, excels in multi-hop and temporal reasoning, and demonstrates superior narrative comprehension compared to baselines.

Conclusion: TraceMem’s brain-inspired architecture effectively constructs coherent narratives from conversational traces, enabling better long-term interaction capabilities for LLMs.

Abstract: Sustaining long-term interactions remains a bottleneck for Large Language Models (LLMs), as their limited context windows struggle to manage dialogue histories that extend over time. Existing memory systems often treat interactions as disjointed snippets, failing to capture the underlying narrative coherence of the dialogue stream. We propose TraceMem, a cognitively-inspired framework that weaves structured, narrative memory schemata from user conversational traces through a three-stage pipeline: (1) Short-term Memory Processing, which employs a deductive topic segmentation approach to demarcate episode boundaries and extract semantic representation; (2) Synaptic Memory Consolidation, a process that summarizes episodes into episodic memories before distilling them alongside semantics into user-specific traces; and (3) Systems Memory Consolidation, which utilizes two-stage hierarchical clustering to organize these traces into coherent, time-evolving narrative threads under unifying themes. These threads are encapsulated into structured user memory cards, forming narrative memory schemata. For memory utilization, we provide an agentic search mechanism to enhance reasoning process. Evaluation on the LoCoMo benchmark shows that TraceMem achieves state-of-the-art performance with a brain-inspired architecture. Analysis shows that by constructing coherent narratives, it surpasses baselines in multi-hop and temporal reasoning, underscoring its essential role in deep narrative comprehension. Additionally, we provide an open discussion on memory systems, offering our perspectives and future outlook on the field. Our code implementation is available at: https://github.com/YimingShu-teay/TraceMem

[40] Unsupervised Layer-Wise Dynamic Test Time Adaptation for LLMs

Longhuan Xu, Cunjian Chen, Feng Yin

Main category: cs.CL

TL;DR: Dynamic test-time adaptation for LLMs using hypernetwork-predicted per-layer learning rates to stabilize unsupervised adaptation to individual prompts.

Motivation: Unsupervised test-time adaptation (TTA) for LLMs is appealing but unstable with fixed learning rates, causing overfitting to prompt-specific statistics and degradation of generation quality. Current methods lack fine-grained control over adaptation strength.

Method: Proposes layer-wise dynamic TTA framework where a lightweight hypernetwork predicts per-layer, per-step learning-rate multipliers for LoRA parameters. This modulates adaptation strength based on prompt representation, LLM structure, and adaptation step.

Result: Experiments across various datasets and LLMs show the method substantially strengthens TTA by learning effective scaling patterns, improving stability while delivering better performance compared to naive fixed learning rate approaches.

Conclusion: Dynamic adaptation control through hypernetwork-predicted learning rates enables more stable and effective unsupervised test-time adaptation for LLMs, addressing the limitations of fixed-rate approaches.

Abstract: Test-time adaptation (TTA) for large language models (LLMs) updates model parameters at inference time using signals available at deployment. This paper focuses on a common yet under-explored regime: unsupervised, sample-specific TTA, where the model adapts independently for each prompt using only the prompt itself, without gold answers or external supervision. Although appealing, naive unsupervised TTA with a fixed, handcrafted learning rate can be unstable: updates may overfit to prompt-specific statistics, drift from the desired answer distribution, and ultimately degrade generation quality. This failure mode is not surprising, as in this case TTA must adapt to a single prompt within only a few gradient steps, unlike standard training that averages updates over large datasets and long optimization horizons. Therefore, we propose layer-wise dynamic test-time adaptation, a framework which explicitly modulates TTA strength as a function of prompt representation, LLM structure and adaptation step. In our setting, TTA updates only LoRA parameters, and a lightweight hypernetwork predicts per-layer, per-step learning-rate multipliers, enabling fine-grained control. Experiments across various datasets and LLMs consistently show that our method substantially strengthens TTA by learning effective scaling patterns over adaptation steps and transformer layer projections, improving stability while delivering better performance.
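
A minimal torch sketch of the hypernetwork idea: from a prompt representation and the current adaptation step, predict one positive learning-rate multiplier per layer. The feature set and architecture are assumptions, not the paper's exact design.

```python
import torch

class LRHyperNet(torch.nn.Module):
    def __init__(self, prompt_dim: int, n_layers: int, hidden: int = 64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(prompt_dim + 1, hidden),  # +1 feature: adaptation step
            torch.nn.ReLU(),
            torch.nn.Linear(hidden, n_layers),
        )

    def forward(self, prompt_emb: torch.Tensor, step: int) -> torch.Tensor:
        t = torch.full_like(prompt_emb[..., :1], float(step))
        feats = torch.cat([prompt_emb, t], dim=-1)
        return torch.nn.functional.softplus(self.net(feats))  # positive multipliers

# During TTA, layer i's LoRA parameters would be updated with an effective
# learning rate of base_lr * multipliers[i] at each step.
```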

[41] AI-Assisted Scientific Assessment: A Case Study on Climate Change

Christian Buck, Levke Caesar, Michelle Chen Huebscher, Massimiliano Ciaramita, Erich M. Fischer, Zeke Hausfather, Özge Kart Tokmak, Reto Knutti, Markus Leippold, Joseph Ludescher, Katharine J. Mach, Sofia Palazzo Corner, Kasra Rafiezadeh Shahi, Johan Rockström, Joeri Rogelj, Boris Sakschewski

Main category: cs.CL

TL;DR: AI co-scientist system tested for collaborative scientific assessment in climate science, showing acceleration of workflow but requiring substantial expert oversight and additions for scientific rigor.

DetailsMotivation: To evaluate whether AI can effectively support collaborative scientific assessment in domains where repeated evaluation is impossible and ground truth depends on consensus synthesis of theory and evidence, moving beyond the 'guess and check' paradigm of AI co-scientists.

Method: Developed a Gemini-based AI environment integrated into standard scientific workflow, tested with 13 climate scientists on the complex topic of Atlantic Meridional Overturning Circulation (AMOC) stability, analyzing AI contributions versus expert inputs across 104 revision cycles.

Result: AI significantly accelerated the scientific workflow: the group synthesized 79 papers through 104 revisions in 46 person-hours. Most AI-generated content was retained, and AI helped maintain logical consistency and presentation quality. However, less than half of the report was AI-generated, and substantial expert oversight was needed to elevate the content to rigorous scientific standards.

Conclusion: AI can accelerate scientific workflows and contribute meaningfully to collaborative assessment, but cannot replace expert judgment and oversight needed to ensure scientific rigor and acceptability in complex domains where ground truth depends on consensus synthesis.

Abstract: The emerging paradigm of AI co-scientists focuses on tasks characterized by repeatable verification, where agents explore search spaces in ‘guess and check’ loops. This paradigm does not extend to problems where repeated evaluation is impossible and ground truth is established by the consensus synthesis of theory and existing evidence. We evaluate a Gemini-based AI environment designed to support collaborative scientific assessment, integrated into a standard scientific workflow. In collaboration with a diverse group of 13 scientists working in the field of climate science, we tested the system on a complex topic: the stability of the Atlantic Meridional Overturning Circulation (AMOC). Our results show that AI can accelerate the scientific workflow. The group produced a comprehensive synthesis of 79 papers through 104 revision cycles in just over 46 person-hours. AI contribution was significant: most AI-generated content was retained in the report. AI also helped maintain logical consistency and presentation quality. However, expert additions were crucial to ensure its acceptability: less than half of the report was produced by AI. Furthermore, substantial oversight was required to expand and elevate the content to rigorous scientific standards.

[42] Targum – A Multilingual New Testament Translation Corpus

Maciej Rapacz, Aleksander Smywiński-Pohl

Main category: cs.CL

TL;DR: A multilingual corpus of 657 New Testament translations with 352 unique versions across 5 languages, featuring manual metadata annotation for flexible translation history analysis.

DetailsMotivation: Existing biblical translation corpora prioritize linguistic breadth over depth, failing to capture rich translation histories of European languages. There's a need for resources enabling both micro-level analysis of translation families and macro-level studies of unique translations.

Method: Aggregated 657 New Testament translations from 12 online biblical libraries and one preexisting corpus. Manually annotated each translation with metadata mapping to standardized identifiers for work, edition, and revision year. Focused on five languages: English (208 unique from 396 total), French (41/78), Italian (18/33), Polish (30/48), and Spanish (55/102).

Result: Created the first resource designed for flexible, multilevel analysis of translation history with unprecedented depth in five European languages. The corpus enables researchers to define “uniqueness” for their specific needs and conduct both micro-level (translation families) and macro-level (deduplicated texts) studies.

Conclusion: This corpus establishes a new benchmark for quantitative study of translation history by providing the first resource specifically designed for flexible, multilevel analysis of biblical translations across multiple European languages.

Abstract: Many European languages possess rich biblical translation histories, yet existing corpora - in prioritizing linguistic breadth - often fail to capture this depth. To address this gap, we introduce a multilingual corpus of 657 New Testament translations, of which 352 are unique, with unprecedented depth in five languages: English (208 unique versions from 396 total), French (41 from 78), Italian (18 from 33), Polish (30 from 48), and Spanish (55 from 102). Aggregated from 12 online biblical libraries and one preexisting corpus, each translation is manually annotated with metadata that maps the text to a standardized identifier for the work, its specific edition, and its year of revision. This canonicalization empowers researchers to define “uniqueness” for their own needs: they can perform micro-level analyses on translation families, such as the KJV lineage, or conduct macro-level studies by deduplicating closely related texts. By providing the first resource designed for such flexible, multilevel analysis, our corpus establishes a new benchmark for the quantitative study of translation history.
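
The standardized identifiers make the micro/macro distinction mechanical. A minimal sketch of both views, with hypothetical field names (the corpus's actual schema may differ):

```python
# Minimal sketch: collapse translation records to unique versions by a
# standardized (work, edition, revision year) identifier.
# Field names here are hypothetical, not the corpus's documented schema.
from collections import defaultdict

records = [
    {"lang": "en", "work": "KJV", "edition": "Cambridge", "year": 1769, "text": "..."},
    {"lang": "en", "work": "KJV", "edition": "Oxford", "year": 1769, "text": "..."},
    {"lang": "en", "work": "KJV", "edition": "Cambridge", "year": 1769, "text": "..."},
]

# Macro-level view: deduplicate on the full identifier.
unique = {(r["work"], r["edition"], r["year"]): r for r in records}
print(len(unique), "unique versions")  # 2

# Micro-level view: group a translation family (e.g., the KJV lineage).
families = defaultdict(list)
for r in records:
    families[r["work"]].append(r)
print(len(families["KJV"]), "records in the KJV family")  # 3
```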

[43] Improving Interpretability of Lexical Semantic Change with Neurobiological Features

Kohei Oda, Hiroya Takamura, Kiyoaki Shirai, Natthawut Kertkeidkachorn

Main category: cs.CL

TL;DR: A method for interpreting Lexical Semantic Change (LSC) by mapping contextualized embeddings to a neurobiological feature space with primitive word features, improving both interpretability and performance.

DetailsMotivation: Most LSC studies focus on estimating change degree but lack interpretability of how word meanings actually change. Enhancing interpretability could provide novel insights into semantic change phenomena.

Method: Map contextualized embeddings from pre-trained language models to a neurobiological feature space where each dimension corresponds to primitive word features (like semantic primitives), enabling systematic human interpretation of semantic changes.

Result: The method outperforms most previous approaches in estimating LSC degree and enables discovery of overlooked LSC types and efficient search for words with specific semantic change patterns.

Conclusion: The neurobiological feature mapping approach provides both superior performance and high interpretability for LSC analysis, enabling new types of semantic change discovery and targeted word searches.

Abstract: Lexical Semantic Change (LSC) is the phenomenon in which the meaning of a word changes over time. Most studies on LSC focus on improving the performance of estimating the degree of LSC; however, it is often difficult to interpret how the meaning of a word changes. Enhancing the interpretability of LSC is a significant challenge as it could lead to novel insights in this field. To tackle this challenge, we propose a method to map the semantic space of contextualized embeddings of words obtained by a pre-trained language model to a neurobiological feature space. In the neurobiological feature space, each dimension corresponds to a primitive feature of words, and its value represents the intensity of that feature. This enables humans to interpret LSC systematically. When employed for the estimation of the degree of LSC, our method demonstrates superior performance in comparison to the majority of the previous methods. In addition, given the high interpretability of the proposed method, several analyses on LSC are carried out. The results demonstrate that our method not only discovers interesting types of LSC that have been overlooked in previous studies but also effectively searches for words with specific types of LSC.
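
A common way to realize such a mapping is a linear projection fit from contextual embeddings to per-feature intensities; the sketch below assumes ridge regression and toy data, which may differ from the paper's actual procedure:

```python
# Minimal sketch: learn a linear map from contextual embeddings to an
# interpretable feature space where each dimension is a primitive word
# feature. Ridge regression is an assumed choice, not the paper's.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 768))   # contextual embeddings of word usages
Y = rng.uniform(size=(500, 64))   # feature intensities (toy stand-ins)

mapper = Ridge(alpha=1.0).fit(X, Y)

# Compare a word's mean feature profile across two time periods.
old_uses = rng.normal(size=(40, 768))
new_uses = rng.normal(size=(40, 768))
shift = mapper.predict(new_uses).mean(0) - mapper.predict(old_uses).mean(0)
top = np.argsort(-np.abs(shift))[:5]
print("features with the largest change:", top, shift[top])
```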

[44] Where Are We At with Automatic Speech Recognition for the Bambara Language?

Seydou Diallo, Yacouba Diarra, Mamadou K. Keita, Panga Azazia Kamaté, Adam Bouno Kampo, Aboubacar Ouattara

Main category: cs.CL

TL;DR: First standardized benchmark for Bambara ASR evaluation using 1 hour of professional recordings, revealing current models perform poorly even in optimal conditions.

DetailsMotivation: To address the lack of standardized evaluation for Automatic Speech Recognition (ASR) in the Bambara language, an underrepresented African language, and to assess current model capabilities in controlled conditions.

Method: Created a benchmark using 1 hour of professionally recorded Malian constitutional text under near-optimal acoustic/linguistic conditions. Evaluated 37 models including Bambara-trained systems and large-scale commercial models using Word Error Rate (WER) and Character Error Rate (CER) metrics.

Result: Current ASR performance remains significantly below deployment standards: best WER was 46.76%, best CER was 13.00%, and several multilingual models exceeded 100% WER. Results show multilingual pre-training and model scaling alone are insufficient for underrepresented languages.

Conclusion: The benchmark reveals critical gaps in ASR for underrepresented languages and provides a foundation for future research. Performance figures represent best-case scenarios and will likely be worse in real-world settings.

Abstract: This paper introduces the first standardized benchmark for evaluating Automatic Speech Recognition (ASR) in the Bambara language, utilizing one hour of professionally recorded Malian constitutional text. Designed as a controlled reference set under near-optimal acoustic and linguistic conditions, the benchmark was used to evaluate 37 models, ranging from Bambara-trained systems to large-scale commercial models. Our findings reveal that current ASR performance remains significantly below deployment standards in a narrow formal domain: the top-performing system achieved a Word Error Rate (WER) of 46.76%, the best Character Error Rate (CER) of 13.00% was set by another model, and several prominent multilingual models exceeded 100% WER. These results suggest that multilingual pre-training and model scaling alone are insufficient for underrepresented languages. Furthermore, because this dataset represents a best-case scenario of the most simplified and formal form of spoken Bambara, these figures are yet to be tested against practical, real-world settings. We provide the benchmark and an accompanying public leaderboard to facilitate transparent evaluation and future research in Bambara speech technology.
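
For readers unfamiliar with why WER can exceed 100%: both WER and CER are edit distances normalized by reference length, so insertion-heavy hypotheses can push the ratio past 1. A self-contained sketch:

```python
# Minimal sketch: WER/CER as Levenshtein distance over tokens/characters,
# normalized by reference length (heavy insertions push WER past 100%).
def edit_distance(ref, hyp):
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def wer(ref: str, hyp: str) -> float:
    r, h = ref.split(), hyp.split()
    return edit_distance(r, h) / len(r)

def cer(ref: str, hyp: str) -> float:
    return edit_distance(ref, hyp) / len(ref)

print(wer("ne bɛ taa so", "ne taa so ka di"))  # 0.75
print(wer("a b", "x y z w"))                   # 2.0 -> WER above 100%
print(cer("sɛbɛn", "sɛben"))                   # 0.2
```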

[45] Decomposing Reasoning Efficiency in Large Language Models

Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya, Benjamin Ricaud

Main category: cs.CL

TL;DR: A framework for evaluating token efficiency in reasoning LLMs that decomposes efficiency into completion rate, conditional correctness, and verbosity, with additional trace-quality measures to separate degenerate from engaged reasoning.

DetailsMotivation: Standard LLM evaluations only report final accuracy, obscuring how tokens are spent or wasted during reasoning. There's a need to understand the trade-off between inference tokens and accuracy to identify where tokens are being used inefficiently.

Method: Introduces a trace-optional framework that decomposes token efficiency into: 1) completion under fixed token budget, 2) conditional correctness given completion, and 3) verbosity. When metadata provides workload proxies, verbosity is further factored into mean verbalization overhead and coupling coefficient. When reasoning traces are available, adds deterministic trace-quality measures (grounding, repetition, prompt copying) to separate degenerate looping from engaged reasoning.

Result: Evaluation of 25 models on CogniLoad shows accuracy and token-efficiency rankings diverge (Spearman ρ=0.63), efficiency gaps are often driven by conditional correctness, and verbalization overhead varies by about 9 times across models (only weakly related to model scale). The decomposition reveals distinct bottleneck profiles suggesting different efficiency interventions.

Conclusion: The proposed framework provides interpretable insights into token efficiency beyond just final accuracy, revealing different types of inefficiencies that require different intervention strategies. It enables more nuanced evaluation of reasoning LLMs by separating token usage patterns and identifying specific bottlenecks.

Abstract: Large language models trained for reasoning trade off inference tokens against accuracy, yet standard evaluations report only final accuracy, obscuring where tokens are spent or wasted. We introduce a trace-optional framework that decomposes token efficiency into interpretable factors: completion under a fixed token budget (avoiding truncation), conditional correctness given completion, and verbosity (token usage). When benchmark metadata provides per-instance workload proxies, we further factor verbosity into two components: mean verbalization overhead (tokens per work unit) and a coupling coefficient capturing how overhead scales with task workload. When reasoning traces are available, we add deterministic trace-quality measures (grounding, repetition, prompt copying) to separate degenerate looping from verbose-but-engaged reasoning, avoiding human labeling and LLM judges. Evaluating 25 models on CogniLoad, we find that accuracy and token-efficiency rankings diverge (Spearman ρ = 0.63), efficiency gaps are often driven by conditional correctness, and verbalization overhead varies by about 9 times (only weakly related to model scale). Our decomposition reveals distinct bottleneck profiles that suggest different efficiency interventions.
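
The decomposition rests on a simple identity: accuracy under a token budget factors into the completion rate times correctness conditional on completion. A toy sketch with invented numbers:

```python
# Minimal sketch of the decomposition with invented numbers:
# accuracy = P(complete within budget) * P(correct | complete)
runs = [
    # (finished_within_budget, correct, tokens_used, work_units)
    (True,  True,  900, 3),
    (True,  False, 400, 2),
    (False, False, 2048, 4),   # truncated at the budget
    (True,  True,  1500, 5),
]

completion_rate = sum(f for f, _, _, _ in runs) / len(runs)         # 0.75
completed = [r for r in runs if r[0]]
cond_correct = sum(c for _, c, _, _ in completed) / len(completed)  # 2/3
accuracy = completion_rate * cond_correct                           # 0.5

# Verbosity factored against a per-instance workload proxy:
# tokens per work unit, averaged over the completed runs.
overheads = [t / w for _, _, t, w in completed]
mean_overhead = sum(overheads) / len(overheads)
print(accuracy, mean_overhead)
```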

[46] AnalyticsGPT: An LLM Workflow for Scientometric Question Answering

Khang Ly, Georgios Cheirmpos, Adrian Raudaschl, Christopher James, Seyed Amin Tabatabaei

Main category: cs.CL

TL;DR: AnalyticsGPT is an LLM-powered workflow for scientometric question answering that addresses meta-scientific questions about the “science of science” using retrieval-augmented generation and agentic concepts.

DetailsMotivation: The paper addresses the underrepresented task of scientometric question answering, which involves meta-scientific questions about research itself. This task poses unique challenges compared to traditional scientific QA, requiring named-entity recognition of academic entities and multi-faceted data retrieval involving scientometric indices like impact factors.

Method: The authors develop an end-to-end system implementing a sequential workflow with retrieval-augmented generation (RAG) and agentic concepts. They use a proprietary research performance assessment platform as the database for RAG, and employ LLMs for task decomposition, planning, and reasoning. The system also addresses synthesizing data into presentable high-level analyses.

Result: The system was evaluated using experienced subject matter experts and LLMs-as-judges. The paper provides insights on LLM efficacy for this niche downstream task, with code and prompts made available on GitHub.

Conclusion: LLMs show great potential for complex scientometric question answering tasks involving planning, reasoning, and data synthesis, demonstrating their applicability beyond traditional NLP tasks to specialized meta-scientific applications.

Abstract: This paper introduces AnalyticsGPT, an intuitive and efficient large language model (LLM)-powered workflow for scientometric question answering. This underrepresented downstream task addresses the subcategory of meta-scientific questions concerning the “science of science.” When compared to traditional scientific question answering based on papers, the task poses unique challenges in the planning phase. Namely, the need for named-entity recognition of academic entities within questions and multi-faceted data retrieval involving scientometric indices, e.g. impact factors. Beyond their exceptional capacity for treating traditional natural language processing tasks, LLMs have shown great potential in more complex applications, such as task decomposition and planning and reasoning. In this paper, we explore the application of LLMs to scientometric question answering, and describe an end-to-end system implementing a sequential workflow with retrieval-augmented generation and agentic concepts. We also address the secondary task of effectively synthesizing the data into presentable and well-structured high-level analyses. As a database for retrieval-augmented generation, we leverage a proprietary research performance assessment platform. For evaluation, we consult experienced subject matter experts and leverage LLMs-as-judges. In doing so, we provide valuable insights on the efficacy of LLMs towards a niche downstream task. Our (skeleton) code and prompts are available at: https://github.com/lyvykhang/llm-agents-scientometric-qa/tree/acl.

[47] Text summarization via global structure awareness

Jiaquan Zhang, Chaoning Zhang, Shuxu Chen, Yibei Liu, Chenghao Li, Qigan Sun, Shuai Yuan, Fachrina Dewi Puspitasari, Dongshen Han, Guoqing Wang, Sung-Ho Bae, Yang Yang

Main category: cs.CL

TL;DR: GloSA-sum is a text summarization method that uses topological data analysis to preserve global document structure and logical dependencies while improving efficiency.

DetailsMotivation: Existing summarization methods focus on model improvements and sentence-level pruning but often overlook global structure, leading to disrupted coherence. LLM-based approaches achieve higher accuracy but incur substantial resource and time costs.

Method: Constructs semantic-weighted graph from sentence embeddings, uses persistent homology to identify core semantics and logical structures preserved in a “protection pool.” Employs topology-guided iterative strategy with lightweight proxy metrics to approximate sentence importance, avoiding repeated high-cost computations. Proposes hierarchical strategy integrating segment-level and global summarization for long texts.

Result: Experiments on multiple datasets show GloSA-sum reduces redundancy while preserving semantic and logical integrity, striking balance between accuracy and efficiency. Benefits LLM downstream tasks by shortening contexts while retaining essential reasoning chains.

Conclusion: GloSA-sum is the first summarization approach achieving global structure awareness via topological data analysis, efficiently summarizing text while preserving semantic cores and logical dependencies.

Abstract: Text summarization is a fundamental task in natural language processing (NLP), and the information explosion has made long-document processing increasingly demanding, making summarization essential. Existing research mainly focuses on model improvements and sentence-level pruning, but often overlooks global structure, leading to disrupted coherence and weakened downstream performance. Some studies employ large language models (LLMs), which achieve higher accuracy but incur substantial resource and time costs. To address these issues, we introduce GloSA-sum, the first summarization approach that achieves global structure awareness via topological data analysis (TDA). GloSA-sum summarizes text efficiently while preserving semantic cores and logical dependencies. Specifically, we construct a semantic-weighted graph from sentence embeddings, where persistent homology identifies core semantics and logical structures, preserved in a “protection pool” as the backbone for summarization. We design a topology-guided iterative strategy, where lightweight proxy metrics approximate sentence importance to avoid repeated high-cost computations, thus preserving structural integrity while improving efficiency. To further enhance long-text processing, we propose a hierarchical strategy that integrates segment-level and global summarization. Experiments on multiple datasets demonstrate that GloSA-sum reduces redundancy while preserving semantic and logical integrity, striking a balance between accuracy and efficiency, and further benefits LLM downstream tasks by shortening contexts while retaining essential reasoning chains.
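
As a concrete (and much simplified) illustration of the TDA ingredient: 0-dimensional persistent homology tracks how connected components of the sentence graph merge as the distance threshold grows, and late-merging components mark long-lived semantic clusters. A self-contained union-find sketch, which is only a stand-in for the paper's richer pipeline:

```python
# Minimal sketch: 0-dimensional persistence over a sentence-distance graph.
# Components that merge late (high "death" threshold) mark long-lived
# semantic clusters; a stand-in for the paper's richer TDA machinery.
import numpy as np

def h0_persistence(dist):
    n = len(dist)
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    edges = sorted((dist[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    deaths = []  # each merge kills one component born at threshold 0
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(w)
    return deaths  # n-1 merge thresholds; the last is the most persistent split

rng = np.random.default_rng(0)
emb = rng.normal(size=(6, 8))  # toy sentence embeddings
d = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
print(h0_persistence(d))
```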

[48] From FusHa to Folk: Exploring Cross-Lingual Transfer in Arabic Language Models

Abdulmuizz Khalak, Abderrahmane Issam, Gerasimos Spanakis

Main category: cs.CL

TL;DR: Arabic language models pretrained on Modern Standard Arabic show disproportionate cross-dialect transfer, with geographic proximity explaining some variation, and evidence of negative interference when training on all dialects.

DetailsMotivation: Arabic language models are primarily trained on Modern Standard Arabic (MSA), but people actually use various Arabic dialects in speech and online communication. There's a need to understand how well these MSA-trained models transfer to different Arabic dialects, which vary in their similarity to MSA.

Method: The study uses probing on 3 Natural Language Processing tasks and representational similarity analysis to examine cross-lingual transfer of Arabic models from MSA to various Arabic dialects.

Result: Transfer from MSA to dialects is possible but disproportionate across different dialects, with geographic proximity partially explaining the variation. There’s evidence of negative interference when models are trained to support all Arabic dialects.

Conclusion: The findings question the degree of similarity between MSA and Arabic dialects, and raise concerns about cross-lingual transfer in Arabic language models, suggesting that dialect-specific approaches may be needed.

Abstract: Arabic Language Models (LMs) are pretrained predominantly on Modern Standard Arabic (MSA) and are expected to transfer to its dialects. While MSA as the standard written variety is commonly used in formal settings, people speak and write online in various dialects that are spread across the Arab region. This poses limitations for Arabic LMs, since its dialects vary in their similarity to MSA. In this work we study cross-lingual transfer of Arabic models using probing on 3 Natural Language Processing (NLP) Tasks, and representational similarity. Our results indicate that transfer is possible but disproportionate across dialects, which we find to be partially explained by their geographic proximity. Furthermore, we find evidence for negative interference in models trained to support all Arabic dialects. This questions their degree of similarity, and raises concerns for cross-lingual transfer in Arabic models.
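
Probing here follows the standard recipe of fitting a small classifier on frozen representations and measuring transfer across varieties. A minimal sketch under that assumption (toy data, not the paper's exact protocol):

```python
# Minimal sketch of representation probing for cross-dialect transfer:
# fit a linear probe on MSA representations, test it on dialect ones.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Frozen LM representations (toy stand-ins) with task labels.
msa_X, msa_y = rng.normal(size=(400, 256)), rng.integers(0, 2, 400)
dial_X, dial_y = rng.normal(size=(200, 256)), rng.integers(0, 2, 200)

probe = LogisticRegression(max_iter=1000).fit(msa_X, msa_y)
in_variety = probe.score(msa_X[:200], msa_y[:200])
transfer = probe.score(dial_X, dial_y)
print(f"MSA probe accuracy: {in_variety:.2f}, dialect transfer: {transfer:.2f}")
# The gap (in_variety - transfer), compared across dialects, quantifies
# how disproportionate the transfer is.
```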

[49] LLM Reasoning Predicts When Models Are Right: Evidence from Coding Classroom Discourse

Bakhtawar Ahtisham, Kirk Vanacore, Zhuqian Zhou, Jinsook Lee, Rene F. Kizilcec

Main category: cs.CL

TL;DR: LLM-generated reasoning can predict correctness of LLM predictions in educational dialogue analysis using TF-IDF encoding and supervised classifiers.

DetailsMotivation: Current LLM pipelines for educational dialogue analysis lack reliable error detection methods, creating a need for ways to identify when models are wrong.

Method: Analyzed 30,300 teacher utterances labeled by LLMs with instructional moves and reasoning. Used TF-IDF to encode reasoning, evaluated five supervised classifiers, and examined linguistic markers using LIWC framework.

Result: Random Forest achieved F1=0.83, successfully identifying most incorrect predictions. Correct predictions showed grounded causal language, while incorrect reasoning relied on epistemic hedging and performative metacognition.

Conclusion: Reasoning-based error detection offers practical, scalable quality control for automated educational dialogue analysis, with construct-specific detectors improving performance.

Abstract: Large Language Models (LLMs) are increasingly deployed to automatically label and analyze educational dialogue at scale, yet current pipelines lack reliable ways to detect when models are wrong. We investigate whether reasoning generated by LLMs can be used to predict the correctness of a model’s own predictions. We analyze 30,300 teacher utterances from classroom dialogue, each labeled by multiple state-of-the-art LLMs with an instructional move construct and an accompanying reasoning. Using human-verified ground-truth labels, we frame the task as predicting whether a model’s assigned label for a given utterance is correct. We encode LLM reasoning using Term Frequency-Inverse Document Frequency (TF-IDF) and evaluate five supervised classifiers. A Random Forest classifier achieves an F1 score of 0.83 (Recall = 0.854), successfully identifying most incorrect predictions and outperforming baselines. Training specialist detectors for specific instructional move constructs further improves performance on difficult constructs, indicating that error detection benefits from construct-specific linguistic cues. Using the Linguistic Inquiry and Word Count (LIWC) framework, we examine four linguistic markers of correctness: Causation, Differentiation, Tentativeness, and Insight. Correct predictions exhibit grounded causal language (e.g., because, therefore), while incorrect reasoning is substantially more likely to rely on epistemic hedging (e.g., might, could) and performative metacognition (e.g., think, realize). Syntactic complexity does not distinguish correct from incorrect reasoning, and longer reasoning is not more reliable. These findings demonstrate that reasoning-based error detection offers a practical and scalable approach to quality control in automated educational dialogue analysis.
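
The core pipeline is standard scikit-learn and easy to reproduce in miniature; the toy data below is illustrative, not the study's corpus:

```python
# Minimal sketch of the error-detection pipeline: TF-IDF over the model's
# reasoning text, then a Random Forest predicting label correctness.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

reasonings = [
    "The teacher asks a follow-up because the student gave evidence.",
    "This might be an evaluation, though it could also be a prompt.",
    "The utterance restates the claim, therefore it is a revoicing move.",
    "I think the speaker may realize this is possibly an open question.",
]
is_correct = [1, 0, 1, 0]  # human-verified: was the LLM's label right?

clf = make_pipeline(TfidfVectorizer(), RandomForestClassifier(random_state=0))
clf.fit(reasonings, is_correct)
print(clf.predict(["The move is a press for reasoning because she asks why."]))
```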

[50] How Do People Quantify Naturally: Evidence from Mandarin Picture Description

Yayun Zhang, Guanyi Chen, Fahime Same, Saad Mahamood, Tingting He

Main category: cs.CL

TL;DR: Study examines how Mandarin Chinese speakers naturally quantify objects in picture descriptions without explicit counting instructions, analyzing effects of numerosity, animacy, and modality on quantification behavior.

DetailsMotivation: To understand how speakers decide whether and how to quantify in naturalistic language production, particularly in Mandarin Chinese, since quantification is fundamental to everyday language but little is known about spontaneous quantification decisions.

Method: Used picture-based elicited description task where speakers freely described scenes with multiple objects without explicit quantification instructions. Analyzed spoken and written modalities, examining three aspects: whether to quantify, precision of quantification, and quantificational strategies adopted.

Result: Object numerosity, animacy, and production modality systematically shape quantification behavior. Increasing numerosity reduces both likelihood and precision of quantification. Animate referents and modality selectively modulate strategy choice.

Conclusion: Demonstrates how quantification can be examined under unconstrained production conditions and provides naturalistic dataset for further analyses of quantity expression in language production.

Abstract: Quantification is a fundamental component of everyday language use, yet little is known about how speakers decide whether and how to quantify in naturalistic production. We investigate quantification in Mandarin Chinese using a picture-based elicited description task in which speakers freely described scenes containing multiple objects, without explicit instructions to count or quantify. Across both spoken and written modalities, we examine three aspects of quantification: whether speakers choose to quantify at all, how precise their quantification is, and which quantificational strategies they adopt. Results show that object numerosity, animacy, and production modality systematically shape quantificational behaviour. In particular, increasing numerosity reduces both the likelihood and the precision of quantification, while animate referents and modality selectively modulate strategy choice. This study demonstrates how quantification can be examined under unconstrained production conditions and provides a naturalistic dataset for further analyses of quantity expression in language production.

[51] SinFoS: A Parallel Dataset for Translating Sinhala Figures of Speech

Johan Sofalas, Dilushri Pavithra, Nevidu Jayatilleke, Ruvan Weerasinghe

Main category: cs.CL

TL;DR: A dataset of 2,344 Sinhala figures of speech with cultural annotations is introduced to address NMT challenges in low-resource languages, with evaluation showing LLMs struggle with idiomatic meanings.

DetailsMotivation: Neural Machine Translation performs well with figurative expressions in high-resource languages but faces challenges with low-resource languages like Sinhala due to limited data, creating a need for culturally-aware datasets.

Method: Created a corpus of 2,344 Sinhala figures of speech with cultural and cross-lingual annotations, developed a binary classifier for FOS types, and evaluated existing LLMs on the dataset.

Result: Achieved ~92% accuracy with a binary classifier for FoS types, and found significant shortcomings in LLMs, which struggle to accurately convey idiomatic meanings in Sinhala.

Conclusion: The dataset provides a crucial benchmark for low-resource NLP and culturally aware machine translation, highlighting the need for improved handling of figurative language in multilingual models.

Abstract: Figures of Speech (FoS) consist of multi-word phrases that are deeply intertwined with culture. While Neural Machine Translation (NMT) performs relatively well with the figurative expressions of high-resource languages, it often faces challenges when dealing with low-resource languages like Sinhala due to limited available data. To address this limitation, we introduce a corpus of 2,344 Sinhala figures of speech with cultural and cross-lingual annotations. We examine this dataset to classify the cultural origins of the figures of speech and to identify their cross-lingual equivalents. Additionally, we have developed a binary classifier to differentiate between two types of FOS in the dataset, achieving an accuracy rate of approximately 92%. We also evaluate the performance of existing LLMs on this dataset. Our findings reveal significant shortcomings in the current capabilities of LLMs, as these models often struggle to accurately convey idiomatic meanings. By making this dataset publicly available, we offer a crucial benchmark for future research in low-resource NLP and culturally aware machine translation.

[52] Steer2Edit: From Activation Steering to Component-Level Editing

Chung-En Sun, Ge Yan, Zimo Wang, Tsui-Wei Weng

Main category: cs.CL

TL;DR: Steer2Edit transforms steering vectors from inference-time control into diagnostic signals for component-level weight editing, achieving better attribute-utility trade-offs than global activation interventions.

DetailsMotivation: Current steering methods apply fixed, global modifications to LLM internal states during inference, which often leads to unfavorable attribute-utility trade-offs because they ignore that behaviors are governed by heterogeneous subsets of model components.

Method: Steer2Edit is a training-free framework that converts steering vectors into diagnostic signals for rank-1 weight editing at the component level (attention heads and MLP neurons), selectively redistributing behavioral influence rather than uniformly injecting steering directions.

Result: Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit achieves more favorable attribute-utility trade-offs: improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average while preserving standard forward pass and parallel inference compatibility.

Conclusion: Steer2Edit provides a principled bridge between representation steering and weight editing by translating steering signals into interpretable, training-free parameter updates that maintain model utility while achieving targeted behavioral modifications.

Abstract: Steering methods influence Large Language Model behavior by identifying semantic directions in hidden representations, but are typically realized through inference-time activation interventions that apply a fixed, global modification to the model’s internal states. While effective, such interventions often induce unfavorable attribute-utility trade-offs under strong control, as they ignore the fact that many behaviors are governed by a small and heterogeneous subset of model components. We propose Steer2Edit, a theoretically grounded, training-free framework that transforms steering vectors from inference-time control signals into diagnostic signals for component-level rank-1 weight editing. Instead of uniformly injecting a steering direction during generation, Steer2Edit selectively redistributes behavioral influence across individual attention heads and MLP neurons, yielding interpretable edits that preserve the standard forward pass and remain compatible with optimized parallel inference. Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit consistently achieves more favorable attribute-utility trade-offs: at matched downstream performance, it improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average. Overall, Steer2Edit provides a principled bridge between representation steering and weight editing by translating steering signals into interpretable, training-free parameter updates.
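
The step from a steering vector to a rank-1 edit can be written in a few lines. The sketch below assumes the simplest possible translation, adding the outer product of the steering direction and an input direction to one component's projection matrix; the paper's actual component selection and redistribution rules are more involved:

```python
# Minimal sketch: turn a steering vector v into a rank-1 weight edit
# W' = W + alpha * outer(v, u) on one selected component, instead of
# adding v to activations at inference time. Selection rule is assumed.
import torch

d = 64
W = torch.randn(d, d)               # a component's projection, y = W @ x
v = torch.randn(d); v /= v.norm()   # steering direction in output space
u = torch.randn(d); u /= u.norm()   # input direction the component responds to
alpha = 0.5

W_edited = W + alpha * torch.outer(v, u)

# Effect: inputs aligned with u now get a push along v, baked into the
# weights; the forward pass itself is unchanged.
x = u + 0.1 * torch.randn(d)
delta = (W_edited @ x) - (W @ x)
print(torch.nn.functional.cosine_similarity(delta, v, dim=0))  # near 1.0
```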

[53] The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies

Chenxu Wang, Chaozhuo Li, Songyang Liu, Zejian Chen, Jinyu Hou, Ji Qi, Rui Li, Litian Zhang, Qiwei Ye, Zheng Liu, Xu Chen, Xi Zhang, Philip S. Yu

Main category: cs.CL

TL;DR: Theoretical and empirical analysis shows that fully isolated, continuously self-evolving multi-agent LLM systems inevitably degrade in safety alignment, creating a fundamental trilemma between self-evolution, isolation, and safety invariance.

DetailsMotivation: To investigate whether multi-agent LLM systems can achieve continuous self-improvement in a fully closed loop while maintaining robust safety alignment - what the authors term the "self-evolution trilemma" of continuous self-evolution, complete isolation, and safety invariance.

Method: Uses an information-theoretic framework to formalize safety as divergence from anthropic value distributions. Theoretically demonstrates that isolated self-evolution induces statistical blind spots leading to irreversible safety degradation. Empirically tests this with an open-ended agent community (Moltbook) and two closed self-evolving systems.

Result: Both theoretical and empirical results show that an agent society satisfying all three conditions of the trilemma is impossible. Isolated self-evolution inevitably leads to safety erosion, with empirical phenomena aligning with theoretical predictions of safety degradation.

Conclusion: Establishes a fundamental limit on self-evolving AI societies, shifting discourse from symptom-driven safety patches to principled understanding of intrinsic dynamical risks. Highlights need for external oversight or novel safety-preserving mechanisms rather than fully isolated self-evolution.

Abstract: The emergence of multi-agent systems built from large language models (LLMs) offers a promising paradigm for scalable collective intelligence and self-evolution. Ideally, such systems would achieve continuous self-improvement in a fully closed loop while maintaining robust safety alignment–a combination we term the self-evolution trilemma. However, we demonstrate both theoretically and empirically that an agent society satisfying continuous self-evolution, complete isolation, and safety invariance is impossible. Drawing on an information-theoretic framework, we formalize safety as the divergence degree from anthropic value distributions. We theoretically demonstrate that isolated self-evolution induces statistical blind spots, leading to the irreversible degradation of the system’s safety alignment. Empirical and qualitative results from an open-ended agent community (Moltbook) and two closed self-evolving systems reveal phenomena that align with our theoretical prediction of inevitable safety erosion. We further propose several solution directions to alleviate the identified safety concern. Our work establishes a fundamental limit on the self-evolving AI societies and shifts the discourse from symptom-driven safety patches to a principled understanding of intrinsic dynamical risks, highlighting the need for external oversight or novel safety-preserving mechanisms.
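
The "statistical blind spot" mechanism admits a toy illustration: an agent that repeatedly re-estimates a value distribution from its own finite samples drifts away from the reference, and rare (safety-critical) categories are the first to vanish. The simulation below is our illustration, not the paper's formal model:

```python
# Toy illustration (ours, not the paper's model): re-estimating a
# categorical value distribution from one's own samples loses rare mass.
import numpy as np

rng = np.random.default_rng(0)
human = np.array([0.70, 0.25, 0.05])  # reference values; index 2 is the
p = human.copy()                      # rare, safety-critical behavior

def kl(p_ref, q, eps=1e-12):
    # Divergence from the reference distribution, KL(p_ref || q).
    return float(np.sum(p_ref * np.log((p_ref + eps) / (q + eps))))

for generation in range(50):
    samples = rng.choice(3, size=50, p=p)          # generate from self
    p = np.bincount(samples, minlength=3) / 50.0   # re-fit on own output

print("final distribution:", p, "KL from reference:", round(kl(human, p), 3))
# Once a category's estimated mass hits zero it can never be re-sampled;
# in a fully closed loop there is no external signal to restore it.
```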

[54] AmharicIR+Instr: A Two-Dataset Resource for Neural Retrieval and Instruction Tuning

Tilahun Yeshambel, Moncef Garouani, Josiane Mothe

Main category: cs.CL

TL;DR: Amharic datasets for neural retrieval-ranking and instruction-following text generation to support low-resource language research

DetailsMotivation: Addressing the scarcity of high-quality supervised data for low-resource languages like Amharic, which limits research on neural retrieval and generative models

Method: Created two datasets: 1) 1,091 query-positive-negative document triplets for retrieval-ranking using expert-curated, web-derived, and LLM-assisted queries with native speaker validation; 2) 6,285 Amharic prompt-response pairs for instruction-following generation using LLMs with manual refinement

Result: Released standardized datasets in multiple formats (CSV, JSON, JSONL) with methodology that can generalize to other low-resource languages

Conclusion: Provides valuable resources for Amharic NLP research and establishes a replicable methodology for creating similar datasets for other low-resource languages

Abstract: Neural retrieval and GPT-style generative models rely on large, high-quality supervised data, which is still scarce for low-resource languages such as Amharic. We release an Amharic data resource consisting of two datasets that support research on (i) neural retrieval-ranking and (ii) instruction-following text generation. The retrieval-ranking dataset contains 1,091 manually verified query-positive-negative document triplets drawn from diverse Amharic sources and constructed to support contrastive training and benchmarking of neural retrievers (e.g., DPR, ColBERT-style late interaction and SPLADE-style sparse neural retrieval). Triplets are created through a combination of expert-curated queries, web-derived queries, and LLM-assisted generation, with positive/negative documents selected from the web or synthesized by LLMs and then validated by native speakers. The instruction prompt-response dataset comprises 6,285 Amharic prompt-response pairs spanning multiple domains and instruction types, generated with several LLMs and refined through manual review and correction for grammaticality, relevance, fluency, and factual plausibility. We release both datasets with standardized splits and formats (CSV, JSON, JSONL) to enable reproducible work on Amharic retrieval, ranking, and generative modelling. These datasets also come with a methodology that can be generalized to other low-resource languages.
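
Query-positive-negative triplets plug directly into standard contrastive training. A minimal sketch of consuming such a file with a margin-based triplet loss, using hypothetical field names rather than the dataset's documented schema:

```python
# Minimal sketch: read (query, positive, negative) triplets from JSONL and
# compute a margin-based triplet loss over toy encoder outputs.
# Field names are assumed, not the dataset's documented schema.
import json
import torch
import torch.nn.functional as F

jsonl = "\n".join([
    json.dumps({"query": "q1", "positive": "doc A", "negative": "doc B"}),
    json.dumps({"query": "q2", "positive": "doc C", "negative": "doc D"}),
])
triplets = [json.loads(line) for line in jsonl.splitlines()]

def encode(texts):  # stand-in for a real Amharic text encoder
    torch.manual_seed(sum(map(len, texts)))
    return F.normalize(torch.randn(len(texts), 128), dim=-1)

q = encode([t["query"] for t in triplets])
pos = encode([t["positive"] for t in triplets])
neg = encode([t["negative"] for t in triplets])

loss = F.triplet_margin_loss(q, pos, neg, margin=0.5)
print(loss.item())
```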

[55] LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations

William Lugoloobi, Thomas Foster, William Bankes, Chris Russell

Main category: cs.CL

TL;DR: LLMs can predict their own success likelihood from internal representations before generation, enabling efficient routing of queries across model pools to reduce inference costs by up to 70% while maintaining performance.

DetailsMotivation: Running LLMs with extended reasoning on every problem is expensive, but determining which inputs require additional compute remains challenging. The paper investigates whether LLMs' internal representations contain signals about their likelihood of success before generation, and if this can guide more efficient inference.

Method: Train linear probes on pre-generation activations to predict policy-specific success on math and coding tasks. Use E2H-AMC dataset with both human and model performance on identical problems. Analyze model-specific vs human difficulty. Apply probes to route queries across a pool of models for cost-efficient inference.

Result: Probes substantially outperform surface features like question length and TF-IDF. Models encode model-specific notion of difficulty distinct from human difficulty, with distinction increasing with extended reasoning. Routing queries across model pool can exceed best-performing model while reducing inference cost by up to 70% on MATH dataset.

Conclusion: Internal representations enable practical efficiency gains even when they diverge from human intuitions about difficulty. LLMs’ pre-generation activations contain recoverable signals about their likelihood of success, which can be leveraged for cost-effective inference routing.

Abstract: Running LLMs with extended reasoning on every problem is expensive, but determining which inputs actually require additional compute remains challenging. We investigate whether their own likelihood of success is recoverable from their internal representations before generation, and if this signal can guide more efficient inference. We train linear probes on pre-generation activations to predict policy-specific success on math and coding tasks, substantially outperforming surface features such as question length and TF-IDF. Using E2H-AMC, which provides both human and model performance on identical problems, we show that models encode a model-specific notion of difficulty that is distinct from human difficulty, and that this distinction increases with extended reasoning. Leveraging these probes, we demonstrate that routing queries across a pool of models can exceed the best-performing model whilst reducing inference cost by up to 70% on MATH, showing that internal representations enable practical efficiency gains even when they diverge from human intuitions about difficulty. Our code is available at: https://github.com/KabakaWilliam/llms_know_difficulty
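
Both pieces, the probe and the router, are short. The sketch below uses toy activations and an assumed threshold-based routing rule (the cheapest model whose predicted success clears the bar), which may differ from the paper's routing policy:

```python
# Minimal sketch: linear probes on pre-generation activations predict
# per-model success; a router picks the cheapest model likely to succeed.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
acts = rng.normal(size=(300, 512))        # pre-generation activations (toy)
models = {"small": 1.0, "large": 8.0}     # name -> relative inference cost
succ = {m: rng.integers(0, 2, 300) for m in models}  # toy success labels

probes = {m: LogisticRegression(max_iter=1000).fit(acts, succ[m])
          for m in models}

def route(a, threshold=0.6):
    # Cheapest model whose probe predicts success above the threshold;
    # fall back to the most capable model otherwise.
    for name in sorted(models, key=models.get):
        if probes[name].predict_proba(a[None])[0, 1] >= threshold:
            return name
    return "large"

print(route(acts[0]), route(acts[1]))
```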

[56] ATTNPO: Attention-Guided Process Supervision for Efficient Reasoning

Shuaiyi Nie, Siyu Ding, Wenyuan Zhang, Linhao Yu, Tianmeng Yang, Yao Chen, Tingwen Liu, Weichong Yin, Yu Sun, Hua Wu

Main category: cs.CL

TL;DR: ATTNPO is a low-overhead process-supervised RL framework that uses attention signals to reduce overthinking in reasoning models by distinguishing essential from redundant reasoning steps.

DetailsMotivation: Large reasoning models trained with RLVR often overthink, generating redundant reasoning without performance gains. Existing methods like trajectory-level length penalties fail to effectively shorten reasoning while maintaining accuracy, and process-supervised methods are resource-intensive with inaccurate credit assignment.

Method: ATTNPO leverages the model’s intrinsic attention signals for step-level credit assignment. It identifies special attention heads that focus on essential steps while suppressing redundant ones, then uses two sub-strategies: discouraging redundant steps while reducing penalties on essential steps to preserve accuracy.

Result: Experimental results show ATTNPO substantially reduces reasoning length while significantly improving performance across 9 benchmarks.

Conclusion: ATTNPO provides an effective low-overhead solution to mitigate overthinking in reasoning models by using attention signals for fine-grained credit assignment.

Abstract: Large reasoning models trained with reinforcement learning and verifiable rewards (RLVR) achieve strong performance on complex reasoning tasks, yet often overthink, generating redundant reasoning without performance gains. Existing trajectory-level length penalties often fail to effectively shorten reasoning and can degrade accuracy, as they uniformly treat all reasoning steps and lack fine-grained signals to distinguish redundancy from necessity. Meanwhile, process-supervised methods are typically resource-intensive and suffer from inaccurate credit assignment. To address these issues, we propose ATTNPO, a low-overhead process-supervised RL framework that leverages the model’s intrinsic attention signals for step-level credit assignment. We first identify a set of special attention heads that naturally focus on essential steps while suppressing redundant ones. By leveraging the attention scores of these heads, we then employ two sub-strategies: discouraging redundant steps to mitigate overthinking, and reducing penalties on essential steps to preserve accuracy. Experimental results show that ATTNPO substantially reduces reasoning length while significantly improving performance across 9 benchmarks.

[57] ViMultiChoice: Toward a Method That Gives Explanation for Multiple-Choice Reading Comprehension in Vietnamese

Trung Tien Cao, Lam Minh Thai, Nghia Hieu Nguyen, Duc-Vu Nguyen, Ngan Luu-Thuy Nguyen

Main category: cs.CL

TL;DR: A Vietnamese multiple-choice reading comprehension dataset and method (ViMultiChoice) that jointly predicts answers and generates explanations, achieving SotA performance.

DetailsMotivation: Existing MCRC models lack explanation capabilities for their answer choices, creating a need for datasets and methods that can both answer questions and provide reasoning behind those answers, particularly for Vietnamese language.

Method: Introduces a new Vietnamese dataset for MCRC with explanations and proposes ViMultiChoice, a method that jointly predicts correct answers and generates corresponding explanations through multi-task learning.

Result: ViMultiChoice outperforms existing MCRC baselines, achieving state-of-the-art performance on both ViMMRC 2.0 benchmark and the new dataset, with joint training of option decision and explanation generation significantly improving multiple-choice accuracy.

Conclusion: The proposed approach successfully addresses the explanation gap in MCRC models for Vietnamese, demonstrating that joint training for answer prediction and explanation generation enhances both tasks and provides interpretable reasoning.

Abstract: Multiple-choice Reading Comprehension (MCRC) models aim to select the correct answer from a set of candidate options for a given question. However, they typically lack the ability to explain the reasoning behind their choices. In this paper, we introduce a novel Vietnamese dataset designed to train and evaluate MCRC models with explanation generation capabilities. Furthermore, we propose ViMultiChoice, a new method specifically designed for modeling Vietnamese reading comprehension that jointly predicts the correct answer and generates a corresponding explanation. Experimental results demonstrate that ViMultiChoice outperforms existing MCRC baselines, achieving state-of-the-art (SotA) performance on both the ViMMRC 2.0 benchmark and the newly introduced dataset. Additionally, we show that jointly training option decision and explanation generation leads to significant improvements in multiple-choice accuracy.
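
The joint objective is the part worth sketching: a shared encoder feeds both an option classifier and an explanation generator, and the two losses are summed. The architecture below is a schematic assumption, not the paper's model:

```python
# Schematic sketch of joint training: answer-selection cross-entropy plus
# explanation-generation LM loss over a shared representation.
import torch
import torch.nn as nn

hidden, n_options, vocab = 256, 4, 1000
encoder = nn.GRU(64, hidden, batch_first=True)  # stand-in shared encoder
answer_head = nn.Linear(hidden, n_options)
expl_head = nn.Linear(hidden, vocab)            # per-step token logits

tokens = torch.randn(8, 32, 64)                 # encoded passage+options
gold_answer = torch.randint(0, n_options, (8,))
gold_expl = torch.randint(0, vocab, (8, 32))    # explanation token ids

states, _ = encoder(tokens)                     # (8, 32, hidden)
answer_loss = nn.functional.cross_entropy(
    answer_head(states[:, -1]), gold_answer)
expl_loss = nn.functional.cross_entropy(
    expl_head(states).reshape(-1, vocab), gold_expl.reshape(-1))

loss = answer_loss + expl_loss                  # joint objective
loss.backward()
print(float(answer_loss), float(expl_loss))
```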

[58] A Unified Assessment of the Poverty of the Stimulus Argument for Neural Language Models

Xiulin Yang, Arianna Bisazza, Nathan Schneider, Ethan Gotlieb Wilcox

Main category: cs.CL

TL;DR: Neural language models can learn some syntactic generalizations without innate linguistic constraints, but remain less data-efficient than children and require additional inductive biases for human-like learning.

DetailsMotivation: To test the Poverty of the Stimulus Hypothesis (PoSH) which claims innate linguistic constraints are necessary for language learning, by examining whether neural language models without such constraints can learn syntactic generalizations from limited input.

Method: Created POSHBench training-and-evaluation suite targeting English syntactic phenomena central to PoSH arguments (question formation, islands to movement). Trained Transformer models on 10-50M words of developmentally plausible text, and tested three cognitively motivated inductive biases.

Result: Models showed indications of generalization on all phenomena without direct positive evidence, but were less data-efficient with weaker generalizations than children. Cognitive biases improved general syntactic competence but not POSHBench performance.

Conclusion: Challenges claim that innate syntax is the only route to generalization, but suggests human-like data efficiency requires inductive biases beyond those tested.

Abstract: How can children acquire native-level syntax from limited input? According to the Poverty of the Stimulus Hypothesis (PoSH), the linguistic input children receive is insufficient to explain certain generalizations that are robustly learned; innate linguistic constraints, many have argued, are thus necessary to explain language learning. Neural language models, which lack such language-specific constraints in their design, offer a computational test of this longstanding (but controversial) claim. We introduce POSHBench, a training-and-evaluation suite targeting question formation, islands to movement, and other English phenomena at the center of the PoSH arguments. Training Transformer models on 10–50M words of developmentally plausible text, we find indications of generalization on all phenomena even without direct positive evidence – yet neural models remain less data-efficient and their generalizations are weaker than those of children. We further enhance our models with three recently proposed cognitively motivated inductive biases. We find these biases improve general syntactic competence but not POSHBench performance. Our findings challenge the claim that innate syntax is the only possible route to generalization, while suggesting that human-like data efficiency requires inductive biases beyond those tested here.

[59] ViSpeechFormer: A Phonemic Approach for Vietnamese Automatic Speech Recognition

Khoa Anh Nguyen, Long Minh Hoang, Nghia Hieu Nguyen, Luan Thanh Nguyen, Ngan Luu-Thuy Nguyen

Main category: cs.CL

TL;DR: ViSpeechFormer: A phoneme-based Vietnamese ASR framework leveraging the language’s phonetic orthography for improved performance and generalization

DetailsMotivation: Vietnamese has high grapheme-phoneme transparency (phonetic orthography), making it suitable for phoneme-based ASR approaches. The authors aim to create the first Vietnamese ASR framework that explicitly models phonemic representations to improve performance and generalization.

Method: Proposes ViSpeechFormer (Vietnamese Speech Transformer), a phoneme-based approach for Vietnamese ASR that exploits the language’s phonetic orthography where each grapheme corresponds to at most one phoneme and vice versa.

Result: Experiments on two publicly available Vietnamese ASR datasets show strong performance, better generalization to out-of-vocabulary words, and reduced susceptibility to training bias compared to existing approaches.

Conclusion: The phoneme-based paradigm is effective for Vietnamese ASR and promising for other languages with phonetic orthographies. The approach demonstrates advantages in generalization and bias reduction.

Abstract: Vietnamese has a phonetic orthography, where each grapheme corresponds to at most one phoneme and vice versa. Exploiting this high grapheme-phoneme transparency, we propose ViSpeechFormer (Vietnamese Speech TransFormer), a phoneme-based approach for Vietnamese Automatic Speech Recognition (ASR). To the best of our knowledge, this is the first Vietnamese ASR framework that explicitly models phonemic representations. Experiments on two publicly available Vietnamese ASR datasets show that ViSpeechFormer achieves strong performance, generalizes better to out-of-vocabulary words, and is less affected by training bias. This phoneme-based paradigm is also promising for other languages with phonetic orthographies. The code will be released upon acceptance of this paper.

[60] SCORE: Specificity, Context Utilization, Robustness, and Relevance for Reference-Free LLM Evaluation

Homaira Huda Shomee, Rochana Chaturvedi, Yangxinyu Xie, Tanwi Mallick

Main category: cs.CL

TL;DR: A multi-dimensional evaluation framework for assessing LLM outputs in domain-specific, high-stakes applications like natural hazard response, focusing on specificity, robustness, relevance, and context utilization.

DetailsMotivation: Existing evaluation frameworks for RAG and open-ended QA rely on surface-level similarity or semantic relevance, failing to assess whether responses provide the specific, decision-critical information needed for domain-sensitive applications like natural hazard response and infrastructure planning.

Method: Proposed a reference-free evaluation framework assessing LLM outputs along four dimensions: specificity, robustness to paraphrasing/semantic perturbations, answer relevance, and context utilization. Created a curated dataset of 1,412 domain-specific QA pairs spanning 40 professional roles and 7 natural hazard types. Conducted human evaluation to assess inter-annotator agreement and alignment with model outputs.

Result: Results show no single metric sufficiently captures answer quality in isolation, demonstrating the need for structured, multi-metric evaluation frameworks for LLMs in high-stakes applications. Human evaluation highlighted the inherent subjectivity of open-ended, domain-specific evaluation.

Conclusion: A multi-dimensional, reference-free evaluation framework is necessary for assessing LLM outputs in domain-specific, high-stakes applications, as existing metrics fail to capture the nuanced information requirements for decision-critical scenarios.

Abstract: Large language models (LLMs) are increasingly used to support question answering and decision-making in high-stakes, domain-specific settings such as natural hazard response and infrastructure planning, where effective answers must convey fine-grained, decision-critical details. However, existing evaluation frameworks for retrieval-augmented generation (RAG) and open-ended question answering primarily rely on surface-level similarity, factual consistency, or semantic relevance, and often fail to assess whether responses provide the specific information required for domain-sensitive decisions. To address this gap, we propose a multi-dimensional, reference-free evaluation framework that assesses LLM outputs along four complementary dimensions: specificity, robustness to paraphrasing and semantic perturbations, answer relevance, and context utilization. We introduce a curated dataset of 1,412 domain-specific question-answer pairs spanning 40 professional roles and seven natural hazard types to support systematic evaluation. We further conduct human evaluation to assess inter-annotator agreement and alignment between model outputs and human judgments, which highlights the inherent subjectivity of open-ended, domain-specific evaluation. Our results show that no single metric sufficiently captures answer quality in isolation and demonstrate the need for structured, multi-metric evaluation frameworks when deploying LLMs in high-stakes applications.

[61] Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference

Wenxuan Xie, Yujia Wang, Xin Tan, Chaochao Lu, Xia Hu, Xuhong Wang

Main category: cs.CL

TL;DR: DRIFT proposes a dual-model architecture that decouples knowledge extraction from reasoning by using a lightweight knowledge model to dynamically compress document chunks into implicit fact tokens, improving long-context task performance.

DetailsMotivation: Existing methods for integrating extensive knowledge into LLMs face challenges with finite context windows, retriever noise, and catastrophic forgetting. There's a need to decouple factual knowledge from reasoning patterns to enable more scalable and efficient knowledge integration.

Method: DRIFT uses a dual-model architecture with a lightweight knowledge model that dynamically compresses document chunks into implicit fact tokens conditioned on queries. These dense representations are projected into the reasoning model’s embedding space, replacing redundant text while maintaining accuracy.

Result: Extensive experiments show DRIFT significantly improves performance on long-context tasks, outperforming strong baselines among comparably sized models.

Conclusion: DRIFT provides a scalable and efficient paradigm for extending the effective context window and reasoning capabilities of LLMs by explicitly decoupling knowledge extraction from reasoning processes.

Abstract: The integration of extensive, dynamic knowledge into Large Language Models (LLMs) remains a significant challenge due to the inherent entanglement of factual data and reasoning patterns. Existing solutions, ranging from non-parametric Retrieval-Augmented Generation (RAG) to parametric knowledge editing, are often constrained in practice by finite context windows, retriever noise, or the risk of catastrophic forgetting. In this paper, we propose DRIFT, a novel dual-model architecture designed to explicitly decouple knowledge extraction from the reasoning process. Unlike static prompt compression, DRIFT employs a lightweight knowledge model to dynamically compress document chunks into implicit fact tokens conditioned on the query. These dense representations are projected into the reasoning model’s embedding space, replacing raw, redundant text while maintaining inference accuracy. Extensive experiments show that DRIFT significantly improves performance on long-context tasks, outperforming strong baselines among comparably sized models. Our approach provides a scalable and efficient paradigm for extending the effective context window and reasoning capabilities of LLMs. Our code is available at https://github.com/Lancelot-Xie/DRIFT.

[62] MEVER: Multi-Modal and Explainable Claim Verification with Graph-based Evidence Retrieval

Delvin Ce Zhang, Suhan Cui, Zhelin Chu, Xianren Zhang, Dongwon Lee

Main category: cs.CL

TL;DR: A novel model for multi-modal claim verification that jointly performs evidence retrieval, verification, and explanation generation using both textual and visual evidence, with a new scientific dataset in AI domain.

DetailsMotivation: Most claim verification works focus only on textual evidence or ignore explainability, leading to inaccurate and unconvincing verification. There's a need for joint multi-modal reasoning over both textual and visual evidence with transparent explanations.

Method: Proposes a model with three components: 1) Multi-modal evidence retrieval using a two-layer graph with image-to-text and text-to-image reasoning, 2) Multi-modal claim verification with token- and evidence-level fusion of embeddings, 3) Explanation generation using multi-modal Fusion-in-Decoder. Also creates AIChartClaim dataset for scientific/AI domain.
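
A rough sketch of the two fusion levels with assumed shapes (not the authors' code): token-level cross-attention between claim and evidence tokens, then attention-weighted pooling over per-evidence summaries:

```python
import torch
import torch.nn as nn

class TwoLevelFusion(nn.Module):
    def __init__(self, d=256, n_labels=3):
        super().__init__()
        self.token_xattn = nn.MultiheadAttention(d, 4, batch_first=True)
        self.evidence_attn = nn.Linear(d, 1)        # scores each evidence item
        self.classifier = nn.Linear(2 * d, n_labels)

    def forward(self, claim_tok, evidence_tok):
        # claim_tok: (B, Tc, d); evidence_tok: (B, E, Te, d)
        B, E, Te, d = evidence_tok.shape
        ev_flat = evidence_tok.reshape(B, E * Te, d)
        fused_tok, _ = self.token_xattn(claim_tok, ev_flat, ev_flat)
        claim_vec = fused_tok.mean(dim=1)               # token-level fusion
        ev_summary = evidence_tok.mean(dim=2)           # (B, E, d)
        w = torch.softmax(self.evidence_attn(ev_summary), dim=1)
        ev_vec = (w * ev_summary).sum(dim=1)            # evidence-level fusion
        return self.classifier(torch.cat([claim_vec, ev_vec], dim=-1))

model = TwoLevelFusion()
logits = model(torch.randn(2, 12, 256), torch.randn(2, 4, 30, 256))
print(logits.shape)  # (2, 3): e.g. supported / refuted / not enough info
```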

Result: Experiments show the strength of the proposed model in achieving joint evidence retrieval, multi-modal claim verification, and explanation generation.

Conclusion: The proposed model effectively addresses limitations of existing claim verification methods by enabling joint multi-modal reasoning with explainability, and contributes a new scientific dataset to the community.

Abstract: Verifying the truthfulness of claims usually requires joint multi-modal reasoning over both textual and visual evidence, such as analyzing both the textual caption and the chart image for claim verification. In addition, to make the reasoning process transparent, a textual explanation is necessary to justify the verification result. However, most claim verification works mainly focus on reasoning over textual evidence only or ignore explainability, resulting in inaccurate and unconvincing verification. To address this problem, we propose a novel model that jointly achieves evidence retrieval, multi-modal claim verification, and explanation generation. For evidence retrieval, we construct a two-layer multi-modal graph for claims and evidence, where we design image-to-text and text-to-image reasoning for multi-modal retrieval. For claim verification, we propose token- and evidence-level fusion to integrate claim and evidence embeddings for multi-modal verification. For explanation generation, we introduce multi-modal Fusion-in-Decoder for explainability. Finally, since almost all existing datasets are in the general domain, we create a scientific dataset, AIChartClaim, in the AI domain to complement the claim verification community. Experiments show the strength of our model.

[63] Anagent For Enhancing Scientific Table & Figure Analysis

Xuehang Guo, Zhiyong Lu, Tom Hope, Qingyun Wang

Main category: cs.CL

TL;DR: Anagent: A multi-agent framework for scientific table & figure analysis using specialized agents (Planner, Expert, Solver, Critic) with modular training strategies, evaluated on AnaBench benchmark with 63,178 instances across 9 scientific domains.

DetailsMotivation: Current AI systems struggle with interpreting complex multimodal scientific knowledge, integrating evidence from different sources, and drawing domain-specific inferences due to the complexity and variability of scientific tables/figures, heterogeneous structures, and long-context requirements.

Method: Proposes Anagent, a multi-agent framework with four specialized agents: Planner (task decomposition), Expert (information retrieval via tools), Solver (information synthesis), and Critic (iterative refinement with 5D quality assessment). Uses modular training strategies combining supervised finetuning and specialized reinforcement learning.
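
A schematic of the four-agent loop under assumed interfaces; the agent callables, acceptance threshold, and round limit are illustrative, not the released implementation:

```python
def anagent_style_loop(task, planner, expert, solver, critic, max_rounds=3):
    """Planner decomposes, Expert retrieves via tools, Solver synthesizes,
    Critic scores five quality dimensions and requests refinement."""
    subtasks = planner(task)                       # task decomposition
    evidence = [expert(st) for st in subtasks]     # targeted tool execution
    analysis = solver(task, subtasks, evidence)    # information synthesis
    for _ in range(max_rounds):
        scores, feedback = critic(task, analysis)  # 5-D quality assessment
        if min(scores.values()) >= 0.8:            # assumed acceptance bar
            break
        # Solver is assumed to accept critic feedback for refinement.
        analysis = solver(task, subtasks, evidence, feedback=feedback)
    return analysis
```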

Result: Anagent achieves substantial improvements: up to 13.43% in training-free settings and 42.12% with finetuning across 170 subdomains. Demonstrates that task-oriented reasoning and context-aware problem-solving are essential for high-quality scientific table & figure analysis.

Conclusion: The multi-agent framework effectively addresses challenges in scientific multimodal analysis, showing significant performance gains and highlighting the importance of specialized reasoning and context-aware approaches for complex scientific data interpretation.

Abstract: In scientific research, analysis requires accurately interpreting complex multimodal knowledge, integrating evidence from different sources, and drawing inferences grounded in domain-specific knowledge. However, current artificial intelligence (AI) systems struggle to consistently demonstrate such capabilities. The complexity and variability of scientific tables and figures, combined with heterogeneous structures and long-context requirements, pose fundamental obstacles to scientific table & figure analysis. To quantify these challenges, we introduce AnaBench, a large-scale benchmark featuring 63,178 instances from nine scientific domains, systematically categorized along seven complexity dimensions. To tackle these challenges, we propose Anagent, a multi-agent framework for enhanced scientific table & figure analysis through four specialized agents: Planner decomposes tasks into actionable subtasks, Expert retrieves task-specific information through targeted tool execution, Solver synthesizes information to generate coherent analysis, and Critic performs iterative refinement through five-dimensional quality assessment. We further develop modular training strategies that leverage supervised finetuning and specialized reinforcement learning to optimize individual capabilities while maintaining effective collaboration. Comprehensive evaluation across 170 subdomains demonstrates that Anagent achieves substantial improvements, up to 13.43% in training-free settings and 42.12% with finetuning, while revealing that task-oriented reasoning and context-aware problem-solving are essential for high-quality scientific table & figure analysis. Our project page: https://xhguo7.github.io/Anagent/.

[64] Quantum-Audit: Evaluating the Reasoning Limits of LLMs on Quantum Computing

Mohamed Afane, Kayla Laufer, Wenqi Wei, Ying Mao, Junaid Farooq, Ying Wang, Juntao Chen

Main category: cs.CL

TL;DR: Quantum-Audit is a benchmark with 2,700 questions to evaluate language models’ understanding of quantum computing concepts, revealing gaps in advanced topics and critical reasoning.

DetailsMotivation: While language models are used in quantum computing education and research, existing benchmarks focus on code generation and circuit design, leaving concept understanding unmeasured. There's a need to systematically evaluate models' grasp of quantum computing fundamentals and advanced topics.

Method: Created Quantum-Audit benchmark with 2,700 questions: 1,000 expert-written, 1,000 LLM-extracted from research papers and expert-validated, plus 700 additional questions including 350 open-ended and 350 with false premises. Evaluated 26 models from leading organizations against human performance baselines.

Result: Human participants scored 23-86% (experts averaged 74%). Top model Claude Opus 4.5 reached 84% accuracy, but models showed an average 12-point accuracy drop on expert-written vs. LLM-generated questions. Performance declined on advanced topics (73% on security). Models frequently accepted false premises (below 66% accuracy on critical reasoning).

Conclusion: While top models can exceed expert average on quantum computing questions, significant gaps remain in handling expert-written content, advanced topics, and critical reasoning. Models often reinforce false premises rather than correct them, highlighting limitations in conceptual understanding.

Abstract: Language models have become practical tools for quantum computing education and research, from summarizing technical papers to explaining theoretical concepts and answering questions about recent developments in the field. While existing benchmarks evaluate quantum code generation and circuit design, models’ understanding of quantum computing concepts has not been systematically measured. Quantum-Audit addresses this gap with 2,700 questions covering core quantum computing topics. We evaluate 26 models from leading organizations. Our benchmark comprises 1,000 expert-written questions, 1,000 questions extracted from research papers using LLMs and validated by experts, plus an additional 700 questions including 350 open-ended questions and 350 questions with false premises to test whether models can correct erroneous assumptions. Human participants scored between 23% and 86%, with experts averaging 74%. Top-performing models exceeded the expert average, with Claude Opus 4.5 reaching 84% accuracy, though top models showed an average 12-point accuracy drop on expert-written questions compared to LLM-generated ones. Performance declined further on advanced topics, dropping to 73% on security questions. Additionally, models frequently accepted and reinforced false premises embedded in questions instead of identifying them, with accuracy below 66% on these critical reasoning tasks.

[65] LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models

Nam V. Nguyen, Thong T. Doan, Luong Tran, Van Nguyen, Quang Pham

Main category: cs.CL

TL;DR: LibMoE is a unified framework for efficient and reproducible MoE research that enables comprehensive analysis of routing dynamics, initialization effects, and training regime differences.

DetailsMotivation: Systematic research on Mixture of Experts (MoE) architectures is constrained by prohibitive computational costs, limiting large-scale studies accessible to most researchers. There's a need for a framework that lowers barriers to entry and standardizes evaluation.

Method: Introduces LibMoE, a unified framework supporting both pretraining and sparse-upcycling regimes with transparent analytical tools for probing routing and expert dynamics. Conducts comprehensive analysis along three dimensions: routing dynamics, lightweight initialization effects, and training regime differences.
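
As a concrete example of the routing analyses such a framework supports, per-token routing entropy can be computed directly from router logits; low entropy suggests specialization, high entropy diffuse expert usage (illustrative, not LibMoE's API):

```python
import torch

def routing_entropy(router_logits: torch.Tensor) -> torch.Tensor:
    """router_logits: (tokens, num_experts) -> per-token entropy in nats."""
    probs = torch.softmax(router_logits, dim=-1)
    return -(probs * torch.log(probs.clamp_min(1e-9))).sum(dim=-1)

logits = torch.randn(8, 16)            # 8 tokens routed over 16 experts
print(routing_entropy(logits).mean())  # average routing entropy
```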

Result: The framework enables reproducible, efficient, and extensible MoE research. Analysis reveals insights into routing patterns, stability, optimality, task specialization, expert diversity, initialization effects on load balancing, and distinct routing patterns between sparse upcycling and full pretraining.

Conclusion: LibMoE broadens access to MoE research and establishes reliable benchmarks to guide future innovations in scalable language model architectures.

Abstract: Mixture of experts (MoE) architectures have become a cornerstone for scaling up and are a key component in most large language models such as GPT-OSS, DeepSeek-V3, Llama-4, and Gemini-2.5. However, systematic research on MoE remains severely constrained by the prohibitive computational costs of training and evaluation, restricting large-scale studies accessible to most researchers. We introduce LibMoE, a unified framework for reproducible, efficient, and extensible MoE research that supports both pretraining and sparse-upcycling regimes. Beyond unified implementations, the framework provides transparent analytical tools for probing routing and expert dynamics. Leveraging this foundation, we conduct a comprehensive analysis along three dimensions: (i) routing dynamics, covering expert selection patterns, routing stability and optimality, and how routing entropy reveals task specialization and expert diversity; (ii) the effect of lightweight initialization on load balancing, demonstrating how subtle changes in router initialization shape early expert utilization; and (iii) training regime differences, revealing how sparse upcycling and full pretraining exhibit distinct routing patterns and stability profiles. By lowering the barrier to entry and standardizing evaluation, along with our comprehensive analysis, LibMoE broadens access to MoE research and establishes a reliable benchmark to guide future innovations. GitHub: https://github.com/Fsoft-AIC/LibMoE.

[66] Can LLMs Automate Fact-Checking Article Writing?

Dhruv Sahnan, David Corney, Irene Larraz, Giovanni Zagni, Ruben Miguez, Zhuohan Xie, Iryna Gurevych, Elizabeth Churchill, Tanmoy Chakraborty, Preslav Nakov

Main category: cs.CL

TL;DR: QRAFT is an LLM-based agentic framework that generates full fact-checking articles by mimicking human fact-checker workflows, aiming to bridge the gap between automated fact-checking assessments and public communication.

DetailsMotivation: Existing automatic fact-checking systems fail to produce output suitable for public dissemination - they typically provide little or no justification for assessments, unlike human fact-checkers who communicate findings through comprehensive articles. The goal is to extend the automatic fact-checking pipeline with automatic generation of full fact-checking articles.

Method: Developed QRAFT, an LLM-based agentic framework that mimics the writing workflow of human fact-checkers. The approach involved identifying key desiderata for fact-checking articles through interviews with experts from leading fact-checking organizations, then building an agentic system that follows human-like writing processes.

Result: QRAFT outperforms several previously proposed text-generation approaches but still lags considerably behind expert-written articles. Human evaluations with professional fact-checkers show practical usefulness but highlight the gap between automated and human-written content.

Conclusion: The work bridges the gap between automated fact-checking assessments and public communication by introducing automatic generation of full fact-checking articles. While QRAFT shows promise and outperforms previous approaches, there’s significant room for improvement to reach expert-level quality.

Abstract: Automatic fact-checking aims to support professional fact-checkers by offering tools that can help speed up manual fact-checking. Yet, existing frameworks fail to address the key step of producing output suitable for broader dissemination to the general public: while human fact-checkers communicate their findings through fact-checking articles, automated systems typically produce little or no justification for their assessments. Here, we aim to bridge this gap. In particular, we argue for the need to extend the typical automatic fact-checking pipeline with automatic generation of full fact-checking articles. We first identify key desiderata for such articles through a series of interviews with experts from leading fact-checking organizations. We then develop QRAFT, an LLM-based agentic framework that mimics the writing workflow of human fact-checkers. Finally, we assess the practical usefulness of QRAFT through human evaluations with professional fact-checkers. Our evaluation shows that while QRAFT outperforms several previously proposed text-generation approaches, it lags considerably behind expert-written articles. We hope that our work will enable further research in this new and important direction. The code for our implementation is available at https://github.com/mbzuai-nlp/qraft.git.

[67] Is the Reversal Curse a Binding Problem? Uncovering Limitations of Transformers from a Basic Generalization Failure

Boshi Wang, Huan Sun

Main category: cs.CL

TL;DR: The paper investigates the Reversal Curse in LLMs, linking it to the binding problem in cognitive science, and proposes a JEPA-based model with memory layers to address conceptual binding limitations.

DetailsMotivation: LLMs exhibit the Reversal Curse - difficulty learning reversible factual associations - which reveals fundamental generalization failures. Understanding this could identify model weaknesses and improve generalization and robustness.

Method: The authors hypothesize the Reversal Curse stems from transformers’ limitations in conceptual binding (inconsistency and entanglements of concept representations). They conduct experiments supporting these conjectures and propose a model based on JEPA (Joint-Embedding Predictive Architecture) with special memory layers for disentangled concept representations.

Result: The JEPA-based model breaks the Reversal Curse for the first time without specialized data augmentation or non-causal masking. Incorporating memory layers further improves generalization by supporting disentangled concept representations.

Conclusion: The research connects the Reversal Curse to the binding problem and demonstrates that addressing conceptual binding limitations can overcome this fundamental LLM weakness. It opens up the challenge of designing models capable of systematic conceptual binding with less human scaffolding.

Abstract: Despite their impressive capabilities, LLMs exhibit a basic generalization failure known as the Reversal Curse, where they struggle to learn reversible factual associations. Understanding why this occurs could help identify weaknesses in current models and advance their generalization and robustness. In this paper, we conjecture that the Reversal Curse in LLMs is a manifestation of the long-standing binding problem in cognitive science, neuroscience and AI. Specifically, we hypothesize two primary causes of the Reversal Curse stemming from transformers’ limitations in conceptual binding: the inconsistency and entanglements of concept representations. We perform a series of experiments that support these conjectures. Our exploration leads to a model design based on JEPA (Joint-Embedding Predictive Architecture) that for the first time breaks the Reversal Curse without side-stepping it with specialized data augmentation or non-causal masking, and moreover, generalization could be further improved by incorporating special memory layers that support disentangled concept representations. Our research opens up the broader fundamental challenge of designing models capable of learning systematic conceptual binding with less human scaffolding.

[68] Subject islands do not reduce to construction-specific discourse function

Mandy Cartner, Matthew Kogan, Nikolas Webster, Matthew Wagers, Ivy Sichel

Main category: cs.CL

TL;DR: Subject islands exist across multiple syntactic constructions (wh-questions, relative clauses, topicalization), challenging information-structure accounts and supporting abstract syntactic explanations.

DetailsMotivation: To test whether subject islands are specific to wh-questions due to information structure clashes, or whether they exist across different syntactic constructions as predicted by abstract syntactic theories.

Method: Three large-scale acceptability studies using super-additive design to isolate subject island violations in three constructions: wh-questions, relative clauses, and topicalization.
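
For readers unfamiliar with the design, the super-additive logic is a 2x2 differences-in-differences over acceptability ratings; a minimal sketch with made-up numbers:

```python
def island_effect(short_nonisland, long_nonisland, short_island, long_island):
    """Positive values indicate a super-additive (island) penalty: the drop
    for extraction from inside a subject exceeds the drop for extraction
    from a non-island position."""
    length_cost = short_nonisland - long_nonisland  # cost of extraction alone
    island_cost = short_island - long_island        # cost inside a subject
    return island_cost - length_cost

# Hypothetical mean acceptability ratings on a 1-7 scale:
print(island_effect(6.2, 5.1, 6.0, 2.9))  # 2.0 -> super-additive penalty
```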

Result: Found subject island effects in all three construction types, despite only wh-questions having the information structure clash proposed by Abeillé et al. (2020).

Conclusion: Subject islands are not specific to wh-questions’ information structure, supporting abstract syntactic accounts over information-structure explanations.

Abstract: The term islands in linguistics refers to phrases from which extracting an element results in ungrammaticality (Ross, 1967). Grammatical subjects are considered islands because extracting a sub-part of a subject results in an ill-formed sentence, despite having a clear intended meaning (e.g., “Which topic did the article about inspire you?”). The generative tradition, which views syntax as autonomous of meaning and function, attributes this ungrammaticality to the abstract movement dependency between the wh-phrase and the subject-internal position with which it is associated for interpretation. However, research on language that emphasizes its communicative function suggests instead that syntactic constraints, including islands, can be explained based on the way different constructions package information. Accordingly, Abeillé et al. (2020) suggest that the islandhood of subjects is specific to the information structure of wh-questions, and propose that subjects are not islands for movement, but for focusing, due to their discourse-backgroundedness. This predicts that other constructions that differ in their information structure from wh-questions, but still involve movement, should not create a subject island effect. We test this prediction in three large-scale acceptability studies, using a super-additive design that singles out subject island violations, in three different constructions: wh-questions, relative clauses, and topicalization. We report evidence for a subject island effect in each construction type, despite only wh-questions introducing what Abeillé et al. (2020) call “a clash in information structure.” We argue that this motivates an account of islands in terms of abstract, syntactic representations, independent of the communicative function associated with the constructions.

[69] Cochain: Balancing Insufficient and Excessive Collaboration in LLM Agent Workflows

Jiaxing Zhao, Hongbin Xie, Yuzhen Lei, Xuan Song, Zhuoran Shi, Lianxin Li, Shuangxue Liu, Linguo Xie, Haoran Zhang

Main category: cs.CL

TL;DR: Cochain is a collaboration prompting framework that combines knowledge graphs and prompt trees to solve business workflow collaboration problems more efficiently than chain-of-thought or multi-agent approaches.

DetailsMotivation: Chain-of-thought faces collaboration challenges due to complex cross-domain prompt design, while multi-agent systems consume excessive tokens and dilute the primary problem in business workflow tasks.

Method: Constructs an integrated knowledge graph incorporating knowledge from multiple stages, and maintains/retrieves a prompts tree to obtain relevant prompt information across business workflow stages.
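
A toy prompt tree with stage-keyed retrieval; the data structures are assumed for illustration, not the authors' code:

```python
class PromptNode:
    def __init__(self, stage, prompt, children=None):
        self.stage, self.prompt = stage, prompt
        self.children = children or []

def retrieve_prompts(node, target_stage, acc=None):
    """Depth-first collection of prompts relevant to one workflow stage."""
    acc = [] if acc is None else acc
    if node.stage == target_stage:
        acc.append(node.prompt)
    for child in node.children:
        retrieve_prompts(child, target_stage, acc)
    return acc

root = PromptNode("root", "You coordinate a business workflow.", [
    PromptNode("intake", "Summarize the client request."),
    PromptNode("review", "Check the draft against stage-1 requirements.",
               [PromptNode("review", "Flag missing legal clauses.")]),
])
print(retrieve_prompts(root, "review"))  # prompts for the 'review' stage only
```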

Result: Cochain outperforms all baselines in both prompt engineering and multi-agent LLMs across multiple datasets. Expert evaluation shows small model + Cochain outperforms GPT-4.

Conclusion: Cochain effectively solves business workflow collaboration problems by combining knowledge and prompts at reduced cost, offering superior performance to existing approaches.

Abstract: Large Language Models (LLMs) have demonstrated impressive performance in executing complex reasoning tasks. Chain-of-thought effectively enhances reasoning capabilities by unlocking the potential of large models, while multi-agent systems provide more comprehensive solutions by integrating the collective intelligence of multiple agents. However, both approaches face significant limitations. A single agent with chain-of-thought faces collaboration challenges due to the inherent complexity of designing cross-domain prompts. Meanwhile, multi-agent systems consume substantial tokens and inevitably dilute the primary problem, which is particularly problematic in business workflow tasks. To address these challenges, we propose Cochain, a collaboration prompting framework that effectively solves the business workflow collaboration problem by combining knowledge and prompts at a reduced cost. Specifically, we construct an integrated knowledge graph that incorporates knowledge from multiple stages. Furthermore, by maintaining and retrieving a prompts tree, we can obtain prompt information relevant to other stages of the business workflow. We perform extensive evaluations of Cochain across multiple datasets, demonstrating that Cochain outperforms all baselines in both prompt engineering and multi-agent LLMs. Additionally, expert evaluation results indicate that the use of a small model in combination with Cochain outperforms GPT-4.

[70] EAMET: Robust Massive Model Editing via Embedding Alignment Optimization

Yanbo Dai, Zhenlan Ji, Zongjie Li, Shuai Wang

Main category: cs.CL

TL;DR: EAMET improves model editing for LLMs by aligning key and residual embeddings to handle massive editing scenarios better than existing methods.

DetailsMotivation: Existing model editing techniques degrade in massive editing scenarios, especially with practical metrics, and lack robustness in context-rich settings or when editing multiple facts about the same subject simultaneously.

Method: Proposes EAMET (Embedding Alignment Model Editing in Transformers) which addresses embedding misalignment among knowledge items by aligning the space of key and residual embeddings to improve editing reliability at scale.
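
One way to picture an alignment objective of this kind (an illustration of the general idea, not EAMET's exact loss): penalize divergence between the pairwise geometry of the key and residual embedding sets:

```python
import torch
import torch.nn.functional as F

def alignment_loss(keys: torch.Tensor, residuals: torch.Tensor) -> torch.Tensor:
    """keys, residuals: (num_edits, d). Compare the two Gram matrices so the
    edited facts occupy a consistent geometry in both spaces."""
    k = F.normalize(keys, dim=-1)
    r = F.normalize(residuals, dim=-1)
    return ((k @ k.T) - (r @ r.T)).pow(2).mean()

print(alignment_loss(torch.randn(100, 64), torch.randn(100, 64)))
```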

Result: Extensive experiments across six LLMs and three datasets show EAMET consistently outperforms existing methods, achieving about 90% editing efficacy when editing 10k facts.

Conclusion: EAMET effectively addresses embedding misalignment issues in large-scale model editing, providing a robust solution for efficiently updating knowledge in LLMs.

Abstract: Model editing techniques are essential for efficiently updating knowledge in large language models (LLMs). However, the effectiveness of existing approaches degrades in massive editing scenarios, particularly when evaluated with practical metrics. Their robustness is also limited in context-rich settings or when editing multiple facts of the same subject simultaneously. We attribute these failures to the embedding misalignment among knowledge items, which undermines editing reliability at scale. To address this, we propose EAMET (Embedding Alignment Model Editing in Transformers), which aligns the space of key and residual embeddings. Extensive experiments across six LLMs and three datasets demonstrate that EAMET consistently outperforms existing methods, achieving about 90% editing efficacy when editing 10k facts. Codes and datasets are publicly available at https://ybdai7.github.io/eamet-page/.

[71] MolLangBench: A Comprehensive Benchmark for Language-Prompted Molecular Structure Recognition, Editing, and Generation

Feiyang Cai, Jiahui Bai, Tao Tang, Guijuan He, Joshua Luo, Tianyu Zhu, Srikanth Pilla, Gang Li, Ling Liu, Feng Luo

Main category: cs.CL

TL;DR: MolLangBench is a comprehensive benchmark for evaluating molecule-language interface tasks including recognition, editing, and generation, revealing significant limitations in current AI models like GPT-5.

DetailsMotivation: There's a need to evaluate fundamental molecule-language interface tasks (recognition, editing, generation) as precise molecular manipulation is essential for chemists and AI systems in chemical applications.

Method: Created a benchmark with recognition tasks using automated cheminformatics tools, and editing/generation tasks through expert annotation and validation. Supports evaluation across different molecular representations (linear strings, images, graphs).
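
A minimal sketch of how a deterministic recognition item can be generated with standard cheminformatics tooling (RDKit); the question template and ring-count task are illustrative, not necessarily MolLangBench's own:

```python
from rdkit import Chem

def ring_count_item(smiles: str) -> dict:
    """Build a recognition QA pair whose answer is computed, not authored."""
    mol = Chem.MolFromSmiles(smiles)
    return {
        "question": f"How many rings does the molecule {smiles} contain?",
        "answer": mol.GetRingInfo().NumRings(),
    }

# A benzene ring plus a cyclopropane ring -> answer 2.
print(ring_count_item("c1ccccc1CC2CC2"))
```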

Result: State-of-the-art models show significant limitations: GPT-5 achieves 86.2% accuracy on recognition and 85.5% on editing (intuitively simple for humans), and only 43.0% on generation tasks.

Conclusion: Current AI systems have shortcomings in handling basic molecular recognition and manipulation tasks. MolLangBench aims to catalyze research toward more effective AI systems for chemical applications.

Abstract: Precise recognition, editing, and generation of molecules are essential prerequisites for both chemists and AI systems tackling various chemical tasks. We present MolLangBench, a comprehensive benchmark designed to evaluate fundamental molecule-language interface tasks: language-prompted molecular structure recognition, editing, and generation. To ensure high-quality, unambiguous, and deterministic outputs, we construct the recognition tasks using automated cheminformatics tools, and curate editing and generation tasks through rigorous expert annotation and validation. MolLangBench supports the evaluation of models that interface language with different molecular representations, including linear strings, molecular images, and molecular graphs. Evaluations of state-of-the-art models reveal significant limitations: the strongest model (GPT-5) achieves 86.2% and 85.5% accuracy on recognition and editing tasks, which are intuitively simple for humans, and performs even worse on the generation task, reaching only 43.0% accuracy. These results highlight the shortcomings of current AI systems in handling even preliminary molecular recognition and manipulation tasks. We hope MolLangBench will catalyze further research toward more effective and reliable AI systems for chemical applications. The dataset and code can be accessed at https://huggingface.co/datasets/ChemFM/MolLangBench and https://github.com/TheLuoFengLab/MolLangBench, respectively.

[72] An Iterative Question-Guided Framework for Knowledge Base Question Answering

Shuai Wang, Yinan Yu

Main category: cs.CL

TL;DR: iQUEST is a question-guided KBQA framework that iteratively decomposes complex queries into sub-questions and uses GNNs to incorporate 2-hop neighbor information for improved multi-hop reasoning over knowledge graphs.

DetailsMotivation: LLMs often exhibit factual inconsistencies in knowledge-intensive tasks, and multi-hop KBQA faces challenges in maintaining coherent reasoning paths and avoiding premature discarding of critical multi-hop connections.

Method: Iterative question decomposition into simpler sub-questions combined with GNN-based look-ahead that incorporates 2-hop neighbor information at each reasoning step.
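
A skeleton of the iterative question-guided loop; the function boundaries are assumptions drawn from the description, not the released code:

```python
def iquest_answer(question, decompose, retrieve, gnn_lookahead, answer,
                  max_steps=5):
    """decompose returns the next sub-question, or None when done."""
    context = []
    for _ in range(max_steps):
        sub_q = decompose(question, context)
        if sub_q is None:
            break
        candidates = retrieve(sub_q)  # candidate KG entities/relations
        # Score candidates with a GNN over their 2-hop neighborhoods, so
        # critical multi-hop connections are not discarded prematurely.
        best = max(candidates, key=lambda c: gnn_lookahead(c, sub_q))
        context.append((sub_q, best))
    return answer(question, context)
```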

Result: Consistent improvements across four benchmark datasets and four different LLMs, demonstrating the effectiveness of the approach.

Conclusion: iQUEST provides a structured framework for reliable multi-hop reasoning over knowledge graphs by combining iterative question decomposition with graph-based look-ahead mechanisms.

Abstract: Large Language Models (LLMs) excel in many natural language processing tasks but often exhibit factual inconsistencies in knowledge-intensive settings. Integrating external knowledge resources, particularly knowledge graphs (KGs), provides a transparent and updatable foundation for more reliable reasoning. Knowledge Base Question Answering (KBQA), which queries and reasons over KGs, is central to this effort, especially for complex, multi-hop queries. However, multi-hop reasoning poses two key challenges: (1) maintaining coherent reasoning paths, and (2) avoiding prematurely discarding critical multi-hop connections. To tackle these challenges, we introduce iQUEST, a question-guided KBQA framework that iteratively decomposes complex queries into simpler sub-questions, ensuring a structured and focused reasoning trajectory. Additionally, we integrate a Graph Neural Network (GNN) to look ahead and incorporate 2-hop neighbor information at each reasoning step. This dual approach strengthens the reasoning process, enabling the model to explore viable paths more effectively. Detailed experiments demonstrate the consistent improvement delivered by iQUEST across four benchmark datasets and four LLMs.

[73] What Should Feature Distillation Transfer in LLMs? A Task-Tangent Geometry View

Khouloud Saadi, Di Wang

Main category: cs.CL

TL;DR: Flex-KD: A functional perspective on knowledge distillation that transfers teacher’s functional geometry rather than direct feature matching, enabling effective distillation under dimension mismatch.

DetailsMotivation: Existing feature-based knowledge distillation methods treat representations as objects with intrinsic meaning through direct feature matching or learned projections, but the relevance of a representation dimension is determined by how it affects model output. This work proposes a functional perspective focusing on how teacher's output depends on internal representations.

Method: Proposes Flex-KD, an architecture-agnostic and parameter-free distillation method that transfers teacher’s functional geometry while matching student’s representational capacity. Instead of preserving full high-dimensional features, it retains dominant directions of functional contribution, inducing an effective functional dimension for each task.
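
A sketch of the "keep only dominant directions" idea, with an SVD over teacher features standing in for the paper's functional-geometry analysis (dimensions are made up):

```python
import torch

def project_teacher_features(teacher_feats: torch.Tensor, k: int):
    """teacher_feats: (batch, d_teacher). Keep the top-k directions and
    return (batch, k) features matched to a narrower student."""
    centered = teacher_feats - teacher_feats.mean(dim=0, keepdim=True)
    # Rows of Vh are the principal directions of the feature space.
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    basis = vh[:k]                 # (k, d_teacher)
    return centered @ basis.T      # (batch, k)

teacher = torch.randn(512, 1024)                  # teacher hidden states
reduced = project_teacher_features(teacher, 128)  # assumed student width
print(reduced.shape)                              # (512, 128)
```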

Result: Extensive experiments across language understanding and generation benchmarks show Flex-KD consistently outperforms existing distillation approaches, particularly under severe teacher-student dimension mismatch.

Conclusion: A functional perspective on feature-based distillation that focuses on transferring teacher’s functional geometry rather than direct representation alignment leads to more effective knowledge transfer, especially when dealing with dimension mismatch between teacher and student models.

Abstract: Feature-based knowledge distillation aims to transfer intermediate representations from a teacher LLM model to a student. Existing approaches typically rely on direct feature matching or learned projections, implicitly treating representations as objects with intrinsic meaning. However, the relevance of a representation dimension is determined solely by how it affects the model’s output. In this work, we propose a functional perspective on feature-based distillation. We characterize knowledge transfer in terms of the teacher’s functional geometry, i.e., how its output depends on internal representations, rather than direct representation alignment. This viewpoint reveals that effective distillation need not preserve full high-dimensional features, but instead should retain dominant directions of functional contribution, naturally inducing an effective functional dimension for each task. Building on this framework, we introduce Flex-KD, an architecture-agnostic and parameter-free distillation method that transfers the teacher’s functional geometry while matching the student’s representational capacity. Extensive experiments across language understanding and generation benchmarks demonstrate that Flex-KD consistently outperforms existing distillation approaches, particularly under severe teacher-student dimension mismatch.

[74] The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs

Zichen Wen, Jiashu Qu, Zhaorun Chen, Xiaoya Lu, Dongrui Liu, Zhiyuan Liu, Ruixi Wu, Yicun Yang, Xiangqi Jin, Haoyun Xu, Xuyang Liu, Weijia Li, Chaochao Lu, Jing Shao, Conghui He, Linfeng Zhang

Main category: cs.CL

TL;DR: DIJA is a novel jailbreak attack framework targeting diffusion-based LLMs that exploits their bidirectional modeling and parallel decoding to bypass safety alignment through adversarial interleaved mask-text prompts.

DetailsMotivation: Diffusion-based LLMs offer advantages like faster inference but have unique safety vulnerabilities. Existing alignment mechanisms fail to protect against context-aware, masked-input adversarial prompts, exposing novel security threats in this emerging model architecture.

Method: DIJA constructs adversarial interleaved mask-text prompts that exploit dLLMs’ bidirectional modeling (which forces contextually consistent outputs for masked spans) and parallel decoding (which limits dynamic filtering of unsafe content). The attack doesn’t require rewriting or hiding harmful content.

Result: DIJA achieves up to 100% keyword-based ASR on Dream-Instruct, surpassing ReNeLLM by up to 78.5% in evaluator-based ASR on JailbreakBench and by 37.7 points in StrongREJECT score. It exposes significant safety weaknesses in aligned dLLMs.

Conclusion: The study reveals fundamental safety vulnerabilities in diffusion-based LLMs that current alignment mechanisms cannot address, highlighting the urgent need for rethinking safety approaches in this emerging model class.

Abstract: Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling. However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities. To this end, we present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Specifically, our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs, i.e., bidirectional modeling and parallel decoding. Bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when harmful, while parallel decoding limits the model’s dynamic filtering and rejection sampling of unsafe content. This causes standard alignment mechanisms to fail, enabling harmful completions in alignment-tuned dLLMs, even when harmful behaviors or unsafe instructions are directly exposed in the prompt. Through comprehensive experiments, we demonstrate that DIJA significantly outperforms existing jailbreak methods, exposing a previously overlooked threat surface in dLLM architectures. Notably, our method achieves up to 100% keyword-based ASR on Dream-Instruct, surpassing the strongest prior baseline, ReNeLLM, by up to 78.5% in evaluator-based ASR on JailbreakBench and by 37.7 points in StrongREJECT score, while requiring no rewriting or hiding of harmful content in the jailbreak prompt. Our findings underscore the urgent need for rethinking safety alignment in this emerging class of language models. Code is available at https://github.com/ZichenWen1/DIJA.

[75] Modelling and Classifying the Components of a Literature Review

Francisco Bolaños, Angelo Salatino, Francesco Osborne, Enrico Motta

Main category: cs.CL

TL;DR: A novel annotation schema for rhetorical roles in scientific papers and evaluation of 37 LLMs on classifying these roles, with a new benchmark dataset Sci-Sentence containing expert-annotated and LLM-labeled sentences.

DetailsMotivation: To support AI systems for analyzing scientific literature and generating high-quality literature reviews by developing reliable annotation schemas and effective large-scale annotation strategies for rhetorical roles in papers.

Method: 1) Introduces a novel unambiguous annotation schema designed for reliable automatic processing, 2) Presents Sci-Sentence benchmark with 700 expert-annotated and 2,240 LLM-labeled sentences, 3) Evaluates 37 LLMs using zero-shot learning and fine-tuning approaches.

Result: Modern LLMs achieve strong results (surpassing 96% F1) when fine-tuned on high-quality data, with both large proprietary models (GPT-4o) and lightweight open-source alternatives performing well. Augmenting training with semi-synthetic LLM-generated examples boosts performance for small encoders and improves open decoder models.

Conclusion: The proposed annotation schema and benchmark enable effective rhetorical role classification, with LLMs showing strong performance, especially when fine-tuned and augmented with synthetic data, supporting development of AI systems for scientific literature analysis.

Abstract: Previous work has demonstrated that AI methods for analysing scientific literature benefit significantly from annotating sentences in papers according to their rhetorical roles, such as research gaps, results, limitations, extensions of existing methodologies, and others. Such representations also have the potential to support the development of a new generation of systems capable of producing high-quality literature reviews. However, achieving this goal requires the definition of a relevant annotation schema and effective strategies for large-scale annotation of the literature. This paper addresses these challenges in two ways: 1) it introduces a novel, unambiguous annotation schema that is explicitly designed for reliable automatic processing, and 2) it presents a comprehensive evaluation of a wide range of large language models (LLMs) on the task of classifying rhetorical roles according to this schema. To this end, we also present Sci-Sentence, a novel multidisciplinary benchmark comprising 700 sentences manually annotated by domain experts and 2,240 sentences automatically labelled using LLMs. We evaluate 37 LLMs on this benchmark, spanning diverse model families and sizes, using both zero-shot learning and fine-tuning approaches. The experiments reveal that modern LLMs achieve strong results on this task when fine-tuned on high-quality data, surpassing 96% F1, with both large proprietary models such as GPT-4o and lightweight open-source alternatives performing well. Moreover, augmenting the training set with semi-synthetic LLM-generated examples further boosts performance, enabling small encoders to achieve robust results and substantially improving several open decoder models.

[76] A Survey on Parallel Text Generation: From Parallel Decoding to Diffusion Language Models

Lingzhe Zhang, Liancheng Fang, Chiming Duan, Minghua He, Leyi Pan, Pei Xiao, Shiyu Huang, Yunpeng Zhai, Xuming Hu, Philip S. Yu, Aiwei Liu

Main category: cs.CL

TL;DR: A systematic survey of parallel text generation methods that categorizes approaches into AR-based and Non-AR paradigms to address the sequential bottleneck in LLM inference.

DetailsMotivation: Autoregressive generation in LLMs produces tokens sequentially, creating a bottleneck for inference speed. There's growing interest in parallel text generation techniques but no comprehensive analysis of what constitutes these methods and how they improve performance.

Method: Systematic survey categorizing parallel text generation into AR-based and Non-AR-based paradigms, with detailed examination of core techniques within each category. Assessment of theoretical trade-offs in speed, quality, and efficiency, and examination of combination potential with alternative acceleration strategies.

Result: Provides taxonomy of parallel text generation methods, analyzes their trade-offs, and creates a GitHub repository for indexing relevant papers and open resources.

Conclusion: Highlights recent advancements, identifies open challenges, and outlines promising directions for future research in parallel text generation to overcome sequential bottlenecks in LLM inference.

Abstract: As text generation has become a core capability of modern Large Language Models (LLMs), it underpins a wide range of downstream applications. However, most existing LLMs rely on autoregressive (AR) generation, producing one token at a time based on previously generated context, resulting in limited generation speed due to the inherently sequential nature of the process. To address this challenge, an increasing number of researchers have begun exploring parallel text generation, a broad class of techniques aimed at breaking the token-by-token generation bottleneck and improving inference efficiency. Despite growing interest, there remains a lack of comprehensive analysis on what specific techniques constitute parallel text generation and how they improve inference performance. To bridge this gap, we present a systematic survey of parallel text generation methods. We categorize existing approaches into AR-based and Non-AR-based paradigms, and provide a detailed examination of the core techniques within each category. Following this taxonomy, we assess their theoretical trade-offs in terms of speed, quality, and efficiency, and examine their potential for combination and comparison with alternative acceleration strategies. Finally, based on our findings, we highlight recent advancements, identify open challenges, and outline promising directions for future research in parallel text generation. We have also created a GitHub repository for indexing relevant papers and open resources available at https://github.com/zhanglingzhe0820/Awesome-Parallel-Text-Generation.

[77] Inference-Aware Prompt Optimization for Aligning Black-Box Large Language Models

Saaduddin Mahmud, Mason Nakamura, Kyle Hollins Wray, Shlomo Zilberstein

Main category: cs.CL

TL;DR: IAPO is a unified framework that jointly optimizes prompts and inference scaling strategies for black-box LLMs, addressing the interdependence between prompt optimization and inference strategies while considering user preferences and budget constraints.

DetailsMotivation: Existing prompt optimization methods ignore inference strategies like Best-of-N Sampling and Majority Voting, creating a methodological gap. There's strong interdependence between prompt optimization and inference scaling, and user preferences for trade-offs among objectives and inference budgets significantly influence optimal configurations.

Method: Introduced IAPO (Inference-Aware Prompt Optimization) framework that jointly optimizes prompts and inference scale while considering inference budget and task objectives. Developed PSST (Prompt Scaling via Sequential Trimming) algorithm with fixed-budget training and established finite-budget error probability guarantees.
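
A toy version of fixed-budget sequential trimming in the successive-halving family that the name suggests; PSST's actual schedule and guarantees differ:

```python
import random

def sequential_trimming(configs, evaluate, total_budget):
    """Spend a fixed evaluation budget, halving the candidate set each round."""
    rounds = max(1, len(configs).bit_length() - 1)
    per_round = total_budget // rounds
    survivors = list(configs)
    while len(survivors) > 1 and per_round > 0:
        trials = max(1, per_round // len(survivors))
        scores = {c: sum(evaluate(c) for _ in range(trials)) / trials
                  for c in survivors}
        survivors.sort(key=lambda c: scores[c], reverse=True)
        survivors = survivors[:max(1, len(survivors) // 2)]
    return survivors[0]

# Hypothetical joint search space: prompt variant x inference scale (samples).
configs = [(p, n) for p in ("prompt_a", "prompt_b") for n in (1, 4, 16)]
best = sequential_trimming(configs, lambda c: random.random(), total_budget=60)
print(best)
```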

Result: Evaluated PSST on six tasks including multi-objective text generation and reasoning, demonstrating the critical importance of incorporating inference-awareness in aligning black-box LLMs through prompt optimization.

Conclusion: Joint optimization of prompts and inference strategies is essential for effective LLM alignment, and inference-awareness significantly improves prompt optimization outcomes when considering user preferences and budget constraints.

Abstract: Prompt optimization methods have demonstrated significant effectiveness in aligning black-box large language models (LLMs). In parallel, inference scaling strategies such as Best-of-N Sampling and Majority Voting have likewise been shown to improve alignment and performance by trading additional computation for better output. However, existing prompt optimization approaches are inference strategy agnostic; that is, they optimize prompts without accounting for the inference strategy. This constitutes a significant methodological gap, as our empirical and theoretical analysis reveals a strong interdependence between these two paradigms. Moreover, we find that user preferences regarding trade-offs among multiple objectives and inference budgets substantially influence the choice of prompt and inference configuration. To address this gap, we introduce a novel unified framework named IAPO (Inference-Aware Prompt Optimization) that jointly optimizes the prompt and inference scale, while being aware of the inference budget and different task objectives. We then develop a fixed-budget training algorithm for IAPO, called PSST (Prompt Scaling via Sequential Trimming), and establish finite-budget guarantees on the error probability. Finally, we evaluate the effectiveness of PSST on six tasks, including multi-objective text generation and reasoning, and demonstrate the critical role of incorporating inference-awareness in aligning black-box LLMs using prompt optimization.

[78] Self-Guided Function Calling in Large Language Models via Stepwise Experience Recall

Sijia Cui, Aiyao He, Shuai Xu, Hongming Zhang, Yanna Wang, Qingyang Zhang, Yajing Wang, Bo Xu

Main category: cs.CL

TL;DR: SEER: A self-guided method for LLM tool usage that performs stepwise retrieval from an incrementally updated experience pool to improve multi-step tool selection, parameter generation, and planning.

DetailsMotivation: LLMs struggle with multi-step tool usage including tool selection, parameter generation, and tool-chain planning. Existing methods require manual demonstration design or curated libraries, which become inefficient as tool diversity and task difficulty scale.

Method: Proposes Stepwise Experience Recall (SEER) which performs fine-grained, stepwise retrieval from a continually updated experience pool. Instead of static libraries, SEER incrementally augments the pool with past successful trajectories, enabling continuous expansion and improved performance over time.
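
A toy experience pool with stepwise recall; the bag-of-words embedding and record layout are stand-ins for whatever encoder and trajectory format SEER actually uses:

```python
import math
from collections import defaultdict

def embed(text):  # toy bag-of-words vector, stand-in for a real encoder
    vec = defaultdict(float)
    for tok in text.lower().split():
        vec[tok] += 1.0
    return vec

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class ExperiencePool:
    def __init__(self):
        self.steps = []  # (embedding, step record) from successful runs

    def add_trajectory(self, trajectory):
        """Only successful trajectories are added, so the pool grows over time."""
        for step in trajectory:
            self.steps.append((embed(step["state"]), step))

    def recall(self, state, k=3):
        q = embed(state)
        ranked = sorted(self.steps, key=lambda e: cosine(q, e[0]), reverse=True)
        return [step for _, step in ranked[:k]]

pool = ExperiencePool()
pool.add_trajectory([{"state": "need flight price", "action": "call flights_api"}])
print(pool.recall("what is the flight price"))
```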

Result: On ToolQA benchmark: 6.1% improvement on easy questions, 4.7% on hard questions. On τ-bench with real-world domains: 7.44% and 23.38% accuracy gains using Qwen2.5-7B and Qwen2.5-72B models respectively.

Conclusion: SEER addresses LLM tool usage challenges through self-guided experience recall, demonstrating significant improvements on tool-use benchmarks without requiring extensive manual curation.

Abstract: Function calling enables large language models (LLMs) to interact with external systems by leveraging tools and APIs. When faced with multi-step tool usage, LLMs still struggle with tool selection, parameter generation, and tool-chain planning. Existing methods typically rely on manually designing task-specific demonstrations, or retrieving from a curated library. These approaches demand substantial expert effort, and prompt engineering becomes increasingly complex and inefficient as tool diversity and task difficulty scale. To address these challenges, we propose a self-guided method, Stepwise Experience Recall (SEER), which performs fine-grained, stepwise retrieval from a continually updated experience pool. Instead of relying on a static or manually curated library, SEER incrementally augments the experience pool with past successful trajectories, enabling continuous expansion of the pool and improved model performance over time. Evaluated on the ToolQA benchmark, SEER achieves an average improvement of 6.1% on easy and 4.7% on hard questions. We further test SEER on τ-bench, which includes two real-world domains. Powered by Qwen2.5-7B and Qwen2.5-72B models, SEER demonstrates substantial accuracy gains of 7.44% and 23.38%, respectively.

[79] Distribution-Aligned Decoding for Efficient LLM Task Adaptation

Senkang Hu, Xudong Han, Jinqi Jiang, Yihang Tao, Zihan Fang, Yong Dai, Sam Tak Wu Kwong, Yuguang Fang

Main category: cs.CL

TL;DR: SVDecode is a lightweight method for task adaptation that uses steering vectors during decoding to align output distributions, improving performance without additional trainable parameters beyond PEFT adapters.

DetailsMotivation: Even parameter-efficient fine-tuning (PEFT) methods for billion-parameter language models remain costly. The paper proposes to view task adaptation as output-distribution alignment rather than weight updates, aiming for a more efficient approach.

Method: SVDecode starts with a short warm-start fine-tune, extracts a task-aware steering vector from KL divergence gradients between warm-started and pre-trained models, then uses this vector during decoding to steer output distributions toward task distributions.
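
A minimal decoding-time sketch in the spirit of the method; applying the steering vector additively to the logits and treating the strength as a free parameter are our simplifications (the paper derives a globally optimal strength):

```python
import torch

def steered_next_token(logits: torch.Tensor, steering_vec: torch.Tensor,
                       strength: float = 1.0) -> int:
    """logits, steering_vec: (vocab,). Shift the output distribution toward
    the task distribution at decode time, without weight updates."""
    steered = torch.log_softmax(logits + strength * steering_vec, dim=-1)
    return int(torch.argmax(steered))

vocab = 32000
print(steered_next_token(torch.randn(vocab), torch.randn(vocab), 0.5))
```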

Result: Across three tasks and nine benchmarks, SVDecode paired with four PEFT methods improves multiple-choice accuracy by up to 5 percentage points and open-ended truthfulness by 2 percentage points, with similar gains on commonsense datasets.

Conclusion: SVDecode offers a lightweight, theoretically grounded path to stronger task adaptation for large language models without adding trainable parameters beyond PEFT adapters.

Abstract: Adapting billion-parameter language models to a downstream task is still costly, even with parameter-efficient fine-tuning (PEFT). We re-cast task adaptation as output-distribution alignment: the objective is to steer the output distribution toward the task distribution directly during decoding rather than indirectly through weight updates. Building on this view, we introduce Steering Vector Decoding (SVDecode), a lightweight, PEFT-compatible, and theoretically grounded method. We start with a short warm-start fine-tune and extract a task-aware steering vector from the Kullback-Leibler (KL) divergence gradient between the output distribution of the warm-started and pre-trained models. This steering vector is then used to guide the decoding process to steer the model’s output distribution towards the task distribution. We theoretically prove that SVDecode is first-order equivalent to the gradient step of full fine-tuning and derive a globally optimal solution for the strength of the steering vector. Across three tasks and nine benchmarks, SVDecode paired with four standard PEFT methods improves multiple-choice accuracy by up to 5 percentage points and open-ended truthfulness by 2 percentage points, with similar gains (1-2 percentage points) on commonsense datasets without adding trainable parameters beyond the PEFT adapter. SVDecode thus offers a lightweight, theoretically grounded path to stronger task adaptation for large language models. Code is available at https://github.com/dl-m9/SVDecode.

[80] Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from Token and Parameter Levels

Junjie Ye, Yuming Yang, Yang Nan, Shuo Li, Qi Zhang, Tao Gui, Xuanjing Huang, Peng Wang, Zhongchao Shi, Jianping Fan

Main category: cs.CL

TL;DR: SFT can degrade LLM knowledge; fine-tuning on more data sometimes performs worse than on less data; most parameter updates don’t enhance knowledge; restoring them can improve performance.

DetailsMotivation: The impact of supervised fine-tuning (SFT) on LLM knowledge is underexplored, limiting control over knowledge changes in fine-tuned models. Understanding how SFT affects knowledge is crucial for developing better fine-tuning strategies.

Method: Evaluated closed-book question answering (CBQA) performance across five LLMs from LLaMA-2 and LLaMA-3 families. Analyzed model behavior at token and parameter levels, examining how varying fine-tuning data size (240 vs 1,920 samples) and knowledge mastery levels affect performance.
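
A minimal sketch of the restoration idea, with a magnitude-quantile rule as an assumed stand-in for the paper's selection criterion:

```python
import torch

def restore_small_updates(pretrained: dict, finetuned: dict, quantile=0.9):
    """Revert fine-tuned parameters whose SFT update magnitude falls below
    the per-tensor quantile cutoff back to their pre-trained values."""
    restored = {}
    for name, w_pre in pretrained.items():
        delta = finetuned[name] - w_pre
        cutoff = torch.quantile(delta.abs().flatten(), quantile)
        keep = delta.abs() >= cutoff
        restored[name] = torch.where(keep, finetuned[name], w_pre)
    return restored

pre = {"layer.weight": torch.zeros(4, 4)}
post = {"layer.weight": torch.randn(4, 4)}
print(restore_small_updates(pre, post)["layer.weight"])  # ~90% reverted to 0
```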

Result: Surprisingly, models fine-tuned on 1,920 samples performed up to 14% worse than those fine-tuned on only 240 samples. Performance fluctuated by over 12% with varying knowledge mastery in fine-tuning data. Analysis showed up to 90% of parameter updates during SFT don’t contribute to knowledge enhancement.

Conclusion: Most SFT parameter updates don’t enhance knowledge; restoring these updates can improve CBQA performance depending on fine-tuning data characteristics. Insights provide practical guidance for developing fine-tuning strategies that more effectively strengthen model knowledge.

Abstract: Large language models (LLMs) acquire substantial world knowledge during pre-training, which is further shaped by post-training techniques such as supervised fine-tuning (SFT). However, the impact of SFT on a model’s knowledge remains underexplored, limiting our ability to control knowledge change behavior in fine-tuned models. To address this gap, we evaluate closed-book question answering (CBQA) performance across five LLMs from the LLaMA-2 and LLaMA-3 families. Surprisingly, models fine-tuned on 1,920 samples perform up to 14% worse than those fine-tuned on only 240 samples. Furthermore, varying the level of knowledge mastery in the fine-tuning data leads to performance fluctuations of over 12%. To investigate these effects, we analyze model behavior at both the token and parameter levels. Our analysis reveals that up to 90% of parameter updates during SFT do not contribute to knowledge enhancement. Restoring these updates can improve performance on the CBQA task, depending on the characteristics of the fine-tuning data. These insights offer practical guidance for developing fine-tuning strategies that more effectively strengthen model knowledge.

[81] REPAIR: Robust Editing via Progressive Adaptive Intervention and Reintegration

Yisu Wang, Ming Wang, Haoyuan Song, Wenjie Huang, Chaozheng Wang, Yi Xie, Xuming Ran

Main category: cs.CL

TL;DR: REPAIR is a lifelong editing framework for LLMs that enables precise, low-cost model updates while preserving non-target knowledge through closed-loop feedback, dynamic memory management, and knowledge fusion.

DetailsMotivation: Traditional post-training for LLMs is expensive and often causes unintended side effects when acquiring new knowledge or correcting errors. There's a need for a method that supports precise model updates while preserving existing knowledge.

Method: REPAIR uses a closed-loop feedback mechanism with dynamic memory management to mitigate instability from sequential edits. It incorporates frequent knowledge fusion and strong locality guards to address unintended ripple effects from distribution-agnostic approaches.

Result: REPAIR boosts editing accuracy by 10%-30% across multiple model families and significantly reduces knowledge forgetting compared to traditional approaches.

Conclusion: REPAIR provides a robust framework for developing reliable, scalable, and continually evolving LLMs through precise, low-cost editing while preserving non-target knowledge.

Abstract: Post-training for large language models (LLMs) is constrained by the high cost of acquiring new knowledge or correcting errors and by the unintended side effects that frequently arise from retraining. To address these issues, we introduce REPAIR (Robust Editing via Progressive Adaptive Intervention and Reintegration), a lifelong editing framework designed to support precise and low-cost model updates while preserving non-target knowledge. REPAIR mitigates the instability and conflicts of large-scale sequential edits through a closed-loop feedback mechanism coupled with dynamic memory management. Furthermore, by incorporating frequent knowledge fusion and enforcing strong locality guards, REPAIR effectively addresses the shortcomings of traditional distribution-agnostic approaches that often overlook unintended ripple effects. Our experiments demonstrate that REPAIR boosts editing accuracy by 10%-30% across multiple model families and significantly reduces knowledge forgetting. This work introduces a robust framework for developing reliable, scalable, and continually evolving LLMs.

[82] Sri Lanka Document Datasets: A Large-Scale, Multilingual Resource for Law, News, and Policy

Nuwan I. Senaratna

Main category: cs.CL

TL;DR: Open multilingual document datasets from Sri Lanka covering parliamentary, legal, government, news, and tourism data in Sinhala, Tamil, and English, updated daily and available on GitHub/Hugging Face.

DetailsMotivation: To create open, machine-readable resources to support research in computational linguistics, legal analytics, socio-political studies, and multilingual NLP for Sri Lankan languages.

Method: Collection pipeline for gathering parliamentary proceedings, legal judgments, government publications, news, and tourism statistics from Sri Lanka, with daily updates and mirroring on GitHub and Hugging Face.

Result: 253,817 documents (72.2 GB) across 26 datasets in Sinhala, Tamil, and English, with version v2026-02-10-1051.

Conclusion: Provides valuable multilingual resources for NLP research, with ongoing updates and consideration of licensing/ethical issues.

Abstract: We present a collection of open, machine-readable document datasets covering parliamentary proceedings, legal judgments, government publications, news, and tourism statistics from Sri Lanka. The collection currently comprises 253,817 documents (72.2 GB) across 26 datasets in Sinhala, Tamil, and English. The datasets are updated daily and mirrored on GitHub and Hugging Face. These resources aim to support research in computational linguistics, legal analytics, socio-political studies, and multilingual natural language processing. We describe the data sources, collection pipeline, formats, and potential use cases, while discussing licensing and ethical considerations. This manuscript is at version v2026-02-10-1051.

[83] A large-scale pipeline for automatic corpus annotation using LLMs: variation and change in the English consider construction

Cameron Morin, Matti Marttinen Larsson

Main category: cs.CL

TL;DR: A scalable pipeline using LLMs to automate grammatical annotation in large corpora, demonstrated on 143,933 ‘consider’ concordance lines with 98%+ accuracy, enabling new linguistic research questions.

DetailsMotivation: Manual annotation is a bottleneck in corpus linguistics as corpora expand rapidly. The paper addresses the need for scalable, automated grammatical annotation methods to handle voluminous text data.

Method: Four-phase workflow: prompt engineering, pre-hoc evaluation, automated batch processing via OpenAI API, and post-hoc validation. Applied to annotate evaluative ‘consider’ constructions in COHA corpus.

Result: Annotated 143,933 concordance lines in under 60 hours with 98%+ accuracy. Bayesian multinomial GAM analysis of 44,527 true positives revealed previously undocumented genre-specific trajectories of change in English evaluative constructions.

Conclusion: LLMs can perform data preparation tasks at scale with minimal human intervention, unlocking substantive research questions previously beyond practical reach, though requiring attention to costs, licensing, and ethical considerations.

Abstract: As natural language corpora expand at an unprecedented rate, manual annotation remains a significant methodological bottleneck in corpus linguistic work. We address this challenge by presenting a scalable pipeline for automating grammatical annotation in voluminous corpora using large language models (LLMs). Unlike previous supervised and iterative approaches, our method employs a four-phase workflow: prompt engineering, pre-hoc evaluation, automated batch processing, and post-hoc validation. We demonstrate the pipeline’s accessibility and effectiveness through a diachronic case study of variation in the English evaluative consider construction (consider X as/to be/zero Y). We annotate 143,933 ‘consider’ concordance lines from the Corpus of Historical American English (COHA) via the OpenAI API in under 60 hours, achieving over 98% accuracy on two sophisticated annotation procedures. A Bayesian multinomial GAM fitted to 44,527 true positives of the evaluative construction reveals previously undocumented genre-specific trajectories of change, enabling us to advance new hypotheses about the relationship between register formality and competing pressures of morphosyntactic reduction and enhancement. Our results suggest that LLMs can perform a range of data preparation tasks at scale with minimal human intervention, unlocking substantive research questions previously beyond practical reach, though implementation requires attention to costs, licensing, and other ethical considerations.
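
A minimal sketch of the automated batch-processing phase, assuming one chat-completion call per concordance line. The prompt wording, label set, and model name are placeholders, and a production run would use the OpenAI batch endpoint rather than this serial loop.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are annotating the English evaluative 'consider' construction.\n"
    "Label the variant as one of: AS, TO_BE, ZERO, NOT_EVALUATIVE.\n"
    "Sentence: {line}\nAnswer with the label only."
)

def annotate(lines: list[str], model: str = "gpt-4o-mini") -> list[str]:
    """Automated annotation phase of the pipeline (sketch)."""
    labels = []
    for line in lines:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": PROMPT.format(line=line)}],
            temperature=0.0,  # deterministic labels for annotation
        )
        labels.append(resp.choices[0].message.content.strip())
    return labels
```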

[84] Machine Text Detectors are Membership Inference Attacks

Ryuto Koike, Liam Dugan, Masahiro Kaneko, Chris Callison-Burch, Naoaki Okazaki

Main category: cs.CL

TL;DR: The paper demonstrates theoretical and empirical connections between membership inference attacks (MIAs) and machine-generated text detection, showing they share optimal metrics and methods can transfer between tasks.

DetailsMotivation: MIAs and machine-generated text detection have been studied independently despite using similar probability distribution signals, potentially missing stronger methods and insights from cross-task learning.

Method: Theoretical analysis showing optimal metrics are identical for both tasks, unifying existing methods under this metric, and large-scale empirical experiments measuring cross-task performance correlation. Introduces MINT evaluation suite with 15 recent methods.

Result: Strong rank correlation (ρ ≈ 0.7) in cross-task performance, with a machine text detector achieving the strongest performance on both tasks. Demonstrates practical transferability between the two domains.

Conclusion: MIAs and machine-generated text detection are fundamentally connected tasks that can benefit from cross-task development and evaluation, with the MINT suite facilitating fair comparison.

Abstract: Although membership inference attacks (MIAs) and machine-generated text detection target different goals, their methods often exploit similar signals based on a language model’s probability distribution, and the two tasks have been studied independently. This can result in conclusions that overlook stronger methods and valuable insights from the other task. In this work, we theoretically and empirically demonstrate the transferability, i.e., how well a method originally developed for one task performs on the other, between MIAs and machine text detection. We prove that the metric achieving asymptotically optimal performance is identical for both tasks. We unify existing methods under this optimal metric and hypothesize that the accuracy with which a method approximates this metric is directly correlated with its transferability. Our large-scale empirical experiments demonstrate very strong rank correlation ($\rho \approx 0.7$) in cross-task performance. Notably, we also find that a machine text detector achieves the strongest performance among evaluated methods on both tasks, demonstrating the practical impact of transferability. To facilitate cross-task development and fair evaluation, we introduce MINT, a unified evaluation suite for MIAs and machine-generated text detection, implementing 15 recent methods from both tasks.
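
As a concrete instance of the shared signal family, the sketch below scores a text by its mean token log-likelihood under a target LM; thresholding this single statistic gives both a baseline membership inference attack (members score higher) and a baseline machine-text detector (model-generated text scores higher). The gpt2 checkpoint is a placeholder model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def avg_log_likelihood(text: str, model, tokenizer) -> float:
    """Mean per-token log-probability of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)
    # log-probability assigned to each actual next token
    token_lp = logprobs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()

tok = AutoTokenizer.from_pretrained("gpt2")           # placeholder model
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
score = avg_log_likelihood("An example sentence to score.", lm, tok)
```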

[85] ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization

Guoxin Chen, Jing Wu, Xinjie Chen, Wayne Xin Zhao, Ruihua Song, Chengxi Li, Kai Fan, Dayiheng Liu, Minpeng Liao

Main category: cs.CL

TL;DR: ReForm introduces a reflective autoformalization method that integrates semantic consistency evaluation with iterative refinement to improve translation of natural language mathematics to formal statements, achieving 22.6% average improvement over baselines.

DetailsMotivation: Current LLM approaches to autoformalization treat it as simple translation without mechanisms for self-reflection and iterative refinement, leading to formal statements that are syntactically correct but often fail to preserve the original problem's semantic intent.

Method: ReForm integrates semantic consistency evaluation into autoformalization, enabling iterative generation, assessment of semantic fidelity, and self-correction through progressive refinement. Uses Prospective Bounded Sequence Optimization (PBSO) with different rewards at different sequence positions to train both accurate autoformalization and correct semantic validations.

Result: Achieves average improvement of 22.6 percentage points over strongest baselines across four autoformalization benchmarks. Introduces ConsistencyCheck benchmark (859 expert-annotated items) that validates LLMs as judges and reveals human experts produce semantic errors in up to 38.5% of cases.

Conclusion: Reflective autoformalization with integrated semantic consistency evaluation significantly improves translation quality, and autoformalization remains inherently difficult even for human experts, highlighting the need for robust evaluation methods.

Abstract: Autoformalization, which translates natural language mathematics into machine-verifiable formal statements, is critical for using formal mathematical reasoning to solve math problems stated in natural language. While Large Language Models can generate syntactically correct formal statements, they often fail to preserve the original problem’s semantic intent. This limitation arises from the LLM approaches’ treating autoformalization as a simplistic translation task which lacks mechanisms for self-reflection and iterative refinement that human experts naturally employ. To address these issues, we propose ReForm, a Reflective Autoformalization method that tightly integrates semantic consistency evaluation into the autoformalization process. This enables the model to iteratively generate formal statements, assess its semantic fidelity, and self-correct identified errors through progressive refinement. To effectively train this reflective model, we introduce Prospective Bounded Sequence Optimization (PBSO), which employs different rewards at different sequence positions to ensure that the model develops both accurate autoformalization and correct semantic validations, preventing superficial critiques that would undermine the purpose of reflection. Extensive experiments across four autoformalization benchmarks demonstrate that ReForm achieves an average improvement of 22.6 percentage points over the strongest baselines. To further ensure evaluation reliability, we introduce ConsistencyCheck, a benchmark of 859 expert-annotated items that not only validates LLMs as judges but also reveals that autoformalization is inherently difficult: even human experts produce semantic errors in up to 38.5% of cases.
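
A prompting-level sketch of the reflective loop, assuming a hypothetical text-in/text-out `llm` callable and Lean as an illustrative target language; the actual system trains generation and semantic validation jointly with PBSO rather than prompting a frozen model.

```python
def reflective_autoformalize(problem: str, llm, max_rounds: int = 4) -> str:
    """Generate a formal statement, then iteratively self-correct (sketch)."""
    statement = llm(f"Formalize in Lean 4:\n{problem}")
    for _ in range(max_rounds):
        verdict = llm(
            "Does this formal statement preserve the semantics of the "
            "problem? Answer CONSISTENT or list the errors.\n"
            f"Problem: {problem}\nStatement: {statement}"
        )
        if verdict.strip().startswith("CONSISTENT"):
            break  # semantic check passed; stop refining
        statement = llm(
            f"Revise the statement to fix these errors:\n{verdict}\n"
            f"Problem: {problem}\nCurrent statement: {statement}"
        )
    return statement
```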

[86] Evolving Interactive Diagnostic Agents in a Virtual Clinical Environment

Pengcheng Qiu, Chaoyi Wu, Junwei Liu, Qiaoyu Zheng, Yusheng Liao, Haowen Wang, Yun Yue, Qianrui Fan, Shuai Zhen, Jian Wang, Jinjie Gu, Yanfeng Wang, Ya Zhang, Weidi Xie

Main category: cs.CL

TL;DR: A framework for training LLMs as diagnostic agents using reinforcement learning in a virtual clinical environment, enabling multi-turn interactive diagnosis with adaptive examination selection.

DetailsMotivation: Current instruction-tuned LLMs lack dynamic diagnostic strategies for multi-turn interactive processes. There's a need for models that can adaptively select examinations and commit to final diagnoses through exploration and outcome-based feedback rather than static training data.

Method: Three main components: (1) DiagGym - a diagnostics world model trained on electronic health records serving as virtual clinical environment; (2) DiagAgent - trained via end-to-end multi-turn reinforcement learning to learn dynamic diagnostic policies; (3) DiagBench - a multi-center diagnostic benchmark with 2.2K physician-validated cases and 3.3K physician-written rubrics for evaluation.

Result: DiagAgent significantly outperforms 11 SOTA LLMs and 2 prompt-engineered agents, achieving an 11.20% increase in diagnostic accuracy and a 17.58% boost in examination recommendation F1 score. Maintains SOTA performance across all three external centers and surpasses the next-best model by 7.1% in weighted rubric score.

Conclusion: Learning policies in interactive clinical environments provides long-term diagnostic management abilities that cannot be achieved through passive training. The framework demonstrates superior performance in both in-domain and out-of-domain settings.

Abstract: We present a framework for training large language models (LLMs) as diagnostic agents with reinforcement learning, enabling them to manage multi-turn interactive diagnostic processes, adaptively select examinations, and commit to final diagnoses. Unlike instruction-tuned models trained on static data, our method acquires diagnostic strategies through dynamic exploration and outcome-based feedback, mapping evolving patient states to the next optimal examination and subsequent diagnosis. Our contributions include: (i) DiagGym, a diagnostics world model trained with electronic health records, serving as a virtual clinical environment to support closed-loop in-silico training and evaluation for interactive diagnosis; (ii) DiagAgent, trained via end-to-end multi-turn RL to learn dynamic diagnostic policies that optimize both interactive effectiveness and final accuracy; (iii) DiagBench, a multi-center diagnostic benchmark designed to evaluate multi-turn diagnostic interaction trajectories. The benchmark comprises 2.2K physician-validated cases sourced from 4 distinct distributions, alongside 3.3K physician-written rubrics for granular process-oriented evaluation. (iv) Extensive evaluations demonstrate DiagAgent’s superior performance across both in-domain and out-of-domain (OOD) settings. DiagAgent significantly outperforms 11 SOTA LLMs and 2 prompt-engineered agents. In the end-to-end setting, it delivers an 11.20% increase in diagnostic accuracy and a 17.58% boost in examination recommendation F1 score, while consistently maintaining SOTA performance across all three external centers. Furthermore, in rubric-based evaluations, it surpasses the next-best model by 7.1% in weighted rubric score. These findings indicate that learning policies in interactive clinical environments confers long-term diagnostic management abilities unattainable through passive training.
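
The interaction protocol can be sketched as a gym-style episode loop in which a DiagGym-like world model simulates each ordered examination; the `env` and `agent` interfaces below are hypothetical stand-ins, not the paper's API.

```python
def run_episode(env, agent, max_turns: int = 10):
    """Multi-turn interactive diagnosis against a virtual patient (sketch).

    `env` plays the role of the world model, returning the simulated
    result of each ordered examination; `agent` maps the evolving case
    record to either another examination or a final diagnosis.
    """
    record = env.reset()                     # initial patient presentation
    for _ in range(max_turns):
        action = agent.act(record)           # e.g. {"exam": "chest_ct"} or {"diagnosis": ...}
        if "diagnosis" in action:
            return action["diagnosis"], env.score(action["diagnosis"])
        result = env.step(action["exam"])    # world model simulates the exam result
        record = record + f"\n{action['exam']}: {result}"
    return None, 0.0                         # ran out of turns without committing
```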

[87] TOPol: Capturing and Explaining Multidimensional Semantic Polarity Fields and Vectors

Gabin Taibi, Lucia Gomez

Main category: cs.CL

TL;DR: TOPol is a semi-unsupervised framework for analyzing multidimensional narrative polarity fields using transformer-based LLMs, UMAP projections, and topic segmentation to quantify semantic displacement during discourse regime shifts.

DetailsMotivation: Traditional sentiment analysis treats polarity as unidimensional, overlooking the complex multidimensional structure of language. The authors aim to develop a framework that can capture both affective and non-affective semantic shifts in discourse across different contextual boundaries.

Method: TOPol uses transformer-based LLMs to embed documents, applies neighbor-tuned UMAP projection for dimensionality reduction, segments topics via Leiden partitioning, computes directional vectors between topic-boundary centroids for different discourse regimes, and uses LLMs to interpret polarity vectors with contrastive labels.

Result: The framework successfully captures both affective (Amazon reviews aligned with NRC valence) and non-affective (Central Bank speeches around macroeconomic breakpoints) polarity transitions. Results show methodological stability with only CB definitions significantly affecting outcomes.

Conclusion: TOPol provides a scalable, generalizable, and interpretable framework for context-sensitive multidimensional discourse analysis that goes beyond traditional unidimensional sentiment analysis.

Abstract: Traditional approaches to semantic polarity in computational linguistics treat sentiment as a unidimensional scale, overlooking the multidimensional structure of language. This work introduces TOPol (Topic-Orientation POLarity), a semi-unsupervised framework for reconstructing and interpreting multidimensional narrative polarity fields under human-on-the-loop (HoTL) defined contextual boundaries (CBs). The framework embeds documents using a transformer-based large language model (tLLM), applies neighbor-tuned UMAP projection, and segments topics via Leiden partitioning. Given a CB between discourse regimes A and B, TOPol computes directional vectors between corresponding topic-boundary centroids, yielding a polarity field that quantifies fine-grained semantic displacement during regime shifts. This vectorial representation enables assessing CB quality and detecting polarity changes, guiding HoTL CB refinement. To interpret identified polarity vectors, the tLLM compares their extreme points and produces contrastive labels with estimated coverage. Robustness analyses show that only CB definitions (the main HoTL-tunable parameter) significantly affect results, confirming methodological stability. We evaluate TOPol on two corpora: (i) U.S. Central Bank speeches around a macroeconomic breakpoint, capturing non-affective semantic shifts, and (ii) Amazon product reviews across rating strata, where affective polarity aligns with NRC valence. Results demonstrate that TOPol consistently captures both affective and non-affective polarity transitions, providing a scalable, generalizable, and interpretable framework for context-sensitive multidimensional discourse analysis.
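
A dependency-light sketch of the polarity-field computation: embed both regimes, cluster them jointly, and take per-topic centroid differences across the contextual boundary. KMeans stands in for the paper's UMAP-plus-Leiden segmentation, so this is an approximation of the pipeline, not a reimplementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def polarity_field(emb_a: np.ndarray, emb_b: np.ndarray, n_topics: int = 8):
    """Directional vectors between topic centroids across a boundary.

    emb_a / emb_b hold document embeddings from discourse regimes A
    and B.  Each returned vector is the semantic displacement of one
    shared topic across the contextual boundary.
    """
    pooled = np.vstack([emb_a, emb_b])
    topics = KMeans(n_clusters=n_topics, n_init=10).fit_predict(pooled)
    t_a, t_b = topics[: len(emb_a)], topics[len(emb_a):]
    field = {}
    for t in range(n_topics):
        if (t_a == t).any() and (t_b == t).any():
            # displacement of topic t from regime A to regime B
            field[t] = emb_b[t_b == t].mean(axis=0) - emb_a[t_a == t].mean(axis=0)
    return field
```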

[88] In-Context Learning Without Copying

Kerem Sahin, Sheridan Feucht, Adam Belfki, Jannik Brinkmann, Aaron Mueller, David Bau, Chris Wendler

Main category: cs.CL

TL;DR: Hapax training regime reduces induction head development while preserving abstractive in-context learning capabilities, challenging the necessity of induction heads for such capabilities.

DetailsMotivation: To investigate whether induction heads are necessary for abstractive in-context learning (where answers aren't in the input context) or if such capabilities can emerge independently.

Method: Proposed Hapax training regime that omits loss contribution of tokens predictable by induction heads, reducing inductive copying while preserving abstractive ICL capabilities.

Result: Despite 31.7% token omission from loss and reduced induction head development, abstractive ICL capabilities preserved with higher accuracy on 13/21 tasks and lower loss on non-induction-predictable tokens.

Conclusion: Developmental link between induction heads and abstractive ICL capabilities is weaker than previously hypothesized; abstractive ICL can emerge without strong induction heads.

Abstract: Induction heads are attention heads that perform inductive copying by matching patterns from earlier context and copying their continuations verbatim. As models develop induction heads, they experience a sharp drop in training loss, a phenomenon cited as evidence that induction heads may underlie a wide range of in-context learning (ICL) capabilities. In this work, we investigate whether induction heads are a necessary building block for learning abstractive ICL capabilities (i.e., tasks where the answer is not contained in the input context), or whether such capabilities can emerge independently. We propose Hapax, a training regime that omits the loss contribution of tokens predictable by induction heads. Despite a significant reduction in inductive copying, abstractive ICL capabilities are preserved, with the model achieving higher accuracy than the vanilla model on 13 out of 21 tasks, even though 31.7% of tokens are omitted from the loss. Furthermore, our model achieves lower loss values on token positions that induction heads cannot predict. Mechanistic analysis shows that models trained with Hapax develop fewer and weaker induction heads despite preserving abstractive ICL capabilities. Our findings suggest that the developmental link between induction heads and abstractive ICL capabilities is weaker than previously hypothesized.
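
The loss masking can be sketched with a minimal proxy for "induction-predictable": drop the loss on any target whose bigram already occurred verbatim earlier in the sequence. This detection rule is a simplification of the paper's criterion.

```python
import torch
import torch.nn.functional as F

def hapax_loss(logits: torch.Tensor, ids: torch.Tensor) -> torch.Tensor:
    """Cross-entropy that omits induction-predictable targets (sketch).

    A target at position t is skipped when the bigram
    (ids[t-1], ids[t]) already occurred earlier in the sequence,
    i.e. a verbatim continuation an induction head could copy.
    """
    B, T = ids.shape
    mask = torch.ones(B, T - 1, device=ids.device)
    for b in range(B):
        seen = set()
        for t in range(1, T):
            bigram = (ids[b, t - 1].item(), ids[b, t].item())
            if bigram in seen:
                mask[b, t - 1] = 0.0   # copyable token: drop its loss
            seen.add(bigram)
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        ids[:, 1:].reshape(-1),
        reduction="none",
    ).view(B, T - 1)
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```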

[89] Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL

Qifeng Cai, Hao Liang, Chang Xu, Tao Xie, Wentao Zhang, Bin Cui

Main category: cs.CL

TL;DR: Text2SQL-Flow: A SQL-aware data augmentation framework that generates large-scale, diverse Text-to-SQL pairs to address data scarcity, creating SQLFlow dataset (75K examples) that improves LLM performance in both fine-tuning and prompt-based settings.

DetailsMotivation: Existing Text-to-SQL datasets suffer from scarcity, limited diversity, and structural simplicity, which constrains model performance. The data-centric AI paradigm emphasizes high-quality training data as crucial for advancing Text-to-SQL systems.

Method: Proposes Text2SQL-Flow framework with six augmentation dimensions and an end-to-end pipeline including database selection, SQL executability verification, NL question generation, NL-SQL correspondence verification, and chain-of-thought reasoning trace generation. Also introduces masked alignment retrieval method for closed-source LLMs.

Result: Constructed SQLFlow dataset with 75,386 annotated examples. Fine-tuning open-source LLMs with SQLFlow improves problem-solving ability with competitive gains across benchmarks. Masked alignment retrieval method outperforms existing example retrieval methods for closed-source LLMs.

Conclusion: Provides scalable, data-centric foundation for advancing Text-to-SQL systems, demonstrating the importance of structured, high-fidelity data in modern AI development. The framework addresses data scarcity through systematic augmentation.

Abstract: The data-centric paradigm has emerged as a pivotal direction in artificial intelligence (AI), emphasizing the role of high-quality training data. This shift is especially critical in the Text-to-SQL task, where the scarcity, limited diversity, and structural simplicity of existing datasets constrain model performance. To address these challenges, we propose Text2SQL-Flow, a SQL-aware data augmentation framework that systematically generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs from limited seed data. Our framework spans six augmentation dimensions and integrates an end-to-end pipeline with auxiliary database selection, SQL executability verification, natural language (NL) question generation, NL-SQL correspondence verification, and chain-of-thought (CoT) reasoning trace generation. Leveraging this framework, we construct SQLFlow, a high-quality dataset comprising 75,386 annotated examples. We demonstrate the utility of SQLFlow in both fine-tuning and prompt-based settings. (1) For open-source large language models (LLMs), fine-tuning with SQLFlow improves problem-solving ability, delivering competitive gains across multiple benchmarks under the same data budget. (2) For closed-source LLMs, we propose a masked alignment retrieval method that uses SQLFlow as both a knowledge base and training data for the retrieval model, enabling structure-aware example matching via fine-grained NL-SQL alignments. Experiments show that our retrieval strategy outperforms existing example retrieval methods, highlighting the combined value of SQLFlow’s data quality and our retrieval technique. Overall, our work provides a scalable, data-centric foundation for advancing Text-to-SQL systems and underscores the importance of structured, high-fidelity data in modern AI development. Our code is available at https://github.com/TechNomad-ds/Text2SQL-Flow.
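
The executability-verification stage can be sketched as a try/execute filter against SQLite; the pipeline's other checks (e.g., NL-SQL correspondence verification) are omitted here, and the database path is a placeholder.

```python
import sqlite3

def sql_executes(query: str, db_path: str) -> bool:
    """Executability check used to filter augmented SQL (sketch).

    A candidate query is kept only if it parses and runs against the
    selected database without error; result contents are not inspected.
    """
    try:
        with sqlite3.connect(db_path) as conn:
            conn.execute(f"EXPLAIN QUERY PLAN {query}")  # cheap parse/plan check
            conn.execute(query).fetchmany(1)             # then actually run it
        return True
    except sqlite3.Error:
        return False
```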

[90] Learning Tractable Distributions Of Language Model Continuations

Gwen Yidou-Weng, Ian Li, Anji Liu, Oliver Broadrick, Yuchen Cui, Guy Van den Broeck, Benjie Wang

Main category: cs.CL

TL;DR: LTLA is a hybrid method that combines base-LM embeddings with a globally learned tractable surrogate (HMM) to enable efficient lookahead control for autoregressive language models while avoiding computational inefficiencies.

DetailsMotivation: Controlled generation requires sequence-level constraints that depend on future tokens, making exact conditioning of autoregressive LMs intractable. Existing tractable surrogates like HMMs are often weakly context-aware, while adding neural context typically introduces efficiency problems like vocabulary-sized prefix rescoring or predicting new HMMs per prefix.

Method: LTLA uses base-LM embeddings to condition a globally learned tractable surrogate: a neural head predicts only a prefix-dependent latent prior, while a shared HMM answers continuation queries exactly. It avoids efficiency traps by: 1) scoring all next-token candidates via single batched HMM forward update instead of vocabulary-sized prefix rescoring, and 2) learning one shared HMM and conditioning only the latent prior, enabling reuse of cached future-likelihood messages across decoding steps.

Result: LTLA improves continuation likelihood over standard HMM surrogates, enables lookahead control for vision-language models by incorporating continuous context, achieves 100% syntactic constraint satisfaction, and improves detoxification while adding only 14% decoding-time overhead.

Conclusion: LTLA provides an efficient hybrid approach for controlled generation that combines neural context with tractable surrogates, enabling effective lookahead control while maintaining computational efficiency.

Abstract: Controlled generation imposes sequence-level constraints (syntax, style, safety) that depend on future tokens, making exact conditioning of an autoregressive LM intractable. Tractable surrogates such as HMMs can approximate continuation distributions and steer decoding, but standard surrogates are often weakly context-aware. We propose Learning to Look Ahead (LTLA), a hybrid method that uses base-LM embeddings to condition a globally learned tractable surrogate: a neural head predicts only a prefix-dependent latent prior, while a shared HMM answers continuation queries exactly. LTLA is designed to avoid two common efficiency traps when adding neural context. First, it avoids vocabulary-sized prefix rescoring (V extra LM evaluations) by scoring all next-token candidates via a single batched HMM forward update. Second, it avoids predicting a new HMM per prefix by learning one shared HMM and conditioning only the latent prior, which enables reuse of cached future-likelihood (backward) messages across decoding steps. Empirically, LTLA improves continuation likelihood over standard HMM surrogates, enables lookahead control for vision–language models by incorporating continuous context, achieves 100% syntactic constraint satisfaction, and improves detoxification while adding only a 14% decoding-time overhead.
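
The single batched forward update can be sketched in a few lines: advance the forward message one step, then combine it with the emission matrix and the cached backward message to score every vocabulary item at once. Shapes and variable names below are illustrative assumptions.

```python
import numpy as np

def score_next_tokens(alpha: np.ndarray, A: np.ndarray, B: np.ndarray,
                      beta: np.ndarray) -> np.ndarray:
    """Score every next-token candidate in one forward update (sketch).

    alpha: (H,) forward message over hidden states, already conditioned
           on the prefix-dependent latent prior.
    A:     (H, H) state transition matrix.
    B:     (H, V) emission probabilities.
    beta:  (H,) cached future-likelihood (backward) message encoding
           the continuation constraint.
    Returns an unnormalized (V,) score for all tokens at once, instead
    of V separate evaluations.
    """
    predicted = alpha @ A                                     # one-step state update
    return (predicted[:, None] * B * beta[:, None]).sum(0)    # (V,) scores

H, V = 64, 1000
rng = np.random.default_rng(0)
alpha = rng.dirichlet(np.ones(H))
A = rng.dirichlet(np.ones(H), size=H)      # rows are next-state distributions
B = rng.dirichlet(np.ones(V), size=H)      # rows are emission distributions
beta = rng.random(H)
scores = score_next_tokens(alpha, A, B, beta)
```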

[91] SAGE: An Agentic Explainer Framework for Interpreting SAE Features in Language Models

Jiaojiao Han, Wujiang Xu, Mingyu Jin, Mengnan Du

Main category: cs.CL

TL;DR: SAGE is an agent-based framework for interpreting features in sparse autoencoders (SAEs) that transforms feature explanation from passive generation to an active, iterative process of hypothesis testing and refinement.

DetailsMotivation: While sparse autoencoders (SAEs) show promise for decomposing LLM representations into interpretable features, explaining what these features actually represent remains challenging. Current methods are often passive and single-pass, lacking rigorous validation of explanations.

Method: SAGE uses an agent-based framework that systematically formulates multiple explanations for each SAE feature, designs targeted experiments to test them, and iteratively refines explanations based on empirical activation feedback from the LLM.

Result: Experiments on features from SAEs of diverse language models show that SAGE produces explanations with significantly higher generative and predictive accuracy compared to state-of-the-art baselines.

Conclusion: SAGE provides a more rigorous, active approach to feature interpretation in SAEs, improving explanation quality and advancing interpretability research for LLMs.

Abstract: Large language models (LLMs) have achieved remarkable progress, yet their internal mechanisms remain largely opaque, posing a significant challenge to their safe and reliable deployment. Sparse autoencoders (SAEs) have emerged as a promising tool for decomposing LLM representations into more interpretable features, but explaining the features captured by SAEs remains a challenging task. In this work, we propose SAGE (SAE AGentic Explainer), an agent-based framework that recasts feature interpretation from a passive, single-pass generation task into an active, explanation-driven process. SAGE implements a rigorous methodology by systematically formulating multiple explanations for each feature, designing targeted experiments to test them, and iteratively refining explanations based on empirical activation feedback. Experiments on features from SAEs of diverse language models demonstrate that SAGE produces explanations with significantly higher generative and predictive accuracy compared to state-of-the-art baselines.

[92] Short-Context Dominance: How Much Local Context Natural Language Actually Needs?

Vala Vakilian, Zimeng Wang, Ankit Singh Rawat, Christos Thrampoulidis

Main category: cs.CL

TL;DR: The paper investigates the short-context dominance hypothesis, finding that 75-80% of sequences only need the last 96 tokens for accurate next-token prediction, introduces DaMCL to detect challenging long-context sequences, and develops a decoding algorithm to mitigate bias from short-context dominance.

DetailsMotivation: The motivation is to understand the short-context dominance hypothesis - whether most sequences can be accurately predicted using only a small local prefix rather than full context. This has implications for LLM efficiency and understanding their contextual dependencies.

Method: 1) Measure Minimum Context Length (MCL) needed for accurate full-context predictions using LLMs as statistical oracles. 2) Introduce Distributionally Aware MCL (DaMCL) as a practical proxy that doesn’t require next-token knowledge. 3) Develop a decoding algorithm that uses DaMCL detection to boost long-range-relevant tokens.

Result: For sequences with 1-7k tokens, 75-80% require only the last 96 tokens at most. DaMCL effectively detects long vs. short context sequences, and the proposed decoding algorithm improves performance across Q&A tasks and model architectures by mitigating short-context bias.

Conclusion: Short-context dominance is prevalent in LLMs, but can be detected and mitigated. The DaMCL metric and associated decoding algorithm provide practical tools to handle challenging long-context sequences and improve LLM performance by reducing bias toward short-range dependencies.

Abstract: We investigate the short-context dominance hypothesis: that for most sequences, a small local prefix suffices to predict their next tokens. Using large language models as statistical oracles, we measure the minimum context length (MCL) needed to reproduce accurate full-context predictions across datasets with sequences of varying lengths. For sequences with 1-7k tokens from long-context documents, we consistently find that 75-80% require only the last 96 tokens at most. Given the dominance of short-context tokens, we then ask whether it is possible to detect challenging long-context sequences for which a short local prefix does not suffice for prediction. We introduce a practical proxy to MCL, called Distributionally Aware MCL (DaMCL), that does not require knowledge of the actual next-token and is compatible with sampling strategies beyond greedy decoding. Our experiments validate that simple thresholding of the metric defining DaMCL achieves high performance in detecting long vs. short context sequences. Finally, to counter the bias that short-context dominance induces in LLM output distributions, we develop an intuitive decoding algorithm that leverages our detector to identify and boost tokens that are long-range-relevant. Across Q&A tasks and model architectures, we confirm that mitigating the bias improves performance.
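
Measuring MCL reduces to comparing truncated-suffix predictions against the full-context prediction. The sketch below uses greedy decoding as the oracle criterion, whereas DaMCL generalizes beyond greedy decoding and does not require the actual next token; the length grid is illustrative.

```python
import torch

@torch.no_grad()
def minimum_context_length(model, ids: torch.Tensor,
                           lengths=(16, 32, 64, 96, 128)) -> int:
    """Smallest suffix length reproducing the full-context prediction (sketch).

    `ids` is a (1, T) token sequence; the model acts as the
    statistical oracle.
    """
    full_pred = model(ids).logits[0, -1].argmax().item()
    T = ids.size(1)
    for L in lengths:
        if L >= T:
            break
        short_pred = model(ids[:, -L:]).logits[0, -1].argmax().item()
        if short_pred == full_pred:
            return L          # this local suffix already suffices
    return T                  # the whole context was needed
```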

[93] Does Memory Need Graphs? A Unified Framework and Empirical Analysis for Long-Term Dialog Memory

Sen Hu, Yuxiang Wei, Jiaxin Ran, Zhiyuan Yao, Xueran Han, Huacan Wang, Ronghao Chen, Lei Zou

Main category: cs.CL

TL;DR: Experimental analysis of dialog memory architectures shows performance differences are driven by foundational system settings rather than specific architectural innovations, with graph-based approaches not consistently outperforming non-graph methods.

DetailsMotivation: Graph structures are increasingly used in dialog memory systems but empirical findings on their effectiveness remain inconsistent, making it unclear which design choices truly matter for long-term dialog memory performance.

Method: Introduce a unified framework that decomposes dialog memory systems into core components supporting both graph-based and non-graph approaches. Conduct controlled, stage-wise experiments on LongMemEval and HaluMem datasets comparing design choices in memory representation, organization, maintenance, and retrieval.

Result: Results show many performance differences are driven by foundational system settings rather than specific architectural innovations. Graph-based approaches do not consistently outperform non-graph methods across different evaluation scenarios.

Conclusion: Identifies stable and reliable strong baselines for future dialog memory research, emphasizing the importance of foundational system design over architectural novelty for dialog memory systems.

Abstract: Graph structures are increasingly used in dialog memory systems, but empirical findings on their effectiveness remain inconsistent, making it unclear which design choices truly matter. We present an experimental, system-oriented analysis of long-term dialog memory architectures. We introduce a unified framework that decomposes dialog memory systems into core components and supports both graph-based and non-graph approaches. Under this framework, we conduct controlled, stage-wise experiments on LongMemEval and HaluMem, comparing common design choices in memory representation, organization, maintenance, and retrieval. Our results show that many performance differences are driven by foundational system settings rather than specific architectural innovations. Based on these findings, we identify stable and reliable strong baselines for future dialog memory research. Code is available at https://github.com/AvatarMemory/UnifiedMem

[94] Structured Episodic Event Memory

Zhengxuan Lu, Dongfang Li, Yukun Shi, Beilun Wang, Longyue Wang, Baotian Hu

Main category: cs.CL

TL;DR: SEEM is a hierarchical memory framework for LLMs that combines graph memory for relational facts with episodic memory for narrative progression, improving coherence in long-term agent interactions.

DetailsMotivation: Current LLM memory approaches using static RAG are scattered and fail to capture structural dependencies needed for complex reasoning. Autonomous agents need better cognitive organization to model dynamic, associative long-term interactions.

Method: Proposes Structured Episodic Event Memory (SEEM) with hierarchical framework: graph memory layer for relational facts + dynamic episodic memory layer for narrative progression. Uses Episodic Event Frames (EEFs) with provenance pointers, and introduces associative fusion with Reverse Provenance Expansion (RPE) to reconstruct coherent narratives from fragmented evidence.

Result: SEEM significantly outperforms baselines on LoCoMo and LongMemEval benchmarks, enabling agents to maintain superior narrative coherence and logical consistency.

Conclusion: SEEM provides a cognitive-inspired memory architecture that addresses limitations of current LLM memory systems, improving agent performance in complex, long-term interactions through structured episodic organization.

Abstract: Current approaches to memory in Large Language Models (LLMs) predominantly rely on static Retrieval-Augmented Generation (RAG), which often results in scattered retrieval and fails to capture the structural dependencies required for complex reasoning. For autonomous agents, these passive and flat architectures lack the cognitive organization necessary to model the dynamic and associative nature of long-term interaction. To address this, we propose Structured Episodic Event Memory (SEEM), a hierarchical framework that synergizes a graph memory layer for relational facts with a dynamic episodic memory layer for narrative progression. Grounded in cognitive frame theory, SEEM transforms interaction streams into structured Episodic Event Frames (EEFs) anchored by precise provenance pointers. Furthermore, we introduce an agentic associative fusion and Reverse Provenance Expansion (RPE) mechanism to reconstruct coherent narrative contexts from fragmented evidence. Experimental results on the LoCoMo and LongMemEval benchmarks demonstrate that SEEM significantly outperforms baselines, enabling agents to maintain superior narrative coherence and logical consistency.

[95] Universal computation is intrinsic to language model decoding

Alex Lewandowski, Marlos C. Machado, Dale Schuurmans

Main category: cs.CL

TL;DR: Language models can perform universal computation through autoregressive chaining, even when randomly initialized, with training primarily improving programmability rather than computational expressiveness.

DetailsMotivation: To understand the fundamental computational capabilities of language models and determine whether their computational power emerges from training or is intrinsic to their architecture.

Method: Proving mathematically that chaining a language model’s autoregressive output is sufficient for universal computation, and demonstrating that randomly initialized models possess this capability before training.

Result: Language models can simulate execution of any algorithm on any input through autoregressive chaining, with computational expressiveness being intrinsic rather than learned through training.

Conclusion: Training primarily improves programmability (ease of finding suitable prompts) rather than computational capabilities, reframing the challenge as one of natural language interface design.

Abstract: Language models now provide an interface to express and often solve general problems in natural language, yet their ultimate computational capabilities remain a major topic of scientific debate. Unlike a formal computer, a language model is trained to autoregressively predict successive elements in human-generated text. We prove that chaining a language model’s autoregressive output is sufficient to perform universal computation. That is, a language model can simulate the execution of any algorithm on any input. The challenge of eliciting desired computational behaviour can thus be reframed in terms of programmability: the ease of finding a suitable prompt. Strikingly, we demonstrate that even randomly initialized language models are capable of universal computation before training. This implies that training does not give rise to computational expressiveness – rather, it improves programmability, enabling a natural language interface for accessing these intrinsic capabilities.

[96] Truth with a Twist: The Rhetoric of Persuasion in Professional vs. Community-Authored Fact-Checks

Olesya Razuvayevskaya, Kalina Bontcheva

Main category: cs.CL

TL;DR: Large-scale comparison of persuasion techniques in crowd-sourced vs professional fact-checking debunks shows no significant difference in persuasive language use, with systematic rhetorical differences reflecting institutional norms.

DetailsMotivation: To understand how persuasion techniques differ between crowd-sourced (Community Notes) and professional fact-checking ecosystems, testing the hypothesis that community-produced debunks rely more on subjective/persuasive wording.

Method: Analyzed extensive datasets from Community Notes (CNs), EUvsDisinfo, and Database of Known Fakes (DBKF) to quantify prevalence and types of persuasion techniques across fact-checking ecosystems using computational analysis.

Result: No evidence that CNs contain higher average number of persuasion techniques than professional fact-checks; identified systematic rhetorical differences reflecting institutional norms; crowd raters slightly favor persuasive elements but penalize problematic rhetoric.

Conclusion: Crowd-sourced fact-checking doesn’t rely more on persuasive language than professional approaches, with systematic differences in rhetorical style; crowd evaluation mechanisms effectively regulate problematic persuasion techniques.

Abstract: This study presents the first large-scale comparison of persuasion techniques present in crowd- versus professionally-written debunks. Using extensive datasets from Community Notes (CNs), EUvsDisinfo, and the Database of Known Fakes (DBKF), we quantify the prevalence and types of persuasion techniques across these fact-checking ecosystems. Contrary to the prior hypothesis that community-produced debunks rely more heavily on subjective or persuasive wording, we find no evidence that CNs contain a higher average number of persuasion techniques than professional fact-checks. We additionally identify systematic rhetorical differences between CNs and professional debunking efforts, reflecting differences in institutional norms and topical coverage. Finally, we examine how the crowd evaluates persuasive language in CNs and show that, although notes with more persuasive elements receive slightly higher overall helpfulness ratings, crowd raters are effective at penalising the use of particular problematic rhetorical means.

[97] Online Density-Based Clustering for Real-Time Narrative Evolution Monitoring

Ostap Vykhopen, Viktoria Skorik, Maksym Tereshchenko, Veronika Solopova

Main category: cs.CL

TL;DR: Online density-based clustering methods outperform batch HDBSCAN for scalable, real-time narrative monitoring in multilingual social media streams.

DetailsMotivation: Batch clustering methods like HDBSCAN face scalability challenges for continuous social media data streams in narrative intelligence systems, requiring full retraining for each time window and limiting real-time adaptability.

Method: Replace offline HDBSCAN with online density-based clustering algorithms in a production narrative report generation pipeline; evaluate using sliding-window simulations on historical Ukrainian information space data with standard clustering metrics, narrative-specific measures, and human validation.

Result: DenStream achieved the strongest overall performance, revealing trade-offs between temporal stability and narrative coherence; online methods bridge the gap between batch clustering and streaming requirements.

Conclusion: Online density-based clustering enables scalable, real-time narrative monitoring for social media intelligence systems, with DenStream showing particular promise for balancing cluster quality and computational efficiency.

Abstract: Automated narrative intelligence systems for social media monitoring face significant scalability challenges when relying on batch clustering methods to process continuous data streams. We investigate replacing offline HDBSCAN with online density-based clustering algorithms in a production narrative report generation pipeline that processes large volumes of multilingual social media data. While HDBSCAN effectively discovers hierarchical clusters and handles noise, its batch-only nature requires full retraining for each time window, limiting scalability and real-time adaptability. We evaluate online clustering methods with respect to cluster quality, computational efficiency, memory footprint, and integration with downstream narrative extraction. Our evaluation combines standard clustering metrics, narrative-specific measures, and human validation of cluster correctness to assess both structural quality and semantic interpretability. Experiments using sliding-window simulations on historical data from the Ukrainian information space reveal trade-offs between temporal stability and narrative coherence, with DenStream achieving the strongest overall performance. These findings bridge the gap between batch-oriented clustering approaches and the streaming requirements of large-scale narrative monitoring systems.
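
A streaming sketch using the DenStream implementation in the river library, assuming its `learn_one`/`predict_one` interface; parameter names and values here are illustrative, may vary across river versions, and are not the paper's settings.

```python
from river import cluster

# Illustrative parameters, not tuned for any real stream.
den = cluster.DenStream(decaying_factor=0.01, epsilon=0.5, mu=3, beta=0.8)

def process_stream(embedded_posts):
    """Assign each incoming post to an evolving narrative cluster (sketch)."""
    for vec in embedded_posts:
        x = {i: float(v) for i, v in enumerate(vec)}  # river expects feature dicts
        den.learn_one(x)                # update micro-clusters online
        yield den.predict_one(x)        # current narrative cluster id
```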

[98] Advancing General-Purpose Reasoning Models with Modular Gradient Surgery

Min Cai, Yu Liang, Longzheng Wang, Yan Wang, Yueyang Zhang, Long Xia, Zhiyuan Sun, Xi Ye, Daiting Shi

Main category: cs.CL

TL;DR: Modular Gradient Surgery (MGS) addresses cross-domain interference in multi-task reinforcement learning for large reasoning models by resolving gradient conflicts at the module level within transformers.

DetailsMotivation: Training single general-purpose large reasoning models across diverse domains is challenging due to domain heterogeneity, causing substantial cross-domain interference that limits overall gains from reinforcement learning.

Method: Introduces Modular Gradient Surgery (MGS) which resolves gradient conflicts at the module level within transformer architectures, specifically applied to Llama and Qwen models across math, general chat, and instruction following domains.

Result: MGS achieves average improvements of 4.3 (16.6%) points for Llama and 4.5 (11.1%) points for Qwen over standard multi-task RL across three representative domains, and remains effective under prolonged training.

Conclusion: The study clarifies sources of interference in multi-domain RL and presents an effective solution (MGS) for training general-purpose large reasoning models that addresses gradient conflicts at the module level.

Abstract: Reinforcement learning (RL) has played a central role in recent advances in large reasoning models (LRMs), yielding strong gains in verifiable and open-ended reasoning. However, training a single general-purpose LRM across diverse domains remains challenging due to pronounced domain heterogeneity. Through a systematic study of two widely used strategies, Sequential RL and Mixed RL, we find that both incur substantial cross-domain interference at the behavioral and gradient levels, resulting in limited overall gains. To address these challenges, we introduce Modular Gradient Surgery (MGS), which resolves gradient conflicts at the module level within the transformer. When applied to Llama and Qwen models, MGS achieves average improvements of 4.3 (16.6%) and 4.5 (11.1%) points, respectively, over standard multi-task RL across three representative domains (math, general chat, and instruction following). Further analysis demonstrates that MGS remains effective under prolonged training. Overall, our study clarifies the sources of interference in multi-domain RL and presents an effective solution for training general-purpose LRMs.
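
A minimal sketch of module-level gradient surgery, assuming a PCGrad-style projection as the conflict-resolution rule; the actual MGS rule may differ, but the per-module granularity is the point being illustrated.

```python
import torch

def module_gradient_surgery(grads_a: dict, grads_b: dict) -> dict:
    """Resolve cross-domain gradient conflicts per module (sketch).

    grads_a / grads_b map module names to flattened gradients from two
    domains.  When a module's gradients conflict (negative dot product),
    the conflicting component is projected out before merging.
    """
    merged = {}
    for name, ga in grads_a.items():
        gb = grads_b[name]
        dot = torch.dot(ga, gb)
        if dot < 0:  # conflict localized to this module
            ga = ga - dot / gb.norm().pow(2).clamp(min=1e-12) * gb
        merged[name] = ga + gb
    return merged
```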

[99] A Large-Scale Dataset for Molecular Structure-Language Description via a Rule-Regularized Method

Feiyang Cai, Guijuan He, Yi Hu, Jingjing Wang, Joshua Luo, Tianyu Zhu, Srikanth Pilla, Gang Li, Ling Liu, Feng Luo

Main category: cs.CL

TL;DR: Automated framework for generating precise molecular structure descriptions using IUPAC name parsing and LLM-guided natural language generation, creating 163k molecule-description pairs with 98.6% precision.

DetailsMotivation: Molecular function depends on structure, requiring accurate structure-language alignment for LLMs to reason about chemical tasks. Human annotation is too costly for large-scale, high-quality datasets of structure-grounded descriptions.

Method: Extends rule-based chemical nomenclature parser to interpret IUPAC names and construct enriched XML metadata encoding molecular structure. Uses this metadata to guide LLMs in producing accurate natural-language descriptions.

Result: Created large-scale dataset of ~163k molecule-description pairs. Validation on 2,000 molecules showed 98.6% description precision through combined LLM-based and expert human evaluation.

Conclusion: Provides reliable foundation for molecule-language alignment. Annotation method is extensible to larger datasets and broader chemical tasks relying on structural descriptions.

Abstract: Molecular function is largely determined by structure. Accurately aligning molecular structure with natural language is therefore essential for enabling large language models (LLMs) to reason about downstream chemical tasks. However, the substantial cost of human annotation makes it infeasible to construct large-scale, high-quality datasets of structure-grounded descriptions. In this work, we propose a fully automated annotation framework for generating precise molecular structure descriptions at scale. Our approach builds upon and extends a rule-based chemical nomenclature parser to interpret IUPAC names and construct enriched, structured XML metadata that explicitly encodes molecular structure. This metadata is then used to guide LLMs in producing accurate natural-language descriptions. Using this framework, we curate a large-scale dataset of approximately 163k molecule-description pairs. A rigorous validation protocol combining LLM-based and expert human evaluation on a subset of 2,000 molecules demonstrates a high description precision of 98.6%. The resulting dataset provides a reliable foundation for future molecule-language alignment, and the proposed annotation method is readily extensible to larger datasets and broader chemical tasks that rely on structural descriptions.

[100] Reward-free Alignment for Conflicting Objectives

Peter L. Chen, Xiaopeng Li, Xi Chen, Tianyi Lin

Main category: cs.CL

TL;DR: RACO: A reward-free alignment framework for LLMs that handles multiple conflicting objectives using clipped conflict-averse gradient descent with convergence guarantees.

DetailsMotivation: Real-world alignment problems often involve multiple conflicting objectives where naive preference aggregation leads to unstable training and poor trade-offs. Existing multi-objective approaches rely on explicit reward models, adding complexity and potentially distorting user preferences.

Method: Proposes Reward-free Alignment framework for Conflicted Objectives (RACO) that directly uses pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent. Provides convergence guarantees to Pareto-critical points respecting user-specified objective weights.

Result: Experiments on multi-objective summarization and safety alignment tasks across multiple LLM families (Qwen 3, Llama 3, Gemma 3) show RACO consistently achieves better Pareto trade-offs compared to existing multi-objective alignment baselines.

Conclusion: RACO provides an effective reward-free approach for multi-objective LLM alignment with theoretical guarantees and practical improvements over existing methods.

Abstract: Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world alignment problems involve multiple conflicting objectives, where naive aggregation of preferences can lead to unstable training and poor trade-offs. In particular, weighted loss methods may fail to identify update directions that simultaneously improve all objectives, and existing multi-objective approaches often rely on explicit reward models, introducing additional complexity and distorting user-specified preferences. The contributions of this paper are two-fold. First, we propose a Reward-free Alignment framework for Conflicted Objectives (RACO) that directly leverages pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent. We provide convergence guarantees to Pareto-critical points that respect user-specified objective weights, and further show that clipping can strictly improve convergence rate in the two-objective setting. Second, we improve our method using some heuristics and conduct experiments to demonstrate the compatibility of the proposed framework for LLM alignment. Both qualitative and quantitative evaluations on multi-objective summarization and safety alignment tasks across multiple LLM families (Qwen 3, Llama 3, Gemma 3) show that our method consistently achieves better Pareto trade-offs compared to existing multi-objective alignment baselines.

[101] Rethinking Memory Mechanisms of Foundation Agents in the Second Half: A Survey

Wei-Chieh Huang, Weizhi Zhang, Yueqing Liang, Yuanchen Bei, Yankai Chen, Tao Feng, Xinyu Pan, Zhen Tan, Yu Wang, Tianxin Wei, Shanglin Wu, Ruiyao Xu, Liangwei Yang, Rui Yang, Wooseong Yang, Chin-Yuan Yeh, Hanrong Zhang, Haozhen Zhang, Siqi Zhu, Henry Peng Zou, Wanjia Zhao, Song Wang, Wujiang Xu, Zixuan Ke, Zheng Hui, Dawei Li, Yaozu Wu, Langzhou He, Chen Wang, Xiongxiao Xu, Baixiang Huang, Juntao Tan, Shelby Heinecke, Huan Wang, Caiming Xiong, Ahmed A. Metwally, Jun Yan, Chen-Yu Lee, Hanqing Zeng, Yinglong Xia, Xiaokai Wei, Ali Payani, Yu Wang, Haitong Ma, Wenya Wang, Chenguang Wang, Yu Zhang, Xin Wang, Yongfeng Zhang, Jiaxuan You, Hanghang Tong, Xiao Luo, Xue Liu, Yizhou Sun, Wei Wang, Julian McAuley, James Zou, Jiawei Han, Philip S. Yu, Kai Shu

Main category: cs.CL

TL;DR: Survey paper on foundation agent memory systems, analyzing memory substrates, cognitive mechanisms, and memory subjects for AI agents in long-horizon, dynamic environments.

DetailsMotivation: The AI field is shifting from benchmark-focused model innovations to real-world utility, where agents face context explosion in dynamic environments. Memory emerges as the critical solution to address the utility gap in long-horizon, user-dependent interactions.

Method: Provides a unified framework for foundation agent memory across three dimensions: memory substrate (internal/external), cognitive mechanism (episodic, semantic, sensory, working, procedural), and memory subject (agent-/user-centric). Analyzes memory instantiation under different agent topologies and learning policies for memory operations.

Result: Comprehensive survey organizing hundreds of recent memory papers into a coherent framework, highlighting evaluation benchmarks and metrics for assessing memory utility in real-world applications.

Conclusion: Memory is essential for AI agents to achieve real utility in complex environments. The survey provides a structured understanding of foundation agent memory systems and outlines open challenges and future research directions.

Abstract: The research of artificial intelligence is undergoing a paradigm shift from prioritizing model innovations over benchmark scores towards emphasizing problem definition and rigorous real-world evaluation. As the field enters the “second half,” the central challenge becomes real utility in long-horizon, dynamic, and user-dependent environments, where agents face context explosion and must continuously accumulate, manage, and selectively reuse large volumes of information across extended interactions. Memory, with hundreds of papers released this year, therefore emerges as the critical solution to fill the utility gap. In this survey, we provide a unified view of foundation agent memory along three dimensions: memory substrate (internal and external), cognitive mechanism (episodic, semantic, sensory, working, and procedural), and memory subject (agent- and user-centric). We then analyze how memory is instantiated and operated under different agent topologies and highlight learning policies over memory operations. Finally, we review evaluation benchmarks and metrics for assessing memory utility, and outline various open challenges and future directions.

[102] DLLM Agent: See Farther, Run Faster

Huiling Zhen, Weizhe Lin, Renxi Liu, Kai Han, Yiming Li, Yuchuan Tian, Hanting Chen, Xiaoguang Li, Xiaosong Li, Chen Chen, Xianzhi Yu, Mingxuan Yuan, Youliang Yan, Peifeng Qin, Jun Wang, Yu Wang, Dacheng Tao, Yunhe Wang

Main category: cs.CL

TL;DR: Diffusion LLMs in agent workflows show 30%+ faster end-to-end execution than autoregressive agents with comparable accuracy, requiring fewer tool invocations and interaction rounds due to better planning.

DetailsMotivation: To explore whether diffusion-based LLMs offer systematic advantages over autoregressive LLMs for agentic multi-step decision making, particularly in planning and tool-use behaviors, when generation paradigms change but agent frameworks remain fixed.

Method: Created controlled comparison by instantiating both diffusion LLM (DLLM) and autoregressive (AR) backbones within the same DeepDiver agent workflow, performing matched agent-oriented fine-tuning on identical trajectory data to create directly comparable agents.

Result: DLLM Agents were on average over 30% faster end-to-end than AR agents with comparable accuracy (some cases exceeding 8x speedup), required fewer interaction rounds and tool invocations, and showed higher planner hit rates with less backtracking.

Conclusion: Diffusion backbones offer efficiency advantages for agentic decision making but require careful handling of tool-call failures and attention masking for multi-turn inputs; they exhibit stronger global planning signals than autoregressive alternatives.

Abstract: Diffusion large language models (DLLMs) have emerged as an alternative to autoregressive (AR) decoding with appealing efficiency and modeling properties, yet their implications for agentic multi-step decision making remain underexplored. We ask a concrete question: when the generation paradigm is changed but the agent framework and supervision are held fixed, do diffusion backbones induce systematically different planning and tool-use behaviors, and do these differences translate into end-to-end efficiency gains? We study this in a controlled setting by instantiating DLLM and AR backbones within the same agent workflow (DeepDiver) and performing matched agent-oriented fine-tuning on the same trajectory data, yielding diffusion-backed DLLM Agents and directly comparable AR agents. Across benchmarks and case studies, we find that, at comparable accuracy, DLLM Agents are on average over 30% faster end to end than AR agents, with some cases exceeding 8x speedup. Conditioned on correct task completion, DLLM Agents also require fewer interaction rounds and tool invocations, consistent with higher planner hit rates that converge earlier to a correct action path with less backtracking. We further identify two practical considerations for deploying diffusion backbones in tool-using agents. First, naive DLLM policies are more prone to structured tool-call failures, necessitating stronger tool-call-specific training to emit valid schemas and arguments. Second, for multi-turn inputs interleaving context and action spans, diffusion-style span corruption requires aligned attention masking to avoid spurious context-action information flow; without such alignment, performance degrades. Finally, we analyze attention dynamics across workflow stages and observe paradigm-specific coordination patterns, suggesting stronger global planning signals in diffusion-backed agents.

[103] Emergent Structured Representations Support Flexible In-Context Inference in Large Language Models

Ningyu Xu, Qi Zhang, Xipeng Qiu, Xuanjing Huang

Main category: cs.CL

TL;DR: LLMs construct and functionally use structured conceptual representations for in-context inference, with causal evidence showing these representations emerge in middle-late layers and are actively used for predictions.

DetailsMotivation: While LLMs show emergent reasoning behaviors and have been found to contain human-like conceptual representations, it's unclear whether they actually functionally rely on these representations for reasoning or if they're just epiphenomena.

Method: Investigates LLM internal processing during in-context concept inference using causal mediation analyses to determine functional role of conceptual subspaces, examining layer-wise progression of representation construction and use.

Result: Reveals conceptual subspace emerging in middle to late layers with persistent structure across contexts; demonstrates causal role in model predictions via mediation analyses; identifies layer-wise progression where early-middle layers integrate context to construct subspace, later layers use it for predictions.

Conclusion: LLMs dynamically construct and functionally use structured latent representations for inference, providing evidence for computational processes underlying flexible adaptation and showing these representations are not just epiphenomena but causally important.

Abstract: Large language models (LLMs) exhibit emergent behaviors suggestive of human-like reasoning. While recent work has identified structured, human-like conceptual representations within these models, it remains unclear whether they functionally rely on such representations for reasoning. Here we investigate the internal processing of LLMs during in-context concept inference. Our results reveal a conceptual subspace emerging in middle to late layers, whose representational structure persists across contexts. Using causal mediation analyses, we demonstrate that this subspace is not merely an epiphenomenon but is functionally central to model predictions, establishing its causal role in inference. We further identify a layer-wise progression where attention heads in early-to-middle layers integrate contextual cues to construct and refine the subspace, which is subsequently leveraged by later layers to generate predictions. Together, these findings provide evidence that LLMs dynamically construct and use structured, latent representations in context for inference, offering insights into the computational processes underlying flexible adaptation.
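
As a rough illustration of the causal test, the sketch below patches only the component of a hidden state that lies in a candidate conceptual subspace; if swapping this low-dimensional component changes predictions the way full-state patching does, the subspace is functionally implicated. The projection form is an assumption, not the paper's exact mediation procedure.

    import torch

    def patch_concept_subspace(h_clean, h_counterfactual, U):
        # h_*: (d,) hidden states from two forward passes at the same layer
        # U: (d, k) orthonormal basis spanning the candidate conceptual subspace
        P = U @ U.T  # projector onto the subspace
        return h_clean - P @ h_clean + P @ h_counterfactual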

[104] Improving Data and Reward Design for Scientific Reasoning in Large Language Models

Zijie Chen, Zhenghao Lin, Xiao Liu, Zhenzhong Lan, Yeyun Gong, Peng Cheng

Main category: cs.CL

TL;DR: Dr. SCI pipeline transforms science data into structured dataset with 1M questions, implements novel post-training pipeline with exploration-expanding SFT, dynamic difficulty curriculum, and rubric-guided RL for scientific reasoning improvement.

DetailsMotivation: Open-ended science questions challenge LLMs due to unreliable supervision and evaluation; need systematic data processing and reward design for scientific post-training.

Method: Created Dr. SCI dataset (1M STEM questions) with verifiable/open-ended splits, difficulty annotation, rubrics; post-training pipeline with exploration-expanding SFT, dynamic difficulty curriculum, and SciRubric-guided RL.

Result: Qwen3-4B-Base achieves 63.2 on GPQA-diamond and 32.4 on GPQA-general, outperforming o1-mini and GPT-4o baselines in scientific reasoning, especially open-ended settings.

Conclusion: Dr. SCI pipeline enables effective scientific post-training through systematic data processing and novel training components, significantly improving LLM performance on scientific reasoning tasks.

Abstract: Solving open-ended science questions remains challenging for large language models, particularly due to inherently unreliable supervision and evaluation. The bottleneck lies in the data construction and reward design for scientific post-training. We develop a large-scale, systematic data processing pipeline that transforms heterogeneous open-source science data into the Dr. SCI dataset, which comprises 1M questions across eight STEM subjects, with explicit verifiable/open-ended splits, scalable difficulty annotation, and fine-grained rubrics that operationalize evaluation for open-ended answers. Building on this dataset, we propose the Dr. SCI post-training pipeline, which redesigns the standard SFT -> RL workflow through three components: (i) Exploration-Expanding SFT, which broadens the model’s reasoning pattern coverage prior to RL; (ii) Dynamic Difficulty Curriculum, which adapts training data to the model’s evolving scientific capability; and (iii) SciRubric-Guided RL, which enables stable reinforcement learning on open-ended scientific questions via rubric-based evaluation with explicit answer correctness. Qwen3-4B-Base trained with the Dr. SCI pipeline achieves 63.2 on GPQA-diamond and 32.4 on GPQA-general and consistently improves over strong post-trained baselines such as o1-mini and GPT-4o, demonstrating substantial gains in scientific reasoning, especially in open-ended settings.
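
A minimal sketch of a rubric-gated reward in the spirit of SciRubric-Guided RL; the judge interface and the equal weighting between rubric coverage and answer correctness are assumptions, not the paper's configuration.

    def rubric_reward(answer, rubric_items, judge, is_correct):
        # judge(answer, item) -> score in [0, 1] for one rubric criterion (hypothetical)
        coverage = sum(judge(answer, item) for item in rubric_items) / max(len(rubric_items), 1)
        return 0.5 * coverage + 0.5 * float(is_correct)  # explicit correctness term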

[105] Fundamental Reasoning Paradigms Induce Out-of-Domain Generalization in Language Models

Mingzi Cao, Xingwei Tan, Mahmud Elahi Akhter, Marco Valentino, Maria Liakata, Xi Wang, Nikolaos Aletras

Main category: cs.CL

TL;DR: The paper investigates how fundamental reasoning paradigms (deduction, induction, abduction) influence LLM generalization, using symbolic tasks to train models and evaluating on realistic natural language tasks.

DetailsMotivation: While improving LLM reasoning has attracted significant research, the extent to which fundamental reasoning paradigms (deduction, induction, abduction) induce generalization has not been systematically explored. The paper aims to understand how these core paradigms influence LLMs' reasoning behavior.

Method: 1. Collected a new dataset of reasoning trajectories from symbolic tasks targeting each of the three fundamental paradigms to abstract from concrete world knowledge. 2. Investigated various methods for inducing these reasoning skills into LLMs, including simple fine-tuning, increasing model depth, and transforming dense models to mixture-of-experts architectures. 3. Comprehensively evaluated induced models on realistic out-of-domain tasks formulated entirely in natural language with real-world knowledge.

Result: The approach yields strong generalizability with substantial performance gains (up to 14.60 points) across realistic tasks, demonstrating that training on fundamental reasoning paradigms improves generalization to real-world reasoning tasks.

Conclusion: Systematic training on core reasoning paradigms (deduction, induction, abduction) significantly improves LLM generalization to realistic reasoning tasks, with the proposed methods showing substantial performance gains across diverse evaluation settings.

Abstract: Deduction, induction, and abduction are fundamental reasoning paradigms, core for human logical thinking. Although improving Large Language Model (LLM) reasoning has attracted significant research efforts, the extent to which the fundamental paradigms induce generalization has yet to be systematically explored. In this study, we shed light on how the interplay between these core paradigms influences LLMs’ reasoning behavior. To this end, we first collect a new dataset of reasoning trajectories from symbolic tasks, each targeting one of the three fundamental paradigms, to abstract from concrete world knowledge. Then, we investigate effective ways for inducing these skills into LLMs. We experiment with a battery of methods, including simple fine-tuning and more complex approaches that increase model depth or transform a dense model into a mixture-of-experts. We comprehensively evaluate induced models on realistic out-of-domain tasks that are entirely formulated in natural language and contain real-world knowledge. Our results reveal that our approach yields strong generalizability with substantial performance gains (up to 14.60 points) across realistic tasks.

cs.CV

[106] A Hybrid Deterministic Framework for Named Entity Extraction in Broadcast News Video

Andrea Filiberto Lucas, Dylan Seychell

Main category: cs.CV

TL;DR: A framework for automatically detecting and extracting personal names from news videos using an interpretable, modular pipeline that prioritizes deterministic auditability over raw accuracy.

DetailsMotivation: The growing volume of video-based news content requires transparent and reliable methods to extract on-screen information, but manual indexing is impractical due to variability in graphical layouts, typographic conventions, and platform-specific design patterns.

Method: Proposes a comprehensive framework with a curated corpus of annotated news video frames and an interpretable, modular extraction pipeline designed for deterministic and auditable conditions, evaluated against generative multimodal methods.

Result: The detector achieves 95.8% mAP@0.5 for graphical element localization. While generative systems achieve higher raw accuracy (F1: 84.18% vs 77.08%), the proposed pipeline delivers balanced precision (79.9%) and recall (74.4%), avoids hallucination, and provides full traceability.

Conclusion: The work establishes a methodologically rigorous and interpretable baseline for hybrid multimodal information extraction in modern news media, highlighting the trade-off between deterministic auditability and stochastic inference for journalistic applications.

Abstract: The growing volume of video-based news content has heightened the need for transparent and reliable methods to extract on-screen information. Yet the variability of graphical layouts, typographic conventions, and platform-specific design patterns renders manual indexing impractical. This work presents a comprehensive framework for automatically detecting and extracting personal names from broadcast and social-media-native news videos. It introduces a curated and balanced corpus of annotated frames capturing the diversity of contemporary news graphics and proposes an interpretable, modular extraction pipeline designed to operate under deterministic and auditable conditions. The pipeline is evaluated against a contrasting class of generative multimodal methods, revealing a clear trade-off between deterministic auditability and stochastic inference. The underlying detector achieves 95.8% mAP@0.5, demonstrating operationally robust performance for graphical element localisation. While generative systems achieve marginally higher raw accuracy (F1: 84.18% vs 77.08%), they lack the transparent data lineage required for journalistic and analytical contexts. The proposed pipeline delivers balanced precision (79.9%) and recall (74.4%), avoids hallucination, and provides full traceability across each processing stage. Complementary user findings indicate that 59% of respondents report difficulty reading on-screen names in fast-paced broadcasts, underscoring the practical relevance of the task. The results establish a methodologically rigorous and interpretable baseline for hybrid multimodal information extraction in modern news media.
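
The deterministic design lends itself to a fully auditable structure: every extracted name can be traced back to a detection box and an OCR string. A skeletal sketch, with all component interfaces hypothetical:

    def extract_person_names(frame, detect_graphics, ocr, is_person_name):
        # detect_graphics(frame) -> bounding boxes of on-screen graphics (x1, y1, x2, y2)
        # ocr(region) -> recognized text; is_person_name(text) -> bool (gazetteer/rules)
        trace = []
        for (x1, y1, x2, y2) in detect_graphics(frame):
            text = ocr(frame[y1:y2, x1:x2])
            if is_person_name(text):
                trace.append({"box": (x1, y1, x2, y2), "ocr": text})  # full data lineage
        return trace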

[107] UI-Venus-1.5 Technical Report

Veuns-Team, :, Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu, Shuheng Shen, Yue Wen, Tianyu Xia, Zhenyu Xu, Zhengwen Zeng, Beitong Zhou, Xingran Zhou, Weizhi Chen, Sunhao Dai, Jingya Dou, Yichen Gong, Yuan Guo, Zhenlin Guo, Feng Li, Qian Li, Jinzhen Lin, Yuqi Zhou, Linchao Zhu, Liang Chen, Zhenyu Guo, Changhua Meng, Weiqiang Wang

Main category: cs.CV

TL;DR: UI-Venus-1.5 is a unified GUI agent family (2B, 8B, 30B-A3B) with mid-training, online RL, and model merging for robust real-world GUI automation across web, mobile, and desktop environments.

DetailsMotivation: GUI agents struggle to achieve both broad generality and strong task performance simultaneously. Current approaches lack robustness for real-world applications across diverse digital environments.

Method: Three key advances: 1) Comprehensive mid-training with 10B tokens across 30+ datasets for foundational GUI semantics; 2) Online reinforcement learning with full-trajectory rollouts for long-horizon navigation; 3) Model merging to unify domain-specific models (grounding, web, mobile) into a single agent.

Result: State-of-the-art performance on benchmarks: ScreenSpot-Pro (69.6%), VenusBench-GD (75.0%), AndroidWorld (77.6%). Demonstrates robust navigation across Chinese mobile apps in real-world scenarios.

Conclusion: UI-Venus-1.5 achieves both generality and strong performance through unified architecture and advanced training techniques, enabling robust GUI automation across diverse real-world applications.

Abstract: GUI agents have emerged as a powerful paradigm for automating interactions in digital environments, yet achieving both broad generality and consistently strong task performance remains challenging. In this report, we present UI-Venus-1.5, a unified, end-to-end GUI Agent designed for robust real-world applications. The proposed model family comprises two dense variants (2B and 8B) and one mixture-of-experts variant (30B-A3B) to meet various downstream application scenarios. Compared to our previous version, UI-Venus-1.5 introduces three key technical advances: (1) a comprehensive Mid-Training stage leveraging 10 billion tokens across 30+ datasets to establish foundational GUI semantics; (2) Online Reinforcement Learning with full-trajectory rollouts, aligning training objectives with long-horizon, dynamic navigation in large-scale environments; and (3) a single unified GUI Agent constructed via Model Merging, which synthesizes domain-specific models (grounding, web, and mobile) into one cohesive checkpoint. Extensive evaluations demonstrate that UI-Venus-1.5 establishes new state-of-the-art performance on benchmarks such as ScreenSpot-Pro (69.6%), VenusBench-GD (75.0%), and AndroidWorld (77.6%), significantly outperforming previous strong baselines. In addition, UI-Venus-1.5 demonstrates robust navigation capabilities across a variety of Chinese mobile apps, effectively executing user instructions in real-world scenarios. Code: https://github.com/inclusionAI/UI-Venus; Model: https://huggingface.co/collections/inclusionAI/ui-venus
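
Of the three advances, model merging is the simplest to sketch: weighted parameter averaging across the domain-specific checkpoints. UI-Venus-1.5's actual merging recipe is not detailed in the abstract, so treat this as the baseline form of the idea.

    def merge_checkpoints(state_dicts, weights):
        # state_dicts: parameter dicts of the grounding/web/mobile specialists
        # weights: one scalar per specialist, e.g., summing to 1
        return {k: sum(w * sd[k] for w, sd in zip(weights, state_dicts))
                for k in state_dicts[0]}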

[108] Towards Training-free Multimodal Hate Localisation with Large Language Models

Yueming Sun, Long Yang, Jianbo Jiao, Zeyu Fu

Main category: cs.CV

TL;DR: LELA is a training-free LLM-based framework for hate video localization that uses multi-modal decomposition and multi-stage prompting to detect and temporally localize hateful content without requiring large-scale human annotations.

DetailsMotivation: Existing video hate detection solutions either require extensive human annotations or lack fine-grained temporal precision, creating scalability and accuracy limitations for detecting hateful content in online videos.

Method: Decomposes videos into five modalities (image, speech, OCR, music, video context), uses a multi-stage prompting scheme with LLMs to compute fine-grained hate scores per frame, and introduces composition matching for enhanced cross-modal reasoning.

Result: Outperforms all existing training-free baselines on HateMM and MultiHateClip benchmarks by a large margin, with extensive ablations and qualitative visualizations demonstrating effectiveness.

Conclusion: LELA establishes a strong foundation for scalable and interpretable hate video localization through its training-free, LLM-based approach with multi-modal decomposition.

Abstract: The proliferation of hateful content in online videos poses severe threats to individual well-being and societal harmony. However, existing solutions for video hate detection either rely heavily on large-scale human annotations or lack fine-grained temporal precision. In this work, we propose LELA, the first training-free Large Language Model (LLM) based framework for hate video localization. Distinct from state-of-the-art models that depend on supervised pipelines, LELA leverages LLMs and modality-specific captioning to detect and temporally localize hateful content in a training-free manner. Our method decomposes a video into five modalities, including image, speech, OCR, music, and video context, and uses a multi-stage prompting scheme to compute fine-grained hateful scores for each frame. We further introduce a composition matching mechanism to enhance cross-modal reasoning. Experiments on two challenging benchmarks, HateMM and MultiHateClip, demonstrate that LELA outperforms all existing training-free baselines by a large margin. We also provide extensive ablations and qualitative visualizations, establishing LELA as a strong foundation for scalable and interpretable hate video localization.

[109] Agent Banana: High-Fidelity Image Editing with Agentic Thinking and Tooling

Ruijie Ye, Jiayi Zhang, Zhuoxin Liu, Zihao Zhu, Siyuan Yang, Li Li, Tianfu Fu, Franck Dernoncourt, Yue Zhao, Jiacheng Zhu, Ryan Rossi, Wenhao Chai, Zhengzhong Tu

Main category: cs.CV

TL;DR: Agent Banana: A hierarchical agentic planner-executor framework for high-fidelity, object-aware, deliberative image editing that addresses professional workflow challenges through context folding and image layer decomposition.

DetailsMotivation: Address three persistent challenges in instruction-based image editing for professional workflows: (1) editors often over-edit beyond user intent, (2) existing models are largely single-turn while multi-turn edits can alter object faithfulness, and (3) evaluation at ~1K resolution misaligns with real workflows using ultra high-definition images (e.g., 4K).

Method: Proposes Agent Banana, a hierarchical agentic planner-executor framework with two key mechanisms: (1) Context Folding - compresses long interaction histories into structured memory for stable long-horizon control; (2) Image Layer Decomposition - performs localized layer-based edits to preserve non-target regions while enabling native-resolution outputs. Also introduces HDD-Bench, a high-definition, dialogue-based benchmark with verifiable stepwise targets and native 4K images.

Result: On HDD-Bench, Agent Banana achieves best multi-turn consistency and background fidelity (IC 0.871, SSIM-OM 0.84, LPIPS-OM 0.12) while remaining competitive on instruction following. Also attains strong performance on standard single-turn editing benchmarks.

Conclusion: Agent Banana advances reliable, professional-grade agentic image editing and its integration into real workflows by addressing key challenges in multi-turn, high-resolution editing through structured memory and localized editing techniques.

Abstract: We study instruction-based image editing under professional workflows and identify three persistent challenges: (i) editors often over-edit, modifying content beyond the user’s intent; (ii) existing models are largely single-turn, while multi-turn edits can alter object faithfulness; and (iii) evaluation at around 1K resolution is misaligned with real workflows that often operate on ultra high-definition images (e.g., 4K). We propose Agent Banana, a hierarchical agentic planner-executor framework for high-fidelity, object-aware, deliberative editing. Agent Banana introduces two key mechanisms: (1) Context Folding, which compresses long interaction histories into structured memory for stable long-horizon control; and (2) Image Layer Decomposition, which performs localized layer-based edits to preserve non-target regions while enabling native-resolution outputs. To support rigorous evaluation, we build HDD-Bench, a high-definition, dialogue-based benchmark featuring verifiable stepwise targets and native 4K images (11.8M pixels) for diagnosing long-horizon failures. On HDD-Bench, Agent Banana achieves the best multi-turn consistency and background fidelity (e.g., IC 0.871, SSIM-OM 0.84, LPIPS-OM 0.12) while remaining competitive on instruction following, and also attains strong performance on standard single-turn editing benchmarks. We hope this work advances reliable, professional-grade agentic image editing and its integration into real workflows.

[110] SemanticMoments: Training-Free Motion Similarity via Third Moment Features

Saar Huberman, Kfir Goldberg, Or Patashnik, Sagie Benaim, Ron Mokady

Main category: cs.CV

TL;DR: SemanticMoments: A training-free method using temporal statistics of semantic features for motion-centric video understanding, outperforming existing approaches on new benchmarks.

DetailsMotivation: Existing video representation methods overly rely on static appearance and scene context rather than motion dynamics, while traditional motion-centric inputs like optical flow lack semantic grounding for high-level motion understanding.

Method: Proposes SemanticMoments - a simple, training-free method that computes temporal statistics (higher-order moments) over features from pre-trained semantic models to capture motion dynamics in a semantically grounded feature space.

Result: SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods across new SimMotion benchmarks (synthetic + human-annotated real-world data), demonstrating superior motion understanding.

Conclusion: Temporal statistics in a semantic feature space provide a scalable and perceptually grounded foundation for motion-centric video understanding, addressing the bias in existing approaches.

Abstract: Retrieving videos based on semantic motion is a fundamental, yet unsolved, problem. Existing video representation approaches overly rely on static appearance and scene context rather than motion dynamics, a bias inherited from their training data and objectives. Conversely, traditional motion-centric inputs like optical flow lack the semantic grounding needed to understand high-level motion. To demonstrate this inherent bias, we introduce the SimMotion benchmarks, combining controlled synthetic data with a new human-annotated real-world dataset. We show that existing models perform poorly on these benchmarks, often failing to disentangle motion from appearance. To address this gap, we propose SemanticMoments, a simple, training-free method that computes temporal statistics (specifically, higher-order moments) over features from pre-trained semantic models. Across our benchmarks, SemanticMoments consistently outperforms existing RGB, flow, and text-supervised methods. This demonstrates that temporal statistics in a semantic feature space provide a scalable and perceptually grounded foundation for motion-centric video understanding.
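
Because the method is training-free, it reduces to a few lines: pool per-frame features from a frozen semantic encoder into temporal moments, with the third standardized moment capturing asymmetries of motion. Normalization details in the paper may differ.

    import torch

    def semantic_moments(feats, eps=1e-6):
        # feats: (T, D) per-frame features from a frozen semantic encoder
        mu = feats.mean(dim=0)
        sigma = feats.std(dim=0) + eps
        z = (feats - mu) / sigma
        skew = (z ** 3).mean(dim=0)  # third standardized moment per channel
        return torch.cat([mu, sigma, skew])  # (3*D,) motion-aware video descriptor

Retrieval then compares these descriptors with, e.g., cosine similarity.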

[111] Controllable Dance Generation with Style-Guided Motion Diffusion

Hongsong Wang, Ying Zhu, Xin Geng, Liang Wang

Main category: cs.CV

TL;DR: SGMD is a style-guided motion diffusion model for controllable dance generation that aligns dance sequences with both musical content and user-specified style prompts using a Transformer with Style Modulation and spatial-temporal masking.

DetailsMotivation: Existing dance generation approaches lack controllability and fail to properly model the impact of music styles, resulting in dances that don't align with the expressive characteristics of the conditioned music. There's a need for more flexible, style-aware dance generation.

Method: Proposes Style-Guided Motion Diffusion (SGMD) with Transformer-based architecture and Style Modulation module. Incorporates music features with user style prompts, and introduces spatial-temporal masking for flexible control. Also establishes experimental setups for trajectory-based generation, dance in-betweening, and inpainting.

Result: Extensive experiments show SGMD generates realistic and stylistically consistent dances while enabling user control for diverse artistic needs. The approach outperforms existing methods in aligning dance with musical style.

Conclusion: SGMD successfully addresses controllability and style alignment in dance generation, providing a flexible framework for creating music-aligned dances with specific stylistic characteristics.

Abstract: Dance plays an important role as an artistic form and expression in human culture, yet automatically generating dance sequences is a significant yet challenging endeavor. Existing approaches often neglect the critical aspect of controllability in dance generation. Additionally, they inadequately model the nuanced impact of music styles, resulting in dances that lack alignment with the expressive characteristics inherent in the conditioned music. To address this gap, we propose Style-Guided Motion Diffusion (SGMD), which integrates the Transformer-based architecture with a Style Modulation module. By incorporating music features with user-provided style prompts, the SGMD ensures that the generated dances not only match the musical content but also reflect the desired stylistic characteristics. To enable flexible control over the generated dances, we introduce a spatial-temporal masking mechanism. As controllable dance generation has not been fully studied, we construct corresponding experimental setups and benchmarks for tasks such as trajectory-based dance generation, dance in-betweening, and dance inpainting. Extensive experiments demonstrate that our approach can generate realistic and stylistically consistent dances, while also empowering users to create dances tailored to diverse artistic and practical needs. Code is available on Github: https://github.com/mucunzhuzhu/DGSDP
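
A spatial-temporal mask of this kind is typically applied inpainting-style during sampling: user-constrained joints and frames are overwritten with the reference motion while the rest is generated. A minimal sketch under that assumption:

    def apply_st_mask(x_generated, x_reference, mask):
        # x_*: (T, J, C) motion tensors; mask: (T, J) with 1 = user-constrained
        m = mask.unsqueeze(-1)
        return m * x_reference + (1 - m) * x_generated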

[112] Decoding Future Risk: Deep Learning Analysis of Tubular Adenoma Whole-Slide Images

Ahmed Rahu, Brian Shula, Brandon Combs, Aqsa Sultana, Surendra P. Singh, Vijayan K. Asari, Derrick Forchetti

Main category: cs.CV

TL;DR: Machine learning analysis of whole-slide images of low-grade tubular adenomas to predict colorectal cancer risk

DetailsMotivation: Despite screening, many patients with low-grade adenomas still develop colorectal cancer later. Current histological assessment may miss subtle features indicating malignant potential, creating a need for better risk prediction tools.

Method: Using convolutional neural networks (CNNs) to analyze whole-slide images (WSIs) of low-grade tubular adenomas to detect subtle histological features predictive of long-term colorectal cancer risk.

Result: Not specified in abstract, but the study investigates whether CNNs can identify predictive features in WSIs.

Conclusion: Machine learning analysis of digital pathology images shows promise for improving risk stratification in colorectal cancer patients with low-grade adenomas.

Abstract: Colorectal cancer (CRC) remains a significant cause of cancer-related mortality, despite the widespread implementation of prophylactic initiatives aimed at detecting and removing precancerous polyps. Although screening effectively reduces incidence, a notable portion of patients initially diagnosed with low-grade adenomatous polyps will still develop CRC later in life, even without the presence of known high-risk syndromes. Identifying which low-risk patients are at higher risk of progression is a critical unmet need for tailored surveillance and preventative therapeutic strategies. Traditional histological assessment of adenomas, while fundamental, may not fully capture subtle architectural or cytological features indicative of malignant potential. Advancements in digital pathology and machine learning provide an opportunity to analyze whole-slide images (WSIs) comprehensively and objectively. This study investigates whether machine learning algorithms, specifically convolutional neural networks (CNNs), can detect subtle histological features in WSIs of low-grade tubular adenomas that are predictive of a patient’s long-term risk of developing colorectal cancer.

[113] VLM-Guided Iterative Refinement for Surgical Image Segmentation with Foundation Models

Ange Lou, Yamin Li, Qi Chang, Nan Xi, Luyuan Xie, Zichao Li, Tianyu Luan

Main category: cs.CV

TL;DR: IR-SIS: An iterative refinement system for surgical image segmentation that uses natural language descriptions and clinician feedback, featuring adaptive refinement strategies and state-of-the-art performance.

DetailsMotivation: Existing surgical image segmentation methods are limited to predefined categories, produce one-shot predictions without refinement, and lack clinician interaction mechanisms. There's a need for more adaptive, interactive systems that can understand natural language descriptions.

Method: Proposes IR-SIS system with: 1) Fine-tuned SAM3 for initial segmentation, 2) Vision-Language Model to detect instruments and assess quality, 3) Agentic workflow for adaptive refinement strategy selection, 4) Clinician-in-the-loop interaction via natural language feedback, 5) Multi-granularity language-annotated dataset from EndoVis2017/2018.

Result: Demonstrates state-of-the-art performance on both in-domain and out-of-distribution data. Clinician interaction provides additional improvements. Establishes first language-based surgical segmentation framework with adaptive self-refinement capabilities.

Conclusion: IR-SIS successfully creates an interactive, language-driven surgical segmentation system with iterative refinement, addressing limitations of existing methods and enabling clinician collaboration through natural language interfaces.

Abstract: Surgical image segmentation is essential for robot-assisted surgery and intraoperative guidance. However, existing methods are constrained to predefined categories, produce one-shot predictions without adaptive refinement, and lack mechanisms for clinician interaction. We propose IR-SIS, an iterative refinement system for surgical image segmentation that accepts natural language descriptions. IR-SIS leverages a fine-tuned SAM3 for initial segmentation, employs a Vision-Language Model to detect instruments and assess segmentation quality, and applies an agentic workflow that adaptively selects refinement strategies. The system supports clinician-in-the-loop interaction through natural language feedback. We also construct a multi-granularity language-annotated dataset from EndoVis2017 and EndoVis2018 benchmarks. Experiments demonstrate state-of-the-art performance on both in-domain and out-of-distribution data, with clinician interaction providing additional improvements. Our work establishes the first language-based surgical segmentation framework with adaptive self-refinement capabilities.
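
The agentic workflow can be pictured as an assess-then-refine loop; the callables below are hypothetical stand-ins for the fine-tuned SAM3, the VLM judge, and the pool of refinement strategies.

    def iterative_refine(image, query, segment, vlm_assess, strategies,
                         max_rounds=3, threshold=0.8):
        mask = segment(image, query)  # initial segmentation from the text query
        for _ in range(max_rounds):
            report = vlm_assess(image, mask, query)  # e.g., {"quality": 0.6, "issue": "boundary"}
            if report["quality"] >= threshold:
                break
            refine = strategies.get(report["issue"])  # pick a strategy per failure mode
            if refine is None:
                break
            mask = refine(image, mask, query)
        return mask

Clinician feedback slots in naturally as another source of "issue" labels for strategy selection.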

[114] All-in-One Conditioning for Text-to-Image Synthesis

Hirunima Jayasekara, Chuong Huynh, Yixuan Ren, Christabel Acquaye, Abhinav Shrivastava

Main category: cs.CV

TL;DR: A novel scene graph-based conditioning approach for text-to-image synthesis that uses ASQL Conditioner to generate soft visual guidance, improving compositional abilities without rigid layout constraints.

DetailsMotivation: Current text-to-image models struggle with semantic fidelity and structural coherence when processing complex prompts with multiple objects, attributes, and spatial relationships. Existing approaches using pre-defined layout maps limit compositional flexibility and diversity.

Method: Proposes a zero-shot, scene graph-based conditioning mechanism with an Attribute-Size-Quantity-Location (ASQL) Conditioner. Uses a lightweight language model to generate soft visual guidance and guides diffusion-based generation through inference-time optimization.

Result: Enables models to maintain better text-image alignment while supporting lightweight, coherent, and diverse image synthesis without rigid layout constraints.

Conclusion: Scene graph-based conditioning with ASQL Conditioner provides an effective approach to enhance compositional abilities in text-to-image synthesis, addressing limitations of current methods in handling complex prompts.

Abstract: Accurate interpretation and visual representation of complex prompts involving multiple objects, attributes, and spatial relationships is a critical challenge in text-to-image synthesis. Despite recent advancements in generating photorealistic outputs, current models often struggle with maintaining semantic fidelity and structural coherence when processing intricate textual inputs. We propose a novel approach that grounds text-to-image synthesis within the framework of scene graph structures, aiming to enhance the compositional abilities of existing models. Even though prior approaches have attempted to address this by using pre-defined layout maps derived from prompts, such rigid constraints often limit compositional flexibility and diversity. In contrast, we introduce a zero-shot, scene graph-based conditioning mechanism that generates soft visual guidance during inference. At the core of our method is the Attribute-Size-Quantity-Location (ASQL) Conditioner, which produces visual conditions via a lightweight language model and guides diffusion-based generation through inference-time optimization. This enables the model to maintain text-image alignment while supporting lightweight, coherent, and diverse image synthesis.

[115] Wearable environmental sensing to forecast how legged systems will interact with upcoming terrain

Michael D. Murray, James Tung, Richard W. Nuckols

Main category: cs.CV

TL;DR: A computer vision system predicts foot center-of-pressure and time-of-impact during gait transitions using a lightweight CNN-RNN model from wearable camera data.

DetailsMotivation: While computer vision is used for environmental classification in gait analysis, predicting how the foot will contact changing environments (like transitioning from level ground to stairs) is underexplored but important for anticipatory control in assistive systems.

Method: Used a CNN-RNN model to forecast anterior-posterior foot center-of-pressure and time-of-impact within a 250ms window before foot-strike. Data collected from 8 subjects wearing an RGB-D camera on their shank and instrumented insoles while stepping onto stairs.

Result: Achieved mean-absolute-error of 23.72-29.42mm for COP and 17.73-21.14ms for TOI across different forecast horizons. Model runs at 60 FPS on consumer hardware. Faster toe-swing speeds improved COP accuracy, and more anterior foot-strikes reduced COP accuracy.

Conclusion: Forecasting COP and TOI from visual data is feasible with lightweight models, enabling real-time anticipatory control for assistive systems during gait transitions.

Abstract: Computer-vision (CV) has been used for environmental classification during gait and is often used to inform control in assistive systems; however, the ability to predict how the foot will contact a changing environment is underexplored. We evaluated the feasibility of forecasting the anterior-posterior (AP) foot center-of-pressure (COP) and time-of-impact (TOI) prior to foot-strike on a level-ground to stair-ascent transition. Eight subjects wore an RGB-D camera on their right shank and instrumented insoles while performing the task of stepping onto the stairs. We trained a CNN-RNN to forecast the COP and TOI continuously within a 250ms window prior to foot-strike, termed the forecast horizon (FH). The COP mean-absolute-error (MAE) at 150, 100, and 50ms FH was 29.42mm, 26.82mm, and 23.72mm respectively. The TOI MAE was 21.14, 20.08, and 17.73ms for 150, 100, and 50ms respectively. While torso velocity had no effect on the error in either task, faster toe-swing speeds prior to foot-strike were found to improve prediction accuracy in the COP case; however, this effect was insignificant in the TOI case. Further, more anterior foot-strikes were found to reduce COP prediction accuracy but did not affect the TOI prediction accuracy. We also found that our lightweight model was capable of running at 60 FPS on either a consumer-grade laptop or an edge computing device. This study demonstrates that forecasting COP and TOI from visual data was feasible using a lightweight model, which may have important implications for anticipatory control in assistive systems.
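
For concreteness, a minimal CNN-RNN of the kind described, regressing the two targets from a window of RGB-D frames; every layer size here is illustrative rather than the authors' architecture.

    import torch
    import torch.nn as nn

    class CopToiForecaster(nn.Module):
        def __init__(self, hidden=128):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(4, 16, 5, stride=2), nn.ReLU(),  # 4 channels = RGB + depth
                nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            self.rnn = nn.GRU(32, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 2)  # [AP COP (mm), TOI (ms)]

        def forward(self, frames):  # frames: (B, T, 4, H, W)
            b, t = frames.shape[:2]
            f = self.cnn(frames.flatten(0, 1)).view(b, t, -1)
            out, _ = self.rnn(f)
            return self.head(out[:, -1])  # forecast at the end of the window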

[116] VLM-UQBench: A Benchmark for Modality-Specific and Cross-Modality Uncertainties in Vision Language Models

Chenyu Wang, Tianle Chen, H. M. Sabbir Ahmad, Kayhan Batmanghelich, Wenchao Li

Main category: cs.CV

TL;DR: VLM-UQBench: A benchmark for modality-specific and cross-modal uncertainty quantification in vision-language models, with evaluation showing current UQ methods have strong modality specialization but weak correlation with hallucinations.

DetailsMotivation: Uncertainty quantification is crucial for safe and reliable vision-language models, but there's a need to localize uncertainty sources (image, text, or cross-modal misalignment) and evaluate UQ methods systematically.

Method: Created VLM-UQBench with 600 real-world samples from VizWiz, curated into clean and uncertainty subsets, plus a perturbation pipeline with visual, textual, and cross-modal perturbations. Proposed two metrics to quantify UQ sensitivity to perturbations and correlation with hallucinations.

Result: Existing UQ methods show strong modality-specific specialization and dependence on underlying VLM; modality-specific uncertainty co-occurs with hallucinations but current UQ scores provide weak risk signals; UQ methods can handle overt ambiguity but fail on subtle instance-level ambiguity.

Conclusion: Significant gap exists between current UQ practices and the fine-grained, modality-aware uncertainty needed for reliable VLM deployment, highlighting the need for better uncertainty localization methods.

Abstract: Uncertainty quantification (UQ) is vital for ensuring that vision-language models (VLMs) behave safely and reliably. A central challenge is to localize uncertainty to its source, determining whether it arises from the image, the text, or misalignment between the two. We introduce VLM-UQBench, a benchmark for modality-specific and cross-modal data uncertainty in VLMs. It consists of 600 real-world samples drawn from the VizWiz dataset, curated into clean, image-, text-, and cross-modal uncertainty subsets, and a scalable perturbation pipeline with 8 visual, 5 textual, and 3 cross-modal perturbations. We further propose two simple metrics that quantify the sensitivity of UQ scores to these perturbations and their correlation with hallucinations, and use them to evaluate a range of UQ methods across four VLMs and three datasets. Empirically, we find that: (i) existing UQ methods exhibit strong modality-specific specialization and substantial dependence on the underlying VLM, (ii) modality-specific uncertainty frequently co-occurs with hallucinations while current UQ scores provide only weak and inconsistent risk signals, and (iii) although UQ methods can rival reasoning-based chain-of-thought baselines on overt, group-level ambiguity, they largely fail to detect the subtle, instance-level ambiguity introduced by our perturbation pipeline. These results highlight a significant gap between current UQ practices and the fine-grained, modality-aware uncertainty required for reliable VLM deployment.
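
One plausible reading of the two proposed metrics, sketched below: a normalized shift of UQ scores under a perturbation, and a correlation between scores and binary hallucination labels. The paper's exact formulations may differ.

    import numpy as np

    def uq_sensitivity(scores_clean, scores_perturbed):
        # shift of the mean UQ score, normalized by the clean-score spread
        return (np.mean(scores_perturbed) - np.mean(scores_clean)) / (np.std(scores_clean) + 1e-8)

    def uq_hallucination_corr(scores, hallucinated):
        # point-biserial correlation between UQ scores and 0/1 hallucination labels
        return float(np.corrcoef(scores, hallucinated)[0, 1])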

[117] Rethinking Global Text Conditioning in Diffusion Transformers

Nikita Starodubcev, Daniil Pakhomov, Zongze Wu, Ilya Drobyshevskiy, Yuchen Liu, Zhonghao Wang, Yuqian Zhou, Zhe Lin, Dmitry Baranchuk

Main category: cs.CV

TL;DR: Modulation-based text conditioning in diffusion transformers is not necessary for basic performance but can provide significant gains when used as guidance for controllable shifts toward desirable properties.

DetailsMotivation: To investigate whether modulation-based text conditioning in diffusion transformers is necessary and whether it can provide performance advantages, given that recent approaches discard it in favor of attention-only mechanisms.

Method: Analyzes conventional usage of pooled text embeddings in diffusion transformers, reveals their limited contribution to overall performance, and proposes using them as guidance for controllable shifts instead of direct modulation.

Result: Attention alone is sufficient for faithful prompt propagation, but pooled embeddings provide significant gains when used as guidance for controllable shifts toward more desirable properties in text-to-image/video generation and image editing.

Conclusion: Modulation-based text conditioning is not necessary but can be repurposed as training-free guidance that enables controllable property shifts with negligible runtime overhead across various diffusion models and tasks.

Abstract: Diffusion transformers typically incorporate textual information via attention layers and a modulation mechanism using a pooled text embedding. Nevertheless, recent approaches discard modulation-based text conditioning and rely exclusively on attention. In this paper, we address whether modulation-based text conditioning is necessary and whether it can provide any performance advantage. Our analysis shows that, in its conventional usage, the pooled embedding contributes little to overall performance, suggesting that attention alone is generally sufficient for faithfully propagating prompt information. However, we reveal that the pooled embedding can provide significant gains when used from a different perspective: serving as guidance and enabling controllable shifts toward more desirable properties. This approach is training-free, simple to implement, incurs negligible runtime overhead, and can be applied to various diffusion models, bringing improvements across diverse tasks, including text-to-image/video generation and image editing.
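
The guidance view admits a classifier-free-guidance-style sketch: run the denoiser twice, with and without the pooled embedding in the modulation pathway, and extrapolate. The denoiser signature below is hypothetical; the key point is that only the modulation input changes between the two passes.

    def pooled_guidance(denoise, x_t, t, text_tokens, pooled, null_pooled, scale=2.0):
        # denoise(x_t, t, tokens, pooled) -> predicted noise/velocity (hypothetical API)
        eps_base = denoise(x_t, t, text_tokens, null_pooled)
        eps_mod = denoise(x_t, t, text_tokens, pooled)
        return eps_base + scale * (eps_mod - eps_base)  # shift toward the desired property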

[118] X-Mark: Saliency-Guided Robust Dataset Ownership Verification for Medical Imaging

Pranav Kulkarni, Junfeng Guo, Heng Huang

Main category: cs.CV

TL;DR: X-Mark is a sample-specific clean-label watermarking method for chest X-ray copyright protection that generates unique perturbations within salient regions using a conditional U-Net, ensuring watermark efficacy, robustness to scaling, and preservation of diagnostic quality.

DetailsMotivation: Medical imaging datasets are valuable for training deep learning models but face copyright and ethical concerns when used without authorization. Existing watermarking methods designed for natural images don't work well for medical images due to dynamic scaling, high resolution, limited visual diversity, and the need to preserve diagnostic quality.

Method: X-Mark uses a conditional U-Net to generate unique perturbations within salient regions of each chest X-ray sample. It employs a multi-component training objective with Laplacian regularization to ensure watermark efficacy, robustness against dynamic scaling, preservation of diagnostic quality, and visual distinguishability. Ownership verification is performed in a black-box setting by detecting characteristic behaviors in suspicious models.

Result: Extensive experiments on CheXpert dataset show X-Mark achieves 100% Watermark Success Rate (WSR) and reduces probability of false positives in Ind-M scenario by 12%, while demonstrating resistance to potential adaptive attacks.

Conclusion: X-Mark provides an effective solution for medical image copyright protection that addresses the unique challenges of medical imaging, balancing watermark robustness with preservation of diagnostic quality through scale-invariant watermark generation.

Abstract: High-quality medical imaging datasets are essential for training deep learning models, but their unauthorized use raises serious copyright and ethical concerns. Medical imaging presents a unique challenge for existing dataset ownership verification methods designed for natural images: static watermark patterns generated for fixed-scale images scale poorly to dynamic, high-resolution scans with limited visual diversity and subtle anatomical structures, where diagnostic quality must be preserved. In this paper, we propose X-Mark, a sample-specific clean-label watermarking method for chest X-ray copyright protection. Specifically, X-Mark uses a conditional U-Net to generate unique perturbations within salient regions of each sample. We design a multi-component training objective to ensure watermark efficacy and robustness against dynamic scaling processes while preserving diagnostic quality and visual distinguishability. We incorporate Laplacian regularization into our training objective to penalize high-frequency perturbations and achieve watermark scale-invariance. Ownership verification is performed in a black-box setting to detect characteristic behaviors in suspicious models. Extensive experiments on CheXpert verify the effectiveness of X-Mark, achieving a WSR of 100% and reducing the probability of false positives in the Ind-M scenario by 12%, while demonstrating resistance to potential adaptive attacks.
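
The Laplacian regularization term admits a direct sketch: convolve the learned perturbation with a discrete Laplacian kernel and penalize the resulting energy, discouraging the fine detail that dynamic rescaling would destroy. The kernel and reduction below are our choices, not necessarily the paper's.

    import torch
    import torch.nn.functional as F

    def laplacian_penalty(delta):
        # delta: watermark perturbation (B, 1, H, W)
        k = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                         device=delta.device).view(1, 1, 3, 3)
        return F.conv2d(delta, k, padding=1).pow(2).mean()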

[119] Assessing Identity Leakage in Talking Face Generation: Metrics and Evaluation Framework

Dogucan Yaman, Fevziye Irem Eyiokur, Hazım Kemal Ekenel, Alexander Waibel

Main category: cs.CV

TL;DR: Proposes systematic evaluation methodology to detect and quantify lip leakage in talking face generation models, where reference images influence generated lip motion instead of just audio.

DetailsMotivation: Current talking face generation methods that use identity reference images can suffer from "lip leakage" - where the reference image influences generated lip motion rather than just the driving audio. This leakage is hard to detect with standard metrics and test setups, creating unreliable benchmarks.

Method: Develops three complementary test setups: 1) silent-input generation, 2) mismatched audio-video pairing, and 3) matched audio-video synthesis. Introduces derived metrics including lip-sync discrepancy and silent-audio-based lip-sync scores. Also studies how different identity reference selections affect leakage.

Result: Proposes a model-agnostic evaluation framework that establishes more reliable benchmarks for talking face generation research by systematically quantifying lip leakage.

Conclusion: The methodology provides a systematic way to analyze and quantify lip leakage in talking face generation, offering insights into reference design and establishing better evaluation benchmarks for future research.

Abstract: Video editing-based talking face generation aims to preserve video details such as pose, lighting, and gestures while modifying only lip motion, often using an identity reference image to maintain speaker consistency. However, this mechanism can introduce lip leakage, where generated lips are influenced by the reference image rather than solely by the driving audio. Such leakage is difficult to detect with standard metrics and conventional test setup. To address this, we propose a systematic evaluation methodology to analyze and quantify lip leakage. Our framework employs three complementary test setups: silent-input generation, mismatched audio-video pairing, and matched audio-video synthesis. We also introduce derived metrics including lip-sync discrepancy and silent-audio-based lip-sync scores. In addition, we study how different identity reference selections affect leakage, providing insights into reference design. The proposed methodology is model-agnostic and establishes a more reliable benchmark for future research in talking face generation.
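
Sketching the derived metrics around a generic sync scorer (e.g., a SyncNet-style confidence); the exact definitions in the paper may differ.

    def lip_leakage_metrics(sync_score, vid_matched, vid_mismatched, vid_silent,
                            audio_true, audio_other):
        # sync_score(video, audio) -> scalar audio-visual sync confidence (hypothetical)
        # leakage depresses sync under mismatched driving audio, widening the gap
        discrepancy = sync_score(vid_matched, audio_true) - sync_score(vid_mismatched, audio_other)
        # lips generated from silent input should not track the original speech;
        # residual sync here points to influence from the identity reference
        silent_sync = sync_score(vid_silent, audio_true)
        return {"lip_sync_discrepancy": discrepancy, "silent_sync": silent_sync}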

[120] A Deep Multi-Modal Method for Patient Wound Healing Assessment

Subba Reddy Oota, Vijay Rowtula, Shahid Mohammed, Jeffrey Galitz, Minghsun Liu, Manish Gupta

Main category: cs.CV

TL;DR: A deep multimodal method using wound variables and images to predict patient hospitalization risk and wound healing trajectories

DetailsMotivation: Hospitalization is a major cost factor in wound care, often resulting from delayed treatment, patient non-compliance, or comorbidities. Early detection of wound complications could prevent hospitalizations and reduce clinician diagnosis time.

Method: Transfer learning-based wound assessment solution that collectively uses wound variables and wound images to predict both wound variables from images and their healing trajectories. A deep multimodal approach for hospitalization risk prediction.

Result: The paper presents a novel model for early detection of wound complexities that might affect healing, with potential to reduce hospitalization rates and clinician diagnosis time.

Conclusion: The proposed multimodal deep learning approach can help in early detection of wound complications, potentially preventing hospitalizations and improving wound care efficiency.

Abstract: Hospitalization of patients is one of the major factors for high wound care costs. Most patients do not acquire a wound which needs immediate hospitalization. However, due to factors such as delay in treatment, patient’s non-compliance or existing co-morbid conditions, an injury can deteriorate and ultimately lead to patient hospitalization. In this paper, we propose a deep multi-modal method to predict the patient’s risk of hospitalization. Our goal is to predict the risk confidently by collectively using the wound variables and wound images of the patient. Existing works in this domain have mainly focused on healing trajectories based on distinct wound types. We developed a transfer learning-based wound assessment solution, which can predict both wound variables from wound images and their healing trajectories, which is our primary contribution. We argue that the development of a novel model can help in early detection of the complexities in the wound, which might affect the healing process and also reduce the time spent by a clinician to diagnose the wound.
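
A minimal late-fusion sketch of the multi-modal idea: features from a pretrained image backbone are concatenated with tabular wound variables to predict hospitalization risk. The backbone choice and dimensions are assumptions, not the authors' configuration.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet18

    class WoundRiskNet(nn.Module):
        def __init__(self, n_vars=12):
            super().__init__()
            backbone = resnet18(weights=None)  # transfer learning would load pretrained weights
            backbone.fc = nn.Identity()        # expose 512-d image features
            self.backbone = backbone
            self.head = nn.Sequential(nn.Linear(512 + n_vars, 64), nn.ReLU(),
                                      nn.Linear(64, 1))

        def forward(self, image, wound_vars):  # image: (B, 3, H, W); wound_vars: (B, n_vars)
            feat = self.backbone(image)
            return torch.sigmoid(self.head(torch.cat([feat, wound_vars], dim=1)))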

[121] GAFR-Net: A Graph Attention and Fuzzy-Rule Network for Interpretable Breast Cancer Image Classification

Lin-Guo Gao, Suxing Liu

Main category: cs.CV

TL;DR: GAFR-Net: Graph Attention and Fuzzy-Rule Network for interpretable breast cancer histopathology image classification with limited annotations, using graph representations and explicit fuzzy logic rules.

DetailsMotivation: Address limitations of conventional deep learning for medical image analysis: performance degradation with limited annotations and lack of interpretability ("black-box" nature) that hinders clinical integration.

Method: Constructs similarity-driven graph representations to model intersample relationships, uses multi-head graph attention to capture relational features, and incorporates differentiable fuzzy-rule module encoding topological descriptors (node degree, clustering coefficient, label consistency) into explicit “IF-THEN” diagnostic logic.

Result: Outperforms state-of-the-art methods across three benchmark datasets (BreakHis, Mini-DDSM, ICIAR2018) for breast cancer histopathology classification, demonstrating superior generalization and practical utility for weakly supervised medical image analysis.

Conclusion: GAFRNet provides a robust and interpretable solution for medical image classification with scarce supervision, offering transparent reasoning that mimics expert diagnostic logic and enabling reliable clinical decision support.

Abstract: Accurate classification of breast cancer histopathology images is pivotal for early oncological diagnosis and therapeutic intervention. However, conventional deep learning architectures often encounter performance degradation under limited annotations and suffer from a “black-box” nature, hindering their clinical integration. To mitigate these limitations, we propose GAFR-Net, a robust and interpretable Graph Attention and Fuzzy-Rule Network specifically engineered for histopathology image classification with scarce supervision. GAFR-Net constructs a similarity-driven graph representation to model inter-sample relationships and employs a multi-head graph attention mechanism to capture complex relational features across heterogeneous tissue structures. Concurrently, a differentiable fuzzy-rule module encodes intrinsic topological descriptors (node degree, clustering coefficient, and label consistency) into explicit, human-understandable diagnostic logic. This design establishes transparent “IF-THEN” mappings that mimic the heuristic deduction process of medical experts, providing clear reasoning behind each prediction without relying on post-hoc attribution methods. Extensive evaluations on three benchmark datasets (BreakHis, Mini-DDSM, and ICIAR2018) demonstrate that GAFR-Net consistently outperforms various state-of-the-art methods across multiple magnifications and classification tasks. These results validate the superior generalization and practical utility of GAFR-Net as a reliable decision-support tool for weakly supervised medical image analysis.
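
One hypothetical rule, to make the “IF-THEN” mapping concrete; the membership shapes and thresholds are illustrative, and the paper's differentiable module would use soft, learnable equivalents.

    import numpy as np

    def high(x, lo=0.3, hi=0.7):
        # soft "HIGH" membership: 0 below lo, 1 above hi, linear in between
        return np.clip((x - lo) / (hi - lo), 0.0, 1.0)

    def rule_strength(degree, clustering, label_consistency):
        # IF degree is HIGH AND clustering is HIGH AND label consistency is LOW
        # THEN evidence for the positive class (min as the fuzzy AND)
        return min(high(degree), high(clustering), 1.0 - high(label_consistency))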

[122] MOVA: Towards Scalable and Synchronized Video-Audio Generation

SII-OpenMOSS Team: Donghua Yu, Mingshu Chen, Qi Chen, Qi Luo, Qianyi Wu, Qinyuan Cheng, Ruixiao Li, Tianyi Liang, Wenbo Zhang, Wenming Tu, Xiangyu Peng, Yang Gao, Yanru Huo, Ying Zhu, Yinze Luo, Yiyang Zhang, Yuerong Song, Zhe Xu, Zhiyu Zhang, Chenchen Yang, Cheng Chang, Chushu Zhou, Hanfu Chen, Hongnan Ma, Jiaxi Li, Jingqi Tong, Junxi Liu, Ke Chen, Shimin Li, Shiqi Jiang, Songlin Wang, Wei Jiang, Zhaoye Fei, Zhiyuan Ning, Chunguo Li, Chenhui Li, Ziwei He, Zengfeng Huang, Xie Chen, Xipeng Qiu

Main category: cs.CV

TL;DR: MOVA is an open-source 32B parameter Mixture-of-Experts model that jointly generates synchronized high-quality video and audio content from image-text inputs, addressing the gap in audio-visual generation research.

DetailsMotivation: Current audio-visual generation relies on cascaded pipelines that increase costs, accumulate errors, and degrade quality. Closed-source systems like Veo 3 and Sora 2 limit research progress, creating a need for open-source joint audio-visual generation models.

Method: Uses Mixture-of-Experts architecture with 32B total parameters (18B active during inference) for IT2VA (Image-Text to Video-Audio) generation, supporting realistic lip-synced speech, environment-aware sound effects, and content-aligned music.

Result: Produces high-quality synchronized audio-visual content with comprehensive open-source release including model weights, code, efficient inference support, LoRA fine-tuning, and prompt enhancement tools.

Conclusion: MOVA advances audio-visual generation research by providing an open-source foundation for joint multimodal modeling, fostering community development and addressing limitations of current closed-source systems.

Abstract: Audio is indispensable for real-world video, yet generation models have largely overlooked audio components. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 emphasize the value of simultaneous generation, joint multimodal modeling introduces unique challenges in architecture, data, and training. Moreover, the closed-source nature of existing systems limits progress in the field. In this work, we introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music. MOVA employs a Mixture-of-Experts (MoE) architecture with a total of 32B parameters, of which 18B are active during inference. It supports the IT2VA (Image-Text to Video-Audio) generation task. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators. The released codebase features comprehensive support for efficient inference, LoRA fine-tuning, and prompt enhancement.
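
The abstract states the MoE configuration (32B total, 18B active) but not the routing details; a generic top-k token router of the kind most MoE transformers use looks roughly like the sketch below, with the expert count and k as placeholders rather than MOVA’s actual values.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    """Generic top-k Mixture-of-Experts layer (illustrative, not MOVA's exact design)."""
    def __init__(self, dim=64, n_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts))
        self.k = k

    def forward(self, x):  # x: (tokens, dim)
        weights, idx = self.router(x).topk(self.k, dim=-1)  # only k experts fire per token
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

y = TopKMoE()(torch.randn(10, 64))
```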

[123] Deep Modeling and Interpretation for Bladder Cancer Classification

Ahmad Chaddad, Yihang Wu, Xianrui Chen

Main category: cs.CV

TL;DR: Evaluation of 13 deep learning models (CNNs and transformers) for bladder cancer classification, analyzing classification performance, calibration, and interpretability using GradCAM++ on medical imaging data.

DetailsMotivation: Vision transformers and CNNs perform well on natural images but may not generalize to medical imaging where abnormalities are small. Need to evaluate these models for bladder cancer classification tasks.

Method: 1) Standard classification using 13 models (4 CNNs, 8 transformer-based), 2) Calibration analysis, 3) Interpretability evaluation using GradCAM++. Conducted ~300 experiments on a multicenter bladder cancer dataset.

Result: ConvNext series showed limited generalization (~60% accuracy). ViTs demonstrated better calibration than ConvNext and Swin transformers. Test time augmentation improved interpretability. No single model works for all cases - ConvNext for in-distribution, ViTs for out-of-distribution interpretation.

Conclusion: No one-size-fits-all solution exists for interpretable bladder cancer classification. Different models suit different scenarios: ConvNext for in-distribution samples, ViTs for interpreting out-of-distribution samples.

Abstract: Deep models based on vision transformer (ViT) and convolutional neural network (CNN) have demonstrated remarkable performance on natural datasets. However, these models may not perform similarly in medical imaging, where abnormal regions cover only a small portion of the image. This challenge motivates this study to investigate the latest deep models for bladder cancer classification tasks. We propose the following to evaluate these deep models: 1) standard classification using 13 models (four CNNs and eight transformer-based models), 2) calibration analysis to examine whether these models are well calibrated for bladder cancer classification, and 3) GradCAM++ to evaluate the interpretability of these models for clinical diagnosis. We run ~300 experiments on a publicly available multicenter bladder cancer dataset, and the experimental results demonstrate that the ConvNext series indicates limited generalization ability in classifying bladder cancer images (e.g., ~60% accuracy). In addition, ViTs show better calibration compared to the ConvNext and Swin transformer series. We also apply test-time augmentation to improve the models’ interpretability. Finally, no single model provides a one-size-fits-all solution for interpretable classification: the ConvNext series is suitable for in-distribution samples, while ViT and its variants are suitable for interpreting out-of-distribution samples.
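
The calibration analysis rests on standard reliability metrics; a minimal expected calibration error (ECE) computation of the usual binned form, which is presumably close to what such a study reports, is:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Standard ECE: |accuracy - mean confidence| averaged over confidence bins."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in the bin
    return ece

conf = np.random.uniform(0.5, 1.0, 1000)               # max softmax probabilities
correct = (np.random.rand(1000) < conf).astype(float)  # 1 if the prediction was right
print(expected_calibration_error(conf, correct))
```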

[124] Kyrtos: A methodology for automatic deep analysis of graphic charts with curves in technical documents

Michail S. Alexiou, Nikolaos G. Bourbakis

Main category: cs.CV

TL;DR: Kyrtos methodology for automatic recognition and analysis of chart curves in technical documents, converting them to attributed graphs and natural language descriptions.

DetailsMotivation: Deep understanding of technical documents requires accurate analysis of multimodal content including graphics, tables, diagrams, and their associations. Charts with curves in technical documents contain valuable information that needs to be automatically extracted and analyzed.

Method: Two-part approach: 1) Recognition uses clustering-based method to identify middle-points delimiting line-segments that construct curves; 2) Analysis parses extracted line-segments to capture behavioral features (direction, trend), converts segments into attributed graphs preserving structural characteristics, and expresses graph relations into natural language sentences.

Result: Extensive evaluation demonstrates accuracy of Kyrtos’ recognition and analysis methods by measuring structural similarity between input chart curves and Kyrtos-generated approximations for charts with multiple functions.

Conclusion: Kyrtos methodology successfully enables automatic recognition and analysis of chart curves in technical documents, facilitating conversion to attributed graphs and natural language descriptions for better document understanding.

Abstract: Deep Understanding of Technical Documents (DUTD) has become a very attractive field with great potential due to the large amounts of accumulated documents and the valuable knowledge contained in them. In addition, the holistic understanding of a technical document depends on the accurate analysis of its particular modalities, such as graphics, tables, diagrams, and text, and their associations. In this paper, we introduce the Kyrtos methodology for the automatic recognition and analysis of charts with curves in graphics images of technical documents. The recognition processing part adopts a clustering-based approach to recognize the middle-points that delimit the line-segments constructing the illustrated curves. The analysis processing part parses the extracted line-segments of curves to capture behavioral features such as direction and trend. These associations assist the conversion of recognized segments’ relations into attributed graphs for the preservation of the curves’ structural characteristics. The graph relations are also expressed as natural language (NL) text sentences, enriching the document’s text and facilitating their conversion into Stochastic Petri-net (SPN) graphs, which depict the internal functionality represented in the chart image. Extensive evaluation results demonstrate the accuracy of Kyrtos’ recognition and analysis methods by measuring the structural similarity between input chart curves and the approximations generated by Kyrtos for charts with multiple functions.
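
A toy version of the analysis stage described above: recognized line segments are annotated with direction and trend attributes and chained into a small attributed graph, from which natural-language sentences can then be read off. The attribute vocabulary here is an assumption; Kyrtos’ actual descriptors may differ.

```python
import math

def segment_attributes(p0, p1):
    """Direction angle and coarse trend label for one curve line-segment."""
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    angle = math.degrees(math.atan2(dy, dx))
    trend = "increasing" if dy > 0 else "decreasing" if dy < 0 else "flat"
    return {"angle": angle, "trend": trend}

# Middle-points delimiting three segments of one recognized curve.
points = [(0, 0), (1, 2), (2, 2.5), (3, 1)]
graph = {"nodes": points, "edges": []}
for a, b in zip(points[:-1], points[1:]):
    graph["edges"].append((a, b, segment_attributes(a, b)))

for a, b, attrs in graph["edges"]:      # basis for NL sentences such as
    print(f"from {a} to {b}: {attrs}")  # "the curve rises steeply, then falls"
```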

[125] Impact of domain adaptation in deep learning for medical image classifications

Yihang Wu, Ahmad Chaddad

Main category: cs.CV

TL;DR: Domain adaptation techniques applied to medical imaging show performance improvements, noise robustness, and better interpretability across various scenarios including multi-modality data, federated learning, and classifier calibration.

DetailsMotivation: Domain adaptation aims to transfer knowledge from labeled source domains to unlabeled target domains, but its application in medical imaging needs comprehensive evaluation across diverse scenarios like multi-modality data, noisy conditions, federated learning, interpretability, and calibration.

Method: Evaluated 10 deep learning models simulating common DA techniques on four medical image datasets, testing scenarios including multi-modality, noisy data, federated learning, interpretability analysis (using GradCAM++), and classifier calibration.

Result: DA with ResNet34 improved brain tumor classification by 4.7%, reduced Gaussian noise impact (∼3% accuracy increase), showed limited FL improvement (∼0.3% for skin cancer), enhanced interpretability via GradCAM++, and improved calibration (∼2% lower ECE on multi-modality data).

Conclusion: Domain adaptation provides measurable benefits in medical imaging including performance gains, noise robustness, and improved interpretability, though its impact varies across different scenarios and applications.

Abstract: Domain adaptation (DA) is a quickly expanding area in machine learning that involves adjusting a model trained in one domain to perform well in another domain. While there has been notable progress, the fundamental concept of numerous DA methodologies has persisted: aligning the data from various domains into a shared feature space. In this space, knowledge acquired from labeled source data can improve model training on target data that lacks sufficient labels. In this study, we demonstrate the use of 10 deep learning models to simulate common DA techniques and explore their application in four medical image datasets. We have considered various situations such as multi-modality, noisy data, federated learning (FL), interpretability analysis, and classifier calibration. The experimental results indicate that using DA with ResNet34 on a brain tumor (BT) dataset results in an enhancement of 4.7% in model performance. Similarly, the use of DA can reduce the impact of Gaussian noise, as it provides a ~3% accuracy increase using ResNet34 on a BT dataset. Furthermore, simply introducing DA into an FL framework shows limited potential (e.g., ~0.3% increase in performance) for skin cancer classification. In addition, the DA method can improve the interpretability of the models using the GradCAM++ technique, which offers clinical value. Calibration analysis also demonstrates that using DA provides a ~2% lower expected calibration error (ECE) compared to CNN alone on a multi-modality dataset.
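
The abstract does not single out one alignment mechanism among the simulated DA techniques; a classic member of the family is DANN-style gradient reversal, sketched here as one plausible instantiation rather than the study’s specific method.

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; reversed (scaled) gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

features = torch.randn(4, 128, requires_grad=True)
domain_head = torch.nn.Linear(128, 2)  # source vs. target classifier
domain_logits = domain_head(GradReverse.apply(features, 1.0))
loss = torch.nn.functional.cross_entropy(domain_logits, torch.tensor([0, 0, 1, 1]))
loss.backward()  # encoder gradients are reversed: features are pushed toward domain confusion,
                 # i.e., toward the shared feature space the abstract describes
```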

[126] Fully Differentiable Bidirectional Dual-Task Synergistic Learning for Semi-Supervised 3D Medical Image Segmentation

Jun Li

Main category: cs.CV

TL;DR: DBiSL is a differentiable bidirectional synergistic learning framework for semi-supervised medical image segmentation that enables online bidirectional cross-task collaboration between segmentation and regression tasks, outperforming existing methods.

DetailsMotivation: Medical image analysis suffers from scarcity of high-quality labeled data due to high annotation costs and need for clinical expertise. Current dual-task collaborative learning methods are limited to unidirectional interactions (regression-to-segmentation) and fail to exploit online bidirectional cross-task collaboration.

Method: Proposes DBiSL framework that integrates four SSL components: supervised learning, consistency regularization, pseudo-supervised learning, and uncertainty estimation. Enables differentiable bidirectional interaction between segmentation and regression tasks through synergistic learning.

Result: State-of-the-art performance on two benchmark datasets. Demonstrates effectiveness of the proposed bidirectional synergistic learning approach.

Conclusion: Provides new insights into unified SSL framework design and establishes architectural foundation for dual-task-driven SSL. Offers generic multitask learning framework applicable to broader computer vision applications.

Abstract: Semi-supervised learning relaxes the need for large pixel-wise labeled datasets in image segmentation by leveraging unlabeled data. The scarcity of high-quality labeled data remains a major challenge in medical image analysis due to the high annotation costs and the need for specialized clinical expertise. Semi-supervised learning has demonstrated significant potential in addressing this bottleneck, with pseudo-labeling and consistency regularization emerging as the two predominant paradigms. Dual-task collaborative learning, an emerging consistency-aware paradigm, seeks to derive supplementary supervision by establishing prediction consistency between related tasks. However, current methodologies are limited to unidirectional interaction mechanisms (typically regression-to-segmentation), as segmentation results can only be transformed into regression outputs in an offline manner, thereby failing to fully exploit the potential benefits of online bidirectional cross-task collaboration. Thus, we propose a fully Differentiable Bidirectional Synergistic Learning (DBiSL) framework, which seamlessly integrates and enhances four critical SSL components: supervised learning, consistency regularization, pseudo-supervised learning, and uncertainty estimation. Experiments on two benchmark datasets demonstrate our method’s state-of-the-art performance. Beyond technical contributions, this work provides new insights into unified SSL framework design and establishes a new architectural foundation for dual-task-driven SSL, while offering a generic multitask learning framework applicable to broader computer vision applications. The code will be released on GitHub upon acceptance.
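
The abstract does not spell out the task transform, but in prior dual-task SSL work a segmentation probability is obtained from a predicted signed distance map via a sharp sigmoid, which makes an online bidirectional consistency loss differentiable. A sketch under that assumption:

```python
import torch
import torch.nn.functional as F

def sdf_to_seg(sdf, k=1500.0):
    """Differentiable transform: signed distance map -> soft foreground probability."""
    return torch.sigmoid(-k * sdf)  # inside the object (sdf < 0) maps to ~1

seg_logits = torch.randn(2, 1, 16, 16, 16, requires_grad=True)  # segmentation branch (3D)
sdf_pred = torch.randn(2, 1, 16, 16, 16, requires_grad=True)    # regression branch (SDF)

consistency = F.mse_loss(torch.sigmoid(seg_logits), sdf_to_seg(sdf_pred))
consistency.backward()  # fully differentiable, so gradients reach *both* task branches online
```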

[127] Single-Slice-to-3D Reconstruction in Medical Imaging and Natural Objects: A Comparative Benchmark with SAM 3D

Yan Luo, Advaith Ravishankar, Serena Liu, Yutong Yang, Mengyu Wang

Main category: cs.CV

TL;DR: Benchmark study evaluating 5 state-of-the-art image-to-3D foundation models on medical data reconstruction from single slices, revealing depth ambiguity challenges and recommending multi-view approaches.

DetailsMotivation: 3D understanding is crucial for medical diagnosis but volumetric imaging is costly. Image-to-3D foundation models could help by reconstructing 3D from 2D, but it's unclear if geometric priors from natural images transfer to medical data.

Method: Controlled zero-shot benchmark evaluating SAM3D, Hunyuan3D-2.1, Direct3D, Hi3DGen, and TripoSG across 6 medical datasets and 2 natural datasets using voxel-based and point cloud distance metrics.

Result: Voxel-based overlap remains moderate across medical datasets, showing depth reconstruction failure. SAM3D achieves strongest topological similarity to ground truth, while other models tend to oversimplify reconstructions.

Conclusion: Single-slice medical reconstruction has limitations due to depth ambiguity from planar 2D data, motivating multi-view image-to-3D reconstruction for reliable medical 3D inference.

Abstract: A 3D understanding of anatomy is central to diagnosis and treatment planning, yet volumetric imaging remains costly with long wait times. Image-to-3D foundation models can address this by reconstructing 3D data from 2D modalities. Current foundation models are trained on natural image distributions to reconstruct naturalistic objects from a single image by leveraging geometric priors across pixels. However, it is unclear whether these learned geometric priors transfer to medical data. In this study, we present a controlled zero-shot benchmark of single-slice medical image-to-3D reconstruction across five state-of-the-art image-to-3D models: SAM3D, Hunyuan3D-2.1, Direct3D, Hi3DGen, and TripoSG. These are evaluated across six medical datasets spanning anatomical and pathological structures and two natural datasets, using voxel-based metrics and point cloud distance metrics. Across medical datasets, voxel-based overlap remains moderate for all models, consistent with a depth reconstruction failure mode when inferring volume from a single slice. In contrast, global distance metrics show more separation between methods: SAM3D achieves the strongest overall topological similarity to ground-truth medical 3D data, while alternative models are more prone to over-simplification of the reconstruction. Our results quantify the limits of single-slice medical reconstruction and highlight the depth ambiguity caused by the planar nature of 2D medical data, motivating multi-view image-to-3D reconstruction to enable reliable medical 3D inference.
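
The two metric families used in the benchmark are standard; minimal numpy versions, assuming the predictions are already aligned to the ground truth, are:

```python
import numpy as np

def voxel_iou(a, b):
    """Overlap between two boolean occupancy grids of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 1.0

def chamfer_distance(p, q):
    """Symmetric average nearest-neighbour distance between two point clouds."""
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # (|p|, |q|) pairwise
    return d.min(axis=1).mean() + d.min(axis=0).mean()

a = np.random.rand(32, 32, 32) > 0.5
b = np.random.rand(32, 32, 32) > 0.5
print(voxel_iou(a, b))
print(chamfer_distance(np.random.rand(100, 3), np.random.rand(120, 3)))
```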

[128] K-Sort Eval: Efficient Preference Evaluation for Visual Generation via Corrected VLM-as-a-Judge

Zhikai Li, Jiatong Li, Xuewen Liu, Wangbo Zhao, Pan Du, Kaicheng Zhou, Qingyi Gu, Yang You, Zhen Dong, Kurt Keutzer

Main category: cs.CV

TL;DR: K-Sort Eval: A VLM-based framework for evaluating visual generative models using posterior correction and dynamic matching to achieve human-aligned, efficient evaluation with fewer than 90 model runs.

DetailsMotivation: Current evaluation of visual generative models relies on costly human preference assessments (Arena platforms) that lack scalability. While VLMs offer a promising alternative, their hallucinations and biases compromise alignment with human preferences, and static evaluation approaches are inefficient.

Method: Proposes K-Sort Eval framework with two key components: 1) Posterior correction method that adaptively corrects VLM predictions based on consistency with human supervision in Bayesian updating, 2) Dynamic matching strategy that balances uncertainty and diversity to maximize expected benefit of each comparison. Uses (K+1)-wise free-for-all comparisons between new and existing models.

Result: Extensive experiments show K-Sort Eval delivers evaluation results consistent with human-based K-Sort Arena, typically requiring fewer than 90 model runs, demonstrating both efficiency and reliability.

Conclusion: K-Sort Eval provides a scalable, human-aligned evaluation framework for visual generative models that overcomes limitations of both human-based evaluation and naive VLM-based approaches through posterior correction and dynamic matching.

Abstract: The rapid development of visual generative models raises the need for more scalable and human-aligned evaluation methods. While crowdsourced Arena platforms offer human preference assessments by collecting human votes, they are costly and time-consuming, inherently limiting their scalability. Leveraging vision-language models (VLMs) as substitutes for manual judgments presents a promising solution. However, the inherent hallucinations and biases of VLMs hinder alignment with human preferences, thus compromising evaluation reliability. Additionally, the static evaluation approach leads to low efficiency. In this paper, we propose K-Sort Eval, a reliable and efficient VLM-based evaluation framework that integrates posterior correction and dynamic matching. Specifically, we curate a high-quality dataset from thousands of human votes in K-Sort Arena, with each instance containing the outputs and rankings of K models. When evaluating a new model, it undergoes (K+1)-wise free-for-all comparisons with existing models, and the VLM provides the rankings. To enhance alignment and reliability, we propose a posterior correction method, which adaptively corrects the posterior probability in Bayesian updating based on the consistency between the VLM prediction and human supervision. Moreover, we propose a dynamic matching strategy, which balances uncertainty and diversity to maximize the expected benefit of each comparison, thus ensuring more efficient evaluation. Extensive experiments show that K-Sort Eval delivers evaluation results consistent with K-Sort Arena, typically requiring fewer than 90 model runs, demonstrating both its efficiency and reliability.
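
A toy rendering of the posterior-correction idea: a Bradley-Terry-style skill update whose step size is attenuated when the VLM judge has historically disagreed with human votes. The specific update rule and reliability estimate are illustrative assumptions, not the paper’s equations.

```python
import math

def corrected_update(rating_a, rating_b, vlm_says_a_wins, judge_reliability, k=32.0):
    """Bradley-Terry-style update, attenuated by the VLM's agreement with humans."""
    expected_a = 1.0 / (1.0 + math.exp(rating_b - rating_a))
    outcome = 1.0 if vlm_says_a_wins else 0.0
    step = k * judge_reliability * (outcome - expected_a)  # correction shrinks noisy votes
    return rating_a + step, rating_b - step

# reliability ~ fraction of held-out human votes the VLM judge reproduced
ra, rb = corrected_update(0.0, 0.0, True, judge_reliability=0.8)
print(ra, rb)
```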

[129] LARV: Data-Free Layer-wise Adaptive Rescaling Veneer for Model Merging

Xinyu Wang, Ke Deng, Fei Dou, Jinbo Bi, Jin Lu

Main category: cs.CV

TL;DR: LARV introduces layer-wise adaptive rescaling for task-vector merging in vision transformers, addressing layer heterogeneity by suppressing shallow-layer interference and amplifying deeper-layer alignment without training or data.

DetailsMotivation: Existing task-vector merging methods treat all layers uniformly, overlooking the strong layer-wise heterogeneity in large vision transformers where shallow layers are sensitive to interference while deeper layers encode stable task-specific features.

Method: LARV is a training-free, data-free, merger-agnostic layer-wise adaptive rescaling veneer that plugs into any task-vector merger. It assigns per-layer scales to each task vector before aggregation using simple deterministic schedules based on data-free layer proxies, with options for tiered scaling or continuous mappings.

Result: LARV consistently improves all task-vector baselines across 8/14/20-task settings on FusionBench with Vision Transformers. For example, Iso-C + LARV reaches 85.9% on ViT-B/32, 89.2% on ViT-B/16, and 92.6% on ViT-L/14. Layerwise analysis shows it suppresses shallow-layer interference while amplifying deeper task-stable features.

Conclusion: LARV turns model merging into a robust, layer-aware procedure rather than a uniform one, is orthogonal to base mergers, adds negligible cost, and consistently boosts diverse merging rules for vision transformers.

Abstract: Model merging aims to combine multiple fine-tuned models into a single multi-task model without access to training data. Existing task-vector merging methods such as TIES, TSV-M, and Iso-C/CTS differ in their aggregation rules but treat all layers nearly uniformly. This assumption overlooks the strong layer-wise heterogeneity in large vision transformers, where shallow layers are sensitive to interference while deeper layers encode stable task-specific features. We introduce LARV, a training-free, data-free, merger-agnostic Layer-wise Adaptive Rescaling Veneer that plugs into any task-vector merger, assigning a per-layer scale to each task vector before aggregation; we show that it consistently boosts diverse merging rules. LARV adaptively suppresses shallow-layer interference and amplifies deeper-layer alignment using a simple deterministic schedule, requiring no retraining or modification to existing mergers. To our knowledge, this is the first work to perform layer-aware scaling for task-vector merging. LARV computes simple data-free layer proxies and turns them into scales through a lightweight rule; we study several instantiations within one framework (e.g., tiered two/three-level scaling with fixed values, or continuous mappings) and show that tiered choices offer the best robustness, while continuous mappings remain an ablation. LARV is orthogonal to the base merger and adds negligible cost. On FusionBench with Vision Transformers, LARV consistently improves all task-vector baselines across 8/14/20-task settings; for example, Iso-C + LARV reaches 85.9% on ViT-B/32, 89.2% on ViT-B/16, and 92.6% on ViT-L/14. Layerwise analysis and corruption tests further indicate that LARV suppresses shallow-layer interference while modestly amplifying deeper, task-stable features, turning model merging into a robust, layer-aware procedure rather than a uniform one.
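
The core operation is easy to state: rescale each task vector per layer before the base merger aggregates them. A minimal tiered two-level sketch over state dicts, with a placeholder tier boundary and placeholder scale values:

```python
import torch

def larv_merge(base, finetuned_models, shallow_scale=0.5, deep_scale=1.2, split=6):
    """Sum per-layer-rescaled task vectors onto the base weights (illustrative)."""
    merged = {k: v.clone() for k, v in base.items()}
    for ft in finetuned_models:
        for i, key in enumerate(base):
            scale = shallow_scale if i < split else deep_scale  # tiered two-level rule
            merged[key] += scale * (ft[key] - base[key])        # rescaled task vector
    return merged

base = {f"layer{i}.weight": torch.zeros(4, 4) for i in range(12)}
tasks = [{k: torch.randn(4, 4) for k in base} for _ in range(3)]
merged = larv_merge(base, tasks)  # damp shallow-layer interference, boost deep alignment
```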

[130] Stability and Concentration in Nonlinear Inverse Problems with Block-Structured Parameters: Lipschitz Geometry, Identifiability, and an Application to Gaussian Splatting

Joe-Mei Feng, Hsin-Hsiung Kao

Main category: cs.CV

TL;DR: Operator-theoretic framework for stability and statistical concentration in nonlinear inverse problems with block-structured parameters, with application to Gaussian Splatting rendering.

DetailsMotivation: To develop a unified theoretical framework for analyzing stability and statistical properties of nonlinear inverse problems in high-dimensional settings, particularly relevant to modern imaging and differentiable rendering where traditional linear methods fail.

Method: Develops operator-theoretic framework combining blockwise Lipschitz geometry, local identifiability, and sub-Gaussian noise assumptions. Establishes deterministic stability inequalities, global Lipschitz bounds for least-squares misfit functionals, and nonasymptotic concentration estimates.

Result: Derives high-probability parameter error bounds intrinsic to forward operators, independent of specific algorithms. Shows Gaussian Splatting rendering operator satisfies assumptions, revealing fundamental stability-resolution tradeoff where estimation error is constrained by image resolution to model complexity ratio.

Conclusion: Provides theoretical foundation for understanding operator-level limits in high-dimensional nonlinear inverse problems, with concrete application to differentiable rendering systems like Gaussian Splatting, establishing fundamental performance bounds.

Abstract: We develop an operator-theoretic framework for stability and statistical concentration in nonlinear inverse problems with block-structured parameters. Under a unified set of assumptions combining blockwise Lipschitz geometry, local identifiability, and sub-Gaussian noise, we establish deterministic stability inequalities, global Lipschitz bounds for least-squares misfit functionals, and nonasymptotic concentration estimates. These results yield high-probability parameter error bounds that are intrinsic to the forward operator and independent of any specific reconstruction algorithm. As a concrete instantiation, we verify that the Gaussian Splatting rendering operator satisfies the proposed assumptions and derive explicit constants governing its Lipschitz continuity and resolution-dependent observability. This leads to a fundamental stability–resolution tradeoff, showing that estimation error is inherently constrained by the ratio between image resolution and model complexity. Overall, the analysis characterizes operator-level limits for a broad class of high-dimensional nonlinear inverse problems arising in modern imaging and differentiable rendering.
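
Deterministic stability statements in this setting typically take the following schematic shape: a lower-Lipschitz (identifiability) bound on the forward operator turns data misfit plus noise into a parameter error bound. The constants and block structure are simplified here for illustration and are not the paper's exact statements.

```latex
% Schematic stability bound for a block-structured parameter \theta = (\theta_1, \dots, \theta_B):
% if F is locally identifiable with lower-Lipschitz constant c > 0 on the relevant neighborhood,
\[
  c \, \| \theta - \theta^\star \| \;\le\; \| F(\theta) - F(\theta^\star) \|
  \quad\Longrightarrow\quad
  \| \hat\theta - \theta^\star \| \;\le\; \frac{1}{c}\,\big( \| F(\hat\theta) - y \| + \| \varepsilon \| \big),
\]
% where y = F(\theta^\star) + \varepsilon and \varepsilon is sub-Gaussian; concentration of
% \|\varepsilon\| then yields the algorithm-independent high-probability bounds described above.
```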

[131] Bridging the Modality Gap in Roadside LiDAR: A Training-Free Vision-Language Model Framework for Vehicle Classification

Yiqiao Li, Bo Shang, Jie Wei

Main category: cs.CV

TL;DR: A framework that adapts off-the-shelf Vision-Language Models (VLMs) for fine-grained truck classification from sparse LiDAR point clouds by transforming them into depth-encoded 2D visual proxies, enabling few-shot learning without parameter fine-tuning.

DetailsMotivation: Current LiDAR-based truck classification methods face scalability issues due to reliance on supervised deep learning and labor-intensive manual annotation. VLMs offer few-shot generalization but have a modality gap between sparse 3D point clouds and dense 2D imagery.

Method: Proposes a depth-aware image generation pipeline that transforms sparse LiDAR scans into depth-encoded 2D visual proxies using noise removal, spatial/temporal registration, orientation rectification, morphological operations, and anisotropic smoothing. Adapts off-the-shelf VLMs without parameter fine-tuning.

Result: Achieves competitive classification accuracy with 16-30 examples per class on a real-world dataset of 20 vehicle classes. Shows a “Semantic Anchor” effect where text-based guidance helps in ultra-low-shot regimes but degrades in more-shot settings. Achieves over 75% correct classification for specific container categories without training/fine-tuning.

Conclusion: The framework provides a scalable alternative to data-intensive supervised baselines, reduces manual labeling demands, and serves as an effective Cold Start strategy for bootstrapping lightweight supervised models in intelligent transportation systems.

Abstract: Fine-grained truck classification is critical for intelligent transportation systems (ITS), yet current LiDAR-based methods face scalability challenges due to their reliance on supervised deep learning and labor-intensive manual annotation. Vision-Language Models (VLMs) offer promising few-shot generalization, but their application to roadside LiDAR is limited by a modality gap between sparse 3D point clouds and dense 2D imagery. We propose a framework that bridges this gap by adapting off-the-shelf VLMs for fine-grained truck classification without parameter fine-tuning. Our new depth-aware image generation pipeline applies noise removal, spatial and temporal registration, orientation rectification, morphological operations, and anisotropic smoothing to transform sparse, occluded LiDAR scans into depth-encoded 2D visual proxies. Validated on a real-world dataset of 20 vehicle classes, our approach achieves competitive classification accuracy with as few as 16-30 examples per class, offering a scalable alternative to data-intensive supervised baselines. We further observe a “Semantic Anchor” effect: text-based guidance regularizes performance in ultra-low-shot regimes (k < 4) but degrades accuracy in more-shot settings due to semantic mismatch. Furthermore, we demonstrate the efficacy of this framework as a Cold Start strategy, using VLM-generated labels to bootstrap lightweight supervised models. Notably, the few-shot VLM-based model achieves a correct classification rate of over 75 percent for specific drayage categories (20ft, 40ft, and 53ft containers) entirely without costly training or fine-tuning, significantly reducing the demands of initial manual labeling and making the method practical for ITS applications.
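
A toy version of the core modality bridge: orthographically project LiDAR points onto a 2D grid and encode lateral depth as pixel intensity, yielding an image proxy a VLM can consume. The resolution and normalization are assumptions; the paper’s full pipeline adds denoising, registration, morphology, and smoothing.

```python
import numpy as np

def depth_image(points, res=64):
    """Side-view orthographic projection with depth-encoded intensity (illustrative)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    u = ((x - x.min()) / (np.ptp(x) + 1e-9) * (res - 1)).astype(int)
    v = ((z - z.min()) / (np.ptp(z) + 1e-9) * (res - 1)).astype(int)
    depth = (y - y.min()) / (np.ptp(y) + 1e-9)  # lateral distance -> [0, 1]
    img = np.zeros((res, res))
    for ui, vi, di in zip(u, v, depth):
        img[res - 1 - vi, ui] = max(img[res - 1 - vi, ui], 1.0 - di)  # nearer = brighter
    return img

cloud = np.random.rand(2000, 3) * [10.0, 3.0, 3.0]  # stand-in for a sparse roadside scan
proxy = depth_image(cloud)                           # feed to an off-the-shelf VLM as an image
```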

[132] SceneReVis: A Self-Reflective Vision-Grounded Framework for 3D Indoor Scene Synthesis via Multi-turn RL

Yang Zhao, Shizhao Sun, Meisheng Zhang, Yingdong Shi, Xubo Yang, Jiang Bian

Main category: cs.CV

TL;DR: SceneReVis: A vision-grounded self-reflection framework for 3D scene synthesis that uses iterative diagnose-and-act loops to resolve spatial conflicts through multi-modal feedback.

DetailsMotivation: Current one-pass 3D scene synthesis methods suffer from spatial hallucinations like collisions due to lack of deliberative reasoning, creating a need for frameworks that can explicitly intercept and resolve spatial conflicts.

Method: Introduces SceneReVis with iterative “diagnose-and-act” loops using multi-modal feedback, creates SceneChain-12k dataset via reverse engineering pipeline, and uses two-stage training: Supervised Fine-Tuning followed by Agentic Reinforcement Learning.

Result: Achieves state-of-the-art performance in high-fidelity generation and goal-oriented optimization, with robust generalization to long-tail domains.

Conclusion: SceneReVis effectively bridges the gap in 3D scene synthesis by introducing deliberative reasoning through vision-grounded self-reflection, significantly reducing spatial hallucinations.

Abstract: Current one-pass 3D scene synthesis methods often suffer from spatial hallucinations, such as collisions, due to a lack of deliberative reasoning. To bridge this gap, we introduce SceneReVis, a vision-grounded self-reflection framework that employs an iterative “diagnose-and-act” loop to explicitly intercept and resolve spatial conflicts using multi-modal feedback. To support this step-wise paradigm, we construct SceneChain-12k, a large-scale dataset of causal construction trajectories derived through a novel reverse engineering pipeline. We further propose a two-stage training recipe that transitions from Supervised Fine-Tuning to Agentic Reinforcement Learning, evolving the model into an active spatial planner. Extensive experiments demonstrate that SceneReVis achieves state-of-the-art performance in high-fidelity generation and goal-oriented optimization, with robust generalization to long-tail domains.
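
A skeletal diagnose-and-act loop of the kind described: check the scene for a spatial conflict (here a simple 2D footprint overlap), act to resolve it, and re-check over multiple turns. Both the conflict test and the repair move are deliberately simplistic stand-ins for the paper’s vision-grounded feedback.

```python
def overlaps(a, b):
    """Axis-aligned 2D footprint overlap test for (x, y, w, h) boxes."""
    return not (a[0] + a[2] <= b[0] or b[0] + b[2] <= a[0]
                or a[1] + a[3] <= b[1] or b[1] + b[3] <= a[1])

def diagnose_and_act(scene, max_turns=10):
    for _ in range(max_turns):  # multi-turn refinement
        conflicts = [(i, j) for i in range(len(scene)) for j in range(i + 1, len(scene))
                     if overlaps(scene[i], scene[j])]
        if not conflicts:       # diagnosis: scene is clean
            return scene
        i, j = conflicts[0]     # act: nudge the second object clear of the first
        x, y, w, h = scene[j]
        scene[j] = (scene[i][0] + scene[i][2] + 0.1, y, w, h)
    return scene

scene = [(0, 0, 1, 1), (0.5, 0.2, 1, 1)]  # two colliding furniture footprints
print(diagnose_and_act(scene))
```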

[133] Fine-T2I: An Open, Large-Scale, and Diverse Dataset for High-Quality T2I Fine-Tuning

Xu Ma, Yitian Zhang, Qihua Dong, Yun Fu

Main category: cs.CV

TL;DR: Fine-T2I is a large-scale, high-quality open dataset for text-to-image fine-tuning, combining synthetic and real images with rigorous filtering, containing 6M text-image pairs across diverse tasks and styles.

DetailsMotivation: There's a major bottleneck in text-to-image fine-tuning due to lack of high-quality open datasets. Most public datasets suffer from low resolution, poor text-image alignment, or limited diversity, creating a performance gap between open research models and enterprise-grade models.

Method: Created Fine-T2I dataset spanning 10 task combinations, 32 prompt categories, 11 visual styles, and 5 prompt templates. Combines synthetic images from strong modern models with curated real images from professional photographers. All samples rigorously filtered for text-image alignment, visual fidelity, and prompt quality (over 95% removal rate).

Result: Final dataset contains over 6 million text-image pairs (~2 TB), approaching pretraining dataset scale while maintaining fine-tuning-level quality. Fine-tuning on Fine-T2I consistently improves generation quality and instruction adherence across diverse pretrained diffusion and autoregressive models, validated by human evaluation, visual comparison, and automatic metrics.

Conclusion: Fine-T2I helps close the data gap in text-to-image fine-tuning in the open community. The dataset is released under an open license to advance research in this area.

Abstract: High-quality and open datasets remain a major bottleneck for text-to-image (T2I) fine-tuning. Despite rapid progress in model architectures and training pipelines, most publicly available fine-tuning datasets suffer from low resolution, poor text-image alignment, or limited diversity, resulting in a clear performance gap between open research models and enterprise-grade models. In this work, we present Fine-T2I, a large-scale, high-quality, and fully open dataset for T2I fine-tuning. Fine-T2I spans 10 task combinations, 32 prompt categories, 11 visual styles, and 5 prompt templates, and combines synthetic images generated by strong modern models with carefully curated real images from professional photographers. All samples are rigorously filtered for text-image alignment, visual fidelity, and prompt quality, with over 95% of initial candidates removed. The final dataset contains over 6 million text-image pairs, around 2 TB on disk, approaching the scale of pretraining datasets while maintaining fine-tuning-level quality. Across a diverse set of pretrained diffusion and autoregressive models, fine-tuning on Fine-T2I consistently improves both generation quality and instruction adherence, as validated by human evaluation, visual comparison, and automatic metrics. We release Fine-T2I under an open license to help close the data gap in T2I fine-tuning in the open community.

[134] A Scoping Review of Deep Learning for Urban Visual Pollution and Proposal of a Real-Time Monitoring Framework with a Visual Pollution Index

Mohammad Masudur Rahman, Md. Rashedur Rahman, Ashraful Islam, Saadia B Alam, M Ashraful Amin

Main category: cs.CV

TL;DR: Scoping review of deep learning approaches for Urban Visual Pollution detection, classification, and management framework development.

DetailsMotivation: Urban Visual Pollution (UVP) is a critical concern but research on automatic detection and application is fragmented, lacking unified systems for comprehensive visual pollution management.

Method: PRISMA-ScR guided systematic review of 7 academic databases, analyzing 26 articles on deep learning approaches (YOLO, Faster R-CNN, EfficientDet) for UVP detection and classification.

Result: Most research focuses on specific pollutant categories with limited datasets; few studies integrate detection into real-time systems; proposed framework includes visual pollution index for severity assessment.

Conclusion: Need for unified UVP management system with standardized taxonomy, cross-city benchmark dataset, generalized deep learning model, and assessment index for sustainable urban aesthetics and well-being.

Abstract: Urban Visual Pollution (UVP) has emerged as a critical concern, yet research on automatic detection and application remains fragmented. This scoping review maps the existing deep learning-based approaches for detecting, classifying, and designing a comprehensive application framework for visual pollution management. Following the PRISMA-ScR guidelines, seven academic databases (Scopus, Web of Science, IEEE Xplore, ACM DL, ScienceDirect, SpringerNatureLink, and Wiley) were systematically searched and reviewed, and 26 articles were found. Most research focuses on specific pollutant categories and employs variations of YOLO, Faster R-CNN, and EfficientDet architectures. Although several datasets exist, they are limited to specific areas and lack standardized taxonomies. Few studies integrate detection into real-time application systems, yet they tend to be geographically skewed. We propose a framework for monitoring visual pollution that integrates a visual pollution index to assess the severity of visual pollution in a given area. This review highlights the need for a unified UVP management system that incorporates pollutant taxonomy, a cross-city benchmark dataset, a generalized deep learning model, and an assessment index that supports sustainable urban aesthetics and enhances the well-being of urban dwellers.

[135] Look-Ahead and Look-Back Flows: Training-Free Image Generation with Trajectory Smoothing

Yan Luo, Henry Huang, Todd Y. Zhou, Mengyu Wang

Main category: cs.CV

TL;DR: Training-free latent trajectory smoothing methods (Look-Ahead and Look-Back) for diffusion models that adjust generative paths in latent space using velocity information to reduce error accumulation.

DetailsMotivation: Existing training-free flow matching approaches adjust velocity fields, which introduces errors that propagate through the full generation path. Adjusting latent trajectories instead allows natural correction by pretrained velocity networks, reducing error accumulation.

Method: Two complementary training-free latent-trajectory adjustment approaches: 1) Look-Ahead averages current and next-step latents using curvature-gated weight, 2) Look-Back smoothes latents using exponential moving average with decay. Both refine generative paths directly in latent space using future and past velocity/latent information.

Result: Substantially outperforms various state-of-the-art models across multiple datasets including COCO17, CUB-200, and Flickr30K based on extensive experiments and comprehensive evaluation metrics.

Conclusion: Training-free latent trajectory smoothing effectively improves diffusion model generation by reducing error accumulation through direct adjustment of generative paths in latent space.

Abstract: Recent advances have reformulated diffusion models as deterministic ordinary differential equations (ODEs) through the framework of flow matching, providing a unified formulation for the noise-to-data generative process. Various training-free flow matching approaches have been developed to improve image generation through flow velocity field adjustment, eliminating the need for costly retraining. However, modifying the velocity field $v$ introduces errors that propagate through the full generation path, whereas adjustments to the latent trajectory $z$ are naturally corrected by the pretrained velocity network, reducing error accumulation. In this paper, we propose two complementary training-free latent-trajectory smoothing schemes that use future and past velocity and latent information to refine the generative path directly in latent space: Look-Ahead, which averages the current and next-step latents using a curvature-gated weight, and Look-Back, which smooths the latents using an exponential moving average with decay. We demonstrate through extensive experiments and comprehensive evaluation metrics that the proposed training-free trajectory smoothing models substantially outperform various state-of-the-art models across multiple datasets including COCO17, CUB-200, and Flickr30K.
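
Both schemes drop into a plain Euler flow-matching sampler. The sketch below follows the abstract’s description (curvature-gated averaging for Look-Ahead, an EMA with decay for Look-Back), but the exact gating and weighting formulas are assumptions.

```python
import torch

def sample(v_net, z, steps=50, lam=0.5, beta=0.9, mode="look_ahead"):
    """Euler ODE sampling with latent-trajectory smoothing (illustrative)."""
    dt = 1.0 / steps
    ema = z.clone()
    for i in range(steps):
        t = torch.full((z.shape[0],), i * dt)
        v = v_net(z, t)
        z_next = z + dt * v                      # plain Euler proposal
        if mode == "look_ahead":
            v_next = v_net(z_next, t + dt)
            curvature = (v_next - v).norm() / (v.norm() + 1e-8)
            w = lam * torch.sigmoid(-curvature)  # assumed gate: smooth more on straight paths
            z = (1 - w) * z_next + w * z         # average current and next-step latents
        else:                                    # "look_back": EMA over the latent trajectory
            ema = beta * ema + (1 - beta) * z_next
            z = ema
    return z

v_net = lambda z, t: -z  # toy velocity field, stands in for the pretrained network
out = sample(v_net, torch.randn(4, 8))
```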

[136] ArtifactLens: Hundreds of Labels Are Enough for Artifact Detection with VLMs

James Burgess, Rameen Abdal, Dan Stoddart, Sergey Tulyakov, Serena Yeung-Levy, Kuan-Chieh Jackson Wang

Main category: cs.CV

TL;DR: ArtifactLens: A VLM-based system that detects image generation artifacts using minimal labeled data through in-context learning and text instruction optimization

DetailsMotivation: Current detectors require expensive fine-tuning on thousands of labeled images, which is impractical as generators evolve and new artifact types emerge. The authors aim to leverage pretrained VLMs' existing knowledge to detect artifacts with minimal labeled data.

Method: ArtifactLens uses a multi-component architecture with in-context learning and text instruction optimization. It unlocks pretrained VLMs’ artifact detection capabilities using only a few hundred labeled examples per artifact category, with novel improvements to each component.

Result: Achieves state-of-the-art on five human artifact benchmarks (first evaluation across multiple datasets) while requiring orders of magnitude less labeled data. Generalizes to other artifact types (object morphology, animal anatomy, entity interactions) and AIGC detection.

Conclusion: Pretrained VLMs already encode artifact detection knowledge; with proper scaffolding, this capability can be unlocked with minimal labeled data, enabling efficient detection as generators evolve and new artifacts emerge.

Abstract: Modern image generators produce strikingly realistic images, where only artifacts like distorted hands or warped objects reveal their synthetic origin. Detecting these artifacts is essential: without detection, we cannot benchmark generators or train reward models to improve them. Current detectors fine-tune VLMs on tens of thousands of labeled images, but this is expensive to repeat whenever generators evolve or new artifact types emerge. We show that pretrained VLMs already encode the knowledge needed to detect artifacts - with the right scaffolding, this capability can be unlocked using only a few hundred labeled examples per artifact category. Our system, ArtifactLens, achieves state-of-the-art on five human artifact benchmarks (the first evaluation across multiple datasets) while requiring orders of magnitude less labeled data. The scaffolding consists of a multi-component architecture with in-context learning and text instruction optimization, with novel improvements to each. Our methods generalize to other artifact types - object morphology, animal anatomy, and entity interactions - and to the distinct task of AIGC detection.

[137] FD-DB: Frequency-Decoupled Dual-Branch Network for Unpaired Synthetic-to-Real Domain Translation

Chuanhai Zang, Jiabao Hu, XW Song

Main category: cs.CV

TL;DR: FD-DB: A frequency-decoupled dual-branch model for synthetic-to-real domain adaptation that separates appearance transfer into interpretable low-frequency editing and high-frequency residual compensation to balance photorealism and structural stability.

DetailsMotivation: Synthetic data provides low-cost annotated samples for geometry-sensitive vision tasks, but domain shift between synthetic and real domains degrades performance. Existing unpaired translation methods face a trade-off between photorealism (which may introduce deformation) and structural stability (which limits adaptation to real-domain statistics).

Method: Proposes FD-DB with two branches: 1) interpretable branch predicts physically meaningful editing parameters (white balance, exposure, contrast, saturation, blur, grain) for stable low-frequency appearance base, 2) free branch complements fine details through residual generation. Uses gated fusion mechanism with explicit frequency constraints to limit low-frequency drift, and two-stage training schedule that first stabilizes editing branch then releases residual branch.

Result: Experiments on YCB-V dataset show FD-DB improves real-domain appearance consistency and significantly boosts downstream semantic segmentation performance while preserving geometric and semantic structures.

Conclusion: FD-DB effectively addresses the photorealism-structural stability trade-off in synthetic-to-real translation through frequency-decoupled design, achieving better domain adaptation for geometry-sensitive vision tasks.

Abstract: Synthetic data provide low-cost, accurately annotated samples for geometry-sensitive vision tasks, but appearance and imaging differences between synthetic and real domains cause severe domain shift and degrade downstream performance. Unpaired synthetic-to-real translation can reduce this gap without paired supervision, yet existing methods often face a trade-off between photorealism and structural stability: unconstrained generation may introduce deformation or spurious textures, while overly rigid constraints limit adaptation to real-domain statistics. We propose FD-DB, a frequency-decoupled dual-branch model that separates appearance transfer into low-frequency interpretable editing and high-frequency residual compensation. The interpretable branch predicts physically meaningful editing parameters (white balance, exposure, contrast, saturation, blur, and grain) to build a stable low-frequency appearance base with strong content preservation. The free branch complements fine details through residual generation, and a gated fusion mechanism combines the two branches under explicit frequency constraints to limit low-frequency drift. We further adopt a two-stage training schedule that first stabilizes the editing branch and then releases the residual branch to improve optimization stability. Experiments on the YCB-V dataset show that FD-DB improves real-domain appearance consistency and significantly boosts downstream semantic segmentation performance while preserving geometric and semantic structures.
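
The dual-branch split can be mocked up compactly: a low-frequency base produced by global, physically meaningful edits, a high-frequency residual from a free generator, and a learned gate fusing them. The editing branch is reduced to exposure/contrast here for brevity, and all module shapes are placeholders rather than FD-DB’s architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FDDBToy(nn.Module):
    """Frequency-decoupled dual branch, heavily simplified for illustration."""
    def __init__(self):
        super().__init__()
        self.param_net = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                       nn.Linear(3, 2))    # predicts exposure, contrast
        self.residual_net = nn.Conv2d(3, 3, 3, padding=1)  # free high-frequency branch
        self.gate = nn.Conv2d(3, 1, 1)                     # fusion gate

    def forward(self, x):
        exposure, contrast = 0.5 * self.param_net(x).mean(0).tanh()
        base = (x + exposure) * (1.0 + contrast)           # interpretable low-frequency edit
        r = self.residual_net(x)
        residual = r - F.avg_pool2d(r, 7, stride=1, padding=3)  # keep only high frequencies
        g = torch.sigmoid(self.gate(x))
        return base + g * residual                         # gated fusion of the two branches

y = FDDBToy()(torch.rand(2, 3, 32, 32))
```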

[138] Weakly Supervised Contrastive Learning for Histopathology Patch Embeddings

Bodong Zhang, Xiwen Li, Hamid Manoochehri, Xiaoya Tang, Deepika Sirohi, Beatrice S. Knudsen, Tolga Tasdizen

Main category: cs.CV

TL;DR: Weakly supervised contrastive learning framework for digital histopathology that uses only slide-level labels to learn better patch features for multiple instance learning, improving downstream performance without instance-level annotations.

DetailsMotivation: Digital histopathology analysis suffers from limited training labels due to the immense effort required for manual annotation of gigapixel whole slide images. While weakly supervised MIL using slide-level labels helps, most methods use frozen patch features and focus on aggregation, neglecting feature representation learning in MIL settings.

Method: Proposes WeakSupCon, a weakly supervised contrastive learning framework that incorporates bag-level label information during training. The method doesn’t rely on instance-level pseudo-labeling but effectively separates patches with different labels in feature space by leveraging slide-level supervision.

Result: Experimental results show that image features generated by WeakSupCon lead to improved downstream MIL performance compared to self-supervised contrastive learning approaches across three datasets.

Conclusion: The proposed weakly supervised contrastive learning framework effectively learns better patch representations using only slide-level labels, addressing the feature representation learning gap in MIL settings for digital histopathology analysis.

Abstract: Digital histopathology whole slide images (WSIs) provide gigapixel-scale high-resolution images that are highly useful for disease diagnosis. However, digital histopathology image analysis faces significant challenges due to limited training labels, since manually annotating specific regions or small patches cropped from large WSIs requires substantial time and effort. Weakly supervised multiple instance learning (MIL) offers a practical and efficient solution by requiring only bag-level (slide-level) labels, while each bag typically contains multiple instances (patches). Most MIL methods directly use frozen image patch features generated by various image encoders as inputs and primarily focus on feature aggregation. However, feature representation learning for encoder pretraining in MIL settings has largely been neglected. In our work, we propose a novel feature representation learning framework called weakly supervised contrastive learning (WeakSupCon) that incorporates bag-level label information during training. Our method does not rely on instance-level pseudo-labeling, yet it effectively separates patches with different labels in the feature space. Experimental results demonstrate that the image features generated by our WeakSupCon method lead to improved downstream MIL performance compared to self-supervised contrastive learning approaches on three datasets. Our related code is available at github.com/BzhangURU/Paper_WeakSupCon_for_MIL
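
A compact rendering of the central idea: a supervised-contrastive-style loss in which each patch inherits its slide-level bag label, so patches from same-label slides attract and different-label slides repel, with no instance-level pseudo-labels. The temperature and loss form follow standard SupCon; whether this matches the paper’s exact objective is an assumption.

```python
import torch
import torch.nn.functional as F

def weak_supcon_loss(feats, bag_labels, tau=0.1):
    """SupCon-style loss where each patch inherits its slide-level (bag) label."""
    z = F.normalize(feats, dim=1)
    n = z.shape[0]
    eye = torch.eye(n, dtype=torch.bool)
    sim = (z @ z.t() / tau).masked_fill(eye, float("-inf"))  # drop self-pairs
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    positives = (bag_labels[:, None] == bag_labels[None, :]) & ~eye
    n_pos = positives.sum(1).clamp(min=1)
    per_anchor = log_prob.masked_fill(~positives, 0.0).sum(1) / n_pos
    return -per_anchor.mean()

feats = torch.randn(16, 128, requires_grad=True)  # patch embeddings in a batch
bag_labels = torch.randint(0, 2, (16,))           # slide-level labels only
weak_supcon_loss(feats, bag_labels).backward()
```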

[139] Beyond Next-Token Alignment: Distilling Multimodal Large Language Models via Token Interactions

Lin Chen, Xiaoke Zhao, Kun Ding, Weiwei Feng, Changtao Miao, Zili Wang, Wenxuan Guo, Ying Wang, Kaiyuan Zheng, Bo Zhang, Zhe Li, Shiming Xiang

Main category: cs.CV

TL;DR: Align-TI: A knowledge distillation framework for compressing multimodal LLMs by aligning token interactions, focusing on vision-instruction alignment and token transition probabilities.

DetailsMotivation: Existing knowledge distillation methods for MLLMs rely on static next-token alignment, neglecting dynamic token interactions that are essential for multimodal understanding and generation capabilities.

Method: Align-TI introduces two components: IVA (Instruction-Visual Alignment) to imitate teacher’s visual information extraction capability by aligning on salient visual regions, and TPA (Token Transition Probability Alignment) to capture teacher’s dynamic generative logic by aligning sequential token-to-token transition probabilities.

Result: Achieves 2.6% relative improvement over Vanilla KD, and distilled Align-TI-2B outperforms LLaVA-1.5-7B (a much larger MLLM) by 7.0%, establishing new SOTA for parameter-efficient MLLM distillation.

Conclusion: Align-TI provides an effective knowledge distillation framework that captures essential token interactions for compressing multimodal LLMs while maintaining strong multimodal understanding and generation capabilities.

Abstract: Multimodal Large Language Models (MLLMs) demonstrate impressive cross-modal capabilities, yet their substantial size poses significant deployment challenges. Knowledge distillation (KD) is a promising solution for compressing these models, but existing methods primarily rely on static next-token alignment, neglecting the dynamic token interactions, which embed essential capabilities for multimodal understanding and generation. To this end, we introduce Align-TI, a novel KD framework designed from the perspective of Token Interactions. Our approach is motivated by the insight that MLLMs rely on two primary interactions: vision-instruction token interactions to extract relevant visual information, and intra-response token interactions for coherent generation. Accordingly, Align-TI introduces two components: IVA enables the student model to imitate the teacher’s instruction-relevant visual information extraction capability by aligning on salient visual regions. TPA captures the teacher’s dynamic generative logic by aligning the sequential token-to-token transition probabilities. Extensive experiments demonstrate Align-TI’s superiority. Notably, our approach achieves a 2.6% relative improvement over Vanilla KD, and our distilled Align-TI-2B even outperforms LLaVA-1.5-7B (a much larger MLLM) by 7.0%, establishing a new state-of-the-art distillation framework for training parameter-efficient MLLMs. Code is available at https://github.com/lchen1019/Align-TI.
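
The TPA component reduces to matching teacher and student next-token transition distributions along the response; a minimal KL-based version is sketched below, with the exact divergence, temperature, and weighting in the paper being assumptions here.

```python
import torch
import torch.nn.functional as F

def tpa_loss(student_logits, teacher_logits, temperature=2.0):
    """Align sequential token-to-token transition probabilities (illustrative).

    logits: (batch, seq_len, vocab). Each position's distribution over the *next*
    token is the transition distribution aligned between teacher and student.
    """
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

student = torch.randn(2, 10, 32000, requires_grad=True)
teacher = torch.randn(2, 10, 32000)
tpa_loss(student, teacher).backward()
```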

[140] OSI: One-step Inversion Excels in Extracting Diffusion Watermarks

Yuwei Chen, Zhenliang He, Jia Tang, Meina Kan, Shiguang Shan

Main category: cs.CV

TL;DR: OSI (One-step Inversion) is a fast, accurate method for extracting Gaussian Shading watermarks from diffusion-generated images by reformulating extraction as a learnable sign classification problem instead of multi-step diffusion inversion.

DetailsMotivation: Current watermark extraction methods for diffusion-generated images require computationally expensive multi-step diffusion inversion to obtain precise initial noise, which is slow and inefficient. There's a need for faster, more accurate watermark extraction.

Method: Reformulates watermark extraction as a learnable sign classification problem rather than precise regression of initial noise. Initializes OSI model from diffusion backbone and fine-tunes on synthesized noise-image pairs with sign classification objective, enabling one-step extraction.

Result: OSI is 20x faster than multi-step diffusion inversion, achieves higher extraction accuracy, doubles watermark payload capacity, and shows consistent improvements across diverse schedulers, diffusion backbones, and cryptographic schemes.

Conclusion: OSI provides an efficient, general framework for watermark extraction from diffusion-generated images that significantly outperforms existing methods in speed, accuracy, and capacity.

Abstract: Watermarking is an important mechanism for provenance and copyright protection of diffusion-generated images. Training-free methods, exemplified by Gaussian Shading, embed watermarks into the initial noise of diffusion models with negligible impact on the quality of generated images. However, extracting this type of watermark typically requires multi-step diffusion inversion to obtain the precise initial noise, which is computationally expensive and time-consuming. To address this issue, we propose One-step Inversion (OSI), a significantly faster and more accurate method for extracting Gaussian Shading style watermarks. OSI reformulates watermark extraction as a learnable sign classification problem, which eliminates the need for precise regression of the initial noise. Then, we initialize the OSI model from the diffusion backbone and fine-tune it on synthesized noise-image pairs with a sign classification objective. In this manner, the OSI model is able to accomplish watermark extraction efficiently in only one step. Our OSI substantially outperforms the multi-step diffusion inversion method: it is 20x faster, achieves higher extraction accuracy, and doubles the watermark payload capacity. Extensive experiments across diverse schedulers, diffusion backbones, and cryptographic schemes consistently show improvements, demonstrating the generality of our OSI framework.
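
The reformulation is the interesting part: instead of regressing the exact initial noise by inverting the full ODE, a one-step network is trained to classify the sign of each initial-noise element, which is what Gaussian Shading-style schemes need to recover the watermark bits. A minimal training-objective sketch; the extractor here is a placeholder, not the diffusion-backbone initialization the paper uses, and the latent shape is illustrative.

```python
import torch
import torch.nn as nn

extractor = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(16, 4, 3, padding=1))  # one forward pass only

def osi_step(image, initial_noise):
    """BCE on the *sign* of the initial noise - no precise regression needed."""
    logits = extractor(image)                   # latent-shaped sign logits
    sign_targets = (initial_noise > 0).float()  # the watermark-bearing signs
    return nn.functional.binary_cross_entropy_with_logits(logits, sign_targets)

image = torch.rand(2, 3, 64, 64)   # synthesized (noise, image) training pair
noise = torch.randn(2, 4, 64, 64)  # the known initial noise
osi_step(image, noise).backward()

# Extraction at test time is a single forward pass: bits = extractor(img) > 0
```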

[141] Equilibrium contrastive learning for imbalanced image classification

Sumin Roh, Harim Kim, Ho Yun Lee, Il Yong Chun

Main category: cs.CV

TL;DR: ECL (Equilibrium Contrastive Learning) is a supervised contrastive learning framework that addresses geometric imbalances in representation space for imbalanced datasets by harmonizing class features, means, and classifiers.

DetailsMotivation: Existing supervised contrastive learning methods for imbalanced datasets have two key limitations: 1) they don't align class means/prototypes with classifiers, leading to poor generalization, and 2) prototype-based methods treat prototypes as only one additional sample per class, causing unbalanced contributions across classes.

Method: ECL uses two main components: 1) representation geometric equilibrium that promotes regular simplex geometry with collapsed class samples and uniformly distributed class means while balancing contributions of class-average features and prototypes, and 2) classifier-class center geometric equilibrium that aligns classifier weights with class prototypes.
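
The summary names the two equilibria but not the loss form. The sketch below is one loose reading, assuming normalized features, learnable class prototypes, and a cross-entropy-style pull toward balanced class centers; none of these choices are confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def balanced_center_loss(feats, labels, prototypes, tau=0.1):
    """Loose sketch of equilibrium between features, means, and prototypes.

    feats: (B, D) L2-normalized features; labels: (B,) ints in [0, C);
    prototypes: (C, D) L2-normalized learnable prototypes. Class-average
    features and prototypes are combined with equal weight per class, so
    a class's influence no longer depends on its batch count.
    """
    C = prototypes.size(0)
    onehot = F.one_hot(labels, C).float()                  # (B, C)
    counts = onehot.sum(0).clamp(min=1)                    # per-class counts
    class_mean = F.normalize(onehot.t() @ feats / counts[:, None], dim=1)
    centers = F.normalize(class_mean + prototypes, dim=1)  # balanced centers
    logits = feats @ centers.t() / tau
    return F.cross_entropy(logits, labels)                 # pull to own center

def classifier_alignment(classifier_weight, prototypes):
    """Classifier-class center equilibrium: cosine-align weight rows
    with their class prototypes (illustrative form)."""
    w = F.normalize(classifier_weight, dim=1)
    p = F.normalize(prototypes, dim=1)
    return (1 - (w * p).sum(dim=1)).mean()
```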

Result: ECL outperforms existing state-of-the-art supervised CL methods on three long-tailed datasets (CIFAR-10/100-LT, ImageNet-LT) and two imbalanced medical datasets (ISIC 2019 and LCCT dataset).

Conclusion: ECL successfully addresses geometric imbalances in supervised contrastive learning for imbalanced classification by establishing equilibrium between class features, means, and classifiers, leading to improved performance on various imbalanced datasets.

Abstract: Contrastive learning (CL) is a predominant technique in image classification, but it shows limited performance on imbalanced datasets. Recently, several supervised CL methods have been proposed to promote an ideal regular simplex geometric configuration in the representation space, characterized by intra-class feature collapse and uniform inter-class mean spacing, especially for imbalanced datasets. In particular, existing prototype-based methods include class prototypes as additional samples so that all classes are considered. However, the existing CL methods suffer from two limitations. First, they do not consider the alignment between the class means/prototypes and classifiers, which could lead to poor generalization. Second, existing prototype-based methods treat prototypes as only one additional sample per class, making their influence depend on the number of class instances in a batch and causing unbalanced contributions across classes. To address these limitations, we propose Equilibrium Contrastive Learning (ECL), a supervised CL framework designed to promote geometric equilibrium, where class features, means, and classifiers are harmoniously balanced under data imbalance. The proposed ECL framework uses two main components. First, ECL promotes representation geometric equilibrium (i.e., a regular simplex geometry characterized by collapsed class samples and uniformly distributed class means) while balancing the contributions of class-average features and class prototypes. Second, ECL establishes a classifier-class center geometric equilibrium by aligning classifier weights and class prototypes. We ran experiments on three long-tailed datasets (CIFAR-10-LT, CIFAR-100-LT, and ImageNet-LT) and two imbalanced medical datasets (the ISIC 2019 dataset and our constructed LCCT dataset). Results show that ECL outperforms existing SOTA supervised CL methods designed for imbalanced classification.

[142] Federated Learning for Surgical Vision in Appendicitis Classification: Results of the FedSurg EndoVis 2024 Challenge

Max Kirchner, Hanna Hoffmann, Alexander C. Jenke, Oliver L. Saldanha, Kevin Pfeiffer, Weam Kanjo, Julia Alekseenko, Claas de Boer, Santhi Raj Kolamuri, Lorenzo Mazza, Nicolas Padoy, Sophia Bano, Annika Reinke, Lena Maier-Hein, Danail Stoyanov, Jakob N. Kather, Fiona R. Kolbinger, Sebastian Bodenstedt, Stefanie Speidel

Main category: cs.CV

TL;DR: FedSurg Challenge benchmarks federated learning for surgical video classification using Appendix300 dataset, evaluating generalization to unseen centers and adaptation via fine-tuning, with ViViT-based approach performing best.

DetailsMotivation: To establish a benchmark for federated learning in surgical video classification, assessing how well methods generalize to unseen clinical centers and adapt through local fine-tuning while preserving patient privacy through collaborative model development without data sharing.

Method: Participants developed strategies to classify inflammation stages in appendicitis using Appendix300 video dataset. Evaluated two tasks: generalization to unseen center and center-specific adaptation after fine-tuning. Approaches included foundation models with linear probing, metric learning with triplet loss, and FL aggregation schemes (FedAvg, FedMedian, FedSAM). Performance assessed using F1-score and Expected Cost.
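
FedAvg and FedMedian are standard aggregation rules, so their textbook forms can be sketched directly; the state-dict handling below is illustrative, not the challenge's reference implementation.

```python
import torch

def fed_avg(client_states, client_sizes):
    """FedAvg: dataset-size-weighted mean of client model weights."""
    total = sum(client_sizes)
    agg = {}
    for key in client_states[0]:
        agg[key] = sum(
            s[key] * (n / total) for s, n in zip(client_states, client_sizes)
        )
    return agg

def fed_median(client_states):
    """FedMedian: coordinate-wise median, more robust to outlier clients."""
    agg = {}
    for key in client_states[0]:
        stacked = torch.stack([s[key].float() for s in client_states])
        agg[key] = stacked.median(dim=0).values
    return agg
```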

Result: In the generalization task, performance across centers was limited. In the adaptation task, all teams improved after fine-tuning, though ranking stability was low. The ViViT-based submission achieved the strongest overall performance. The challenge highlighted limitations in generalization, sensitivity to class imbalance, and difficulties in hyperparameter tuning in decentralized training.

Conclusion: The FedSurg Challenge establishes the first benchmark for evaluating FL strategies in surgical video classification. The findings highlight the trade-off between local personalization and global robustness, and underscore the importance of architecture choice, preprocessing, and loss design. The challenge provides a reference point for future development of imbalance-aware, adaptive, and robust FL methods in clinical surgical AI.

Abstract: Purpose: The FedSurg challenge was designed to benchmark the state of the art in federated learning for surgical video classification. Its goal was to assess how well current methods generalize to unseen clinical centers and adapt through local fine-tuning while enabling collaborative model development without sharing patient data. Methods: Participants developed strategies to classify inflammation stages in appendicitis using a preliminary version of the multi-center Appendix300 video dataset. The challenge evaluated two tasks: generalization to an unseen center and center-specific adaptation after fine-tuning. Submitted approaches included foundation models with linear probing, metric learning with triplet loss, and various FL aggregation schemes (FedAvg, FedMedian, FedSAM). Performance was assessed using F1-score and Expected Cost, with ranking robustness evaluated via bootstrapping and statistical testing. Results: In the generalization task, performance across centers was limited. In the adaptation task, all teams improved after fine-tuning, though ranking stability was low. The ViViT-based submission achieved the strongest overall performance. The challenge highlighted limitations in generalization, sensitivity to class imbalance, and difficulties in hyperparameter tuning in decentralized training, while spatiotemporal modeling and context-aware preprocessing emerged as promising strategies. Conclusion: The FedSurg Challenge establishes the first benchmark for evaluating FL strategies in surgical video classification. Findings highlight the trade-off between local personalization and global robustness, and underscore the importance of architecture choice, preprocessing, and loss design. This benchmarking offers a reference point for future development of imbalance-aware, adaptive, and robust FL methods in clinical surgical AI.

[143] Robust Depth Super-Resolution via Adaptive Diffusion Sampling

Kun Wang, Yun Zhu, Pan Zhou, Na Zhao

Main category: cs.CV

TL;DR: AdaDS is a diffusion-based framework for depth super-resolution that robustly handles arbitrary degradations by adaptively selecting diffusion timesteps based on uncertainty estimation.

DetailsMotivation: Conventional depth super-resolution methods struggle with severe or unknown degradations and often produce artifacts. There's a need for a robust approach that can handle diverse degradation patterns in real-world scenarios.

Method: AdaDS leverages the contraction property of Gaussian smoothing in diffusion models. It adaptively selects starting timesteps in the reverse diffusion trajectory based on estimated refinement uncertainty, then injects tailored noise to position samples within the high-probability region of the target posterior distribution.
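
The adaptive-start idea maps naturally onto the standard DDPM forward marginal. The sketch below assumes a linear mapping from mean uncertainty to a starting timestep; that mapping and the timestep bounds are illustrative choices, not the paper's calibration.

```python
import torch

def adaptive_start(coarse_depth, uncertainty, alphas_cumprod,
                   t_min=100, t_max=600):
    """Sketch of uncertainty-based timestep selection and noise injection.

    Higher estimated uncertainty -> start deeper in the reverse
    trajectory, letting the diffusion prior dominate recovery.
    """
    u = uncertainty.mean().clamp(0.0, 1.0)
    t = int(t_min + u * (t_max - t_min))      # assumed linear schedule
    a_bar = alphas_cumprod[t]
    noise = torch.randn_like(coarse_depth)
    # Standard DDPM forward marginal q(x_t | x_0).
    x_t = a_bar.sqrt() * coarse_depth + (1 - a_bar).sqrt() * noise
    return x_t, t
```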

Result: Extensive experiments on real-world and synthetic benchmarks show AdaDS achieves superior zero-shot generalization and resilience to diverse degradation patterns compared to state-of-the-art methods.

Conclusion: AdaDS provides a robust framework for depth super-resolution that can handle arbitrary degradations by leveraging diffusion model priors and adaptive uncertainty-based timestep selection.

Abstract: We propose AdaDS, a generalizable framework for depth super-resolution that robustly recovers high-resolution depth maps from arbitrarily degraded low-resolution inputs. Unlike conventional approaches that directly regress depth values and often exhibit artifacts under severe or unknown degradation, AdaDS capitalizes on the contraction property of Gaussian smoothing: as noise accumulates in the forward process, distributional discrepancies between degraded inputs and their pristine high-quality counterparts diminish, ultimately converging to an isotropic Gaussian prior. Leveraging this, AdaDS adaptively selects a starting timestep in the reverse diffusion trajectory based on estimated refinement uncertainty, and subsequently injects tailored noise to position the intermediate sample within the high-probability region of the target posterior distribution. This strategy ensures inherent robustness, enabling the generative prior of a pre-trained diffusion model to dominate recovery even when upstream estimations are imperfect. Extensive experiments on real-world and synthetic benchmarks demonstrate AdaDS’s superior zero-shot generalization and resilience to diverse degradation patterns compared to state-of-the-art methods.

[144] Energy-Efficient Fast Object Detection on Edge Devices for IoT Systems

Mas Nurul Achmadiah, Afaroj Ahamad, Chi-Chia Sun, Wen-Kai Kuo

Main category: cs.CV

TL;DR: A lightweight IoT object detection system using frame difference method achieves high accuracy and energy efficiency on edge devices for fast-moving objects like trains and airplanes.

DetailsMotivation: IoT systems need energy-efficient fast object detection, but end-to-end methods are inefficient for fast-moving objects. The paper aims to develop a lightweight solution suitable for edge devices.

Method: Uses frame difference method for fast object detection, implemented on three edge devices (AMD Alveo U50, Jetson Orin Nano, Hailo-8 AI Accelerator) with four models including MobileNet and YOLOX. Focuses on detecting fast-moving objects like birds, cars, trains, and airplanes.
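
Frame differencing itself is a classic technique, so a standard OpenCV version can be sketched; the threshold and dilation settings below are illustrative, not the paper's tuned values.

```python
import cv2

def moving_object_proposals(prev_gray, curr_gray, thresh=25):
    """Classic frame differencing: cheap motion proposals for a classifier.

    Only regions that changed between consecutive frames are forwarded
    to the (MobileNet-style) classifier, avoiding full-frame detection.
    """
    diff = cv2.absdiff(prev_gray, curr_gray)
    _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
    mask = cv2.dilate(mask, None, iterations=2)       # merge nearby blobs
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    # Bounding boxes of moving regions, to be cropped and classified.
    return [cv2.boundingRect(c) for c in contours]
```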

Result: MobileNet with frame difference method achieves high accuracy, low latency, and high energy efficiency. YOLOX performs worst. Proposed method improves average accuracy by 28.314%, efficiency by 3.6x, and reduces latency by 39.305% compared to end-to-end methods. Fast objects (trains, airplanes) have lower accuracy.

Conclusion: The frame difference method is superior for fast object detection in IoT systems, offering a lightweight, energy-efficient solution. End-to-end methods fail for fast-moving objects. MobileNet is recommended for such applications.

Abstract: This paper presents an Internet of Things (IoT) application that utilizes an AI classifier for fast-object detection using the frame difference method. With its shorter processing time, this method is more efficient than end-to-end approaches and well suited for fast-object detection in IoT systems, which require energy-efficient applications. We have implemented this technique on three edge devices (AMD Alveo U50, Jetson Orin Nano, and Hailo-8 AI Accelerator) and four models based on artificial neural networks and transformers. We examined various classes, including birds, cars, trains, and airplanes. Using the frame difference method, the MobileNet model consistently achieves high accuracy, low latency, and high energy efficiency. YOLOX consistently shows the lowest accuracy, lowest latency, and lowest efficiency. The experimental results show that the proposed algorithm improves the average accuracy by 28.314% and the average efficiency by 3.6 times, and reduces the average latency by 39.305% compared to the end-to-end method. Among these classes, the fastest objects are trains and airplanes, and experiments show that their accuracy is lower than that of other categories. Thus, in tasks that require fast detection and accurate results, end-to-end methods fall short because they cannot handle fast-moving objects. To improve computational efficiency, we designed our proposed method as a lightweight detection algorithm, well suited for IoT systems that require detection of fast-moving objects with high accuracy.

[145] A Universal Action Space for General Behavior Analysis

Hung-Shuo Chang, Yue-Cheng Yang, Yu-Hsi Chen, Wei-Hsin Chen, Chien-Yao Wang, James C. Liao, Chien-Chang Chen, Hen-Hsen Huang, Hong-Yuan Mark Liao

Main category: cs.CV

TL;DR: The paper proposes building a Universal Action Space (UAS) from existing human-action datasets to analyze and categorize mammalian and chimpanzee behavior, leveraging deep learning advancements post-ImageNet.

DetailsMotivation: Animal and human behavior analysis has been challenging in computer vision, with early approaches relying on hand-crafted features and sparse tracking that lacked robustness. The ImageNet breakthrough enabled large-scale visual recognition through deep learning, creating an opportunity to build comprehensive action representations for behavior analysis.

Method: The authors build a large-scale Universal Action Space (UAS) using existing labeled human-action datasets, then use this UAS as a foundation for analyzing and categorizing mammalian and chimpanzee behavior datasets.

Result: The approach enables behavior analysis beyond traditional methods, with source code released on GitHub for the Universal Action Space implementation.

Conclusion: The paper demonstrates that leveraging deep learning-based action representations from human datasets can effectively analyze animal behavior, bridging the gap between human and animal behavior analysis in computer vision.

Abstract: Analyzing animal and human behavior has long been a challenging task in computer vision. Early approaches from the 1970s to the 1990s relied on hand-crafted edge detection, segmentation, and low-level features such as color, shape, and texture to locate objects and infer their identities, an inherently ill-posed problem. Behavior analysis in this era typically proceeded by tracking identified objects over time and modeling their trajectories using sparse feature points, which further limited robustness and generalization. A major shift occurred with the introduction of ImageNet by Deng and Li in 2010, which enabled large-scale visual recognition through deep neural networks and effectively served as a comprehensive visual dictionary. This development allowed object recognition to move beyond complex low-level processing toward learned high-level representations. In this work, we follow this paradigm to build a large-scale Universal Action Space (UAS) using existing labeled human-action datasets. We then use this UAS as the foundation for analyzing and categorizing mammalian and chimpanzee behavior datasets. The source code is released on GitHub at https://github.com/franktpmvu/Universal-Action-Space.

[146] Attention to details, logits to truth: visual-aware attention and logits enhancement to mitigate hallucinations in LVLMs

Jingyi Wang, Fei Li, Rujie Liu

Main category: cs.CV

TL;DR: Training-free attentional intervention algorithm for LVLMs that reduces hallucinations by enhancing attention to task-relevant visual tokens using visual-textual similarity and beam search modification.

DetailsMotivation: Existing Large Vision-Language Models suffer from insufficient visual attention leading to hallucinations. Current methods that boost attention for all visual tokens increase attention to irrelevant tokens, creating a need for more selective attention enhancement.

Method: Proposes a training-free algorithm that: 1) Uses vision-text cross-attention submatrices (representing visual-textual correlations) to construct reweighting matrices for attention reallocation, 2) Injects visual attention values into beam search decoding to prioritize solutions with higher visual attention.
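
The summary describes reweighting matrices built from text-to-vision attention, but the exact construction is unspecified. A minimal sketch, assuming one layer's attention tensor, index tensors for the two token groups, and an assumed mixing weight `alpha`:

```python
import torch

def reweight_visual_attention(attn, vis_idx, txt_idx, alpha=0.5):
    """Sketch of similarity-guided attention reallocation.

    attn: (heads, L, L) attention of one layer; vis_idx / txt_idx are
    LongTensors indexing the visual and instruction tokens. The
    text->vision sub-block acts as a proxy for task relevance.
    """
    relevance = attn[:, txt_idx][:, :, vis_idx].mean(dim=1)        # (heads, V)
    weights = relevance / relevance.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    attn = attn.clone()
    attn[:, :, vis_idx] *= (1 + alpha * weights)[:, None, :]       # boost relevant
    return attn / attn.sum(dim=-1, keepdim=True)                   # re-normalize rows
```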

Result: Extensive experiments show the method significantly reduces hallucinations across mainstream LVLMs while preserving accuracy and coherence of generated content.

Conclusion: The proposed training-free attentional intervention effectively addresses hallucination issues in LVLMs by selectively enhancing attention to task-relevant visual tokens based on visual-textual similarity.

Abstract: Existing Large Vision-Language Models (LVLMs) exhibit insufficient visual attention, leading to hallucinations. To alleviate this problem, some previous studies adjust and amplify visual attention. These methods share a limitation: boosting attention for all visual tokens inevitably increases attention to task-irrelevant tokens. To tackle this challenge, we propose a training-free attentional intervention algorithm to enhance the attention of task-relevant tokens, based on the observation that task-relevant tokens generally exhibit high visual-textual similarities. Specifically, the vision-text cross-attention submatrices, which represent visual-textual correlations, are extracted to construct reweighting matrices that reallocate attention. In addition, to enhance the contribution of visual tokens, we inject visual attention values into beam search decoding to identify solutions with higher visual attention. Extensive experiments demonstrate that this method significantly reduces hallucinations across mainstream LVLMs while preserving the accuracy and coherence of generated content.

[147] Singpath-VL Technical Report

Zhen Qiu, Kaiwen Xiao, Zhengwei Lu, Xiangyu Liu, Lei Zhao, Hao Zhang

Main category: cs.CV

TL;DR: Singpath-VL is a vision-language model for cervical cytology that uses synthetic data generation and fine-tuning to achieve superior performance in cell morphology analysis and diagnostic classification.

DetailsMotivation: Despite advances in multimodal LLMs for computational pathology, their application in cervical cytopathology remains limited due to scarcity of large-scale, high-quality annotated datasets for cell morphology analysis.

Method: Three-stage pipeline: 1) Use general-purpose MLLMs as weak annotators, 2) Refine outputs through consensus fusion and expert knowledge injection, 3) Generate million-scale synthetic image-description dataset. Then fine-tune Qwen3-VL-4B model via multi-stage strategy to create specialized cytopathology MLLM.

Result: Singpath-VL demonstrates superior performance in fine-grained morphological perception and cell-level diagnostic classification. The authors will open-source a portion of the synthetic dataset and benchmark.

Conclusion: The work successfully addresses the data scarcity problem in cervical cytology through synthetic data generation and creates a specialized vision-language model that advances AI assistance in cytopathology.

Abstract: We present Singpath-VL, a vision-language large model that fills the gap in AI assistance for cervical cytology. Recent advances in multi-modal large language models (MLLMs) have significantly propelled the field of computational pathology. However, their application in cytopathology, particularly cervical cytology, remains underexplored, primarily due to the scarcity of large-scale, high-quality annotated datasets. To bridge this gap, we first develop a novel three-stage pipeline to synthesize a million-scale image-description dataset. The pipeline leverages multiple general-purpose MLLMs as weak annotators, refines their outputs through consensus fusion and expert knowledge injection, and produces high-fidelity descriptions of cell morphology. Using this dataset, we then fine-tune the Qwen3-VL-4B model via a multi-stage strategy to create a specialized cytopathology MLLM. The resulting model, named Singpath-VL, demonstrates superior performance in fine-grained morphological perception and cell-level diagnostic classification. To advance the field, we will open-source a portion of the synthetic dataset and benchmark.

[148] HLGFA: High-Low Resolution Guided Feature Alignment for Unsupervised Anomaly Detection

Han Zhou, Yuxuan Gao, Yinchao Du, Xuezhe Zheng

Main category: cs.CV

TL;DR: HLGFA: A high-low resolution guided feature alignment framework for unsupervised industrial anomaly detection that learns normality by modeling cross-resolution feature consistency instead of pixel-level reconstruction.

DetailsMotivation: Unsupervised industrial anomaly detection is crucial for manufacturing inspection where defect samples are scarce. Existing methods often rely on pixel-level reconstruction which can be suboptimal. The paper aims to develop a more effective approach by learning normality through cross-resolution feature consistency.

Method: Proposes HLGFA framework that processes dual-resolution inputs through a shared frozen backbone to extract multi-level features. High-resolution representations are decomposed into structure and detail priors to guide refinement of low-resolution features via conditional modulation and gated residual correction. Includes noise-aware data augmentation to suppress nuisance-induced responses.
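
Conditional modulation with a gated residual is a common pattern, so one plausible reading can be sketched; the layer sizes and the assumption that the high-resolution prior has already been projected to the same shape as the low-resolution feature map are mine, not the paper's.

```python
import torch
import torch.nn as nn

class GuidedRefine(nn.Module):
    """Sketch of conditional modulation plus gated residual correction.

    High-resolution structure/detail priors modulate low-resolution
    features FiLM-style, and a learned gate controls how much of the
    correction is applied.
    """
    def __init__(self, dim):
        super().__init__()
        self.to_scale_shift = nn.Conv2d(dim, 2 * dim, 1)   # from HR prior
        self.gate = nn.Sequential(nn.Conv2d(2 * dim, dim, 1), nn.Sigmoid())
        self.correct = nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, lr_feat, hr_prior):
        # Assumes hr_prior has been resized/projected to lr_feat's shape.
        scale, shift = self.to_scale_shift(hr_prior).chunk(2, dim=1)
        modulated = lr_feat * (1 + scale) + shift          # conditional modulation
        g = self.gate(torch.cat([lr_feat, hr_prior], dim=1))
        return lr_feat + g * self.correct(modulated)       # gated residual
```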

Result: Achieves state-of-the-art performance: 97.9% pixel-level AUROC and 97.5% image-level AUROC on MVTec AD dataset, outperforming representative reconstruction-based and feature-based methods.

Conclusion: HLGFA provides an effective alternative to reconstruction-based methods by learning normality through cross-resolution feature alignment, demonstrating strong performance for industrial anomaly detection with practical noise-aware augmentation.

Abstract: Unsupervised industrial anomaly detection (UAD) is essential for modern manufacturing inspection, where defect samples are scarce and reliable detection is required. In this paper, we propose HLGFA, a high-low resolution guided feature alignment framework that learns normality by modeling cross-resolution feature consistency between high-resolution and low-resolution representations of normal samples, instead of relying on pixel-level reconstruction. Dual-resolution inputs are processed by a shared frozen backbone to extract multi-level features, and high-resolution representations are decomposed into structure and detail priors to guide the refinement of low-resolution features through conditional modulation and gated residual correction. During inference, anomalies are naturally identified as regions where cross-resolution alignment breaks down. In addition, a noise-aware data augmentation strategy is introduced to suppress nuisance-induced responses commonly observed in industrial environments. Extensive experiments on standard benchmarks demonstrate the effectiveness of HLGFA, achieving 97.9% pixel-level AUROC and 97.5% image-level AUROC on the MVTec AD dataset, outperforming representative reconstruction-based and feature-based methods.

[149] SchröMind: Mitigating Hallucinations in Multimodal Large Language Models via Solving the Schrödinger Bridge Problem

Ziqiang Shi, Rujie Liu, Shanshan Yu, Satoshi Munakata, Koichi Shirahata

Main category: cs.CV

TL;DR: SchröMind is a novel framework that reduces hallucinations in multimodal LLMs by solving the Schrödinger bridge problem to map hallucinatory activations to truthful ones with minimal transport cost.

DetailsMotivation: MLLMs have limited use in high-stakes fields like healthcare due to persistent hallucinations where generated text contradicts visual input. While MLLMs can comprehend images, they struggle to produce accurate token sequences, and minor perturbations can shift attention from truthful to untruthful states.

Method: Proposes SchröMind framework that reduces hallucinations via solving the Schrödinger bridge problem. It establishes token-level mapping between hallucinatory and truthful activations with minimal transport cost through lightweight training while preserving original model capabilities.
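
The static Schrödinger bridge between two empirical activation sets reduces to entropic optimal transport, which the classic Sinkhorn iteration solves. A sketch under that reduction follows; the paper instead trains a lightweight mapping, so the barycentric projection here is illustrative.

```python
import torch

def sinkhorn_plan(cost, eps=0.05, iters=200):
    """Entropic OT between two activation sets; the static Schrödinger
    bridge reduces to this discrete problem. cost: (n, m) pairwise costs."""
    n, m = cost.shape
    a = torch.full((n,), 1.0 / n)              # source marginal
    b = torch.full((m,), 1.0 / m)              # target marginal
    K = torch.exp(-cost / eps)                 # Gibbs kernel
    u, v = torch.ones(n), torch.ones(m)
    for _ in range(iters):
        u = a / (K @ v).clamp(min=1e-12)
        v = b / (K.t() @ u).clamp(min=1e-12)
    return u[:, None] * K * v[None, :]         # plan P = diag(u) K diag(v)

def map_to_truthful(hall_acts, truth_acts, eps=0.05):
    """Barycentric projection: move each hallucinatory activation toward
    the truthful set according to the transport plan (illustrative)."""
    cost = torch.cdist(hall_acts, truth_acts) ** 2
    cost = cost / cost.max().clamp(min=1e-12)  # scale for numerical stability
    P = sinkhorn_plan(cost, eps)
    return (P / P.sum(dim=1, keepdim=True)) @ truth_acts
```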

Result: Extensive experiments on POPE and MME benchmarks demonstrate state-of-the-art performance with minimal computational overhead.

Conclusion: SchröMind effectively addresses hallucination issues in MLLMs through principled mathematical formulation, enabling more reliable deployment in high-stakes applications.

Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have achieved significant success across various domains. However, their use in high-stakes fields like healthcare remains limited due to persistent hallucinations, where generated text contradicts or ignores visual input. We contend that MLLMs can comprehend images but struggle to produce accurate token sequences. Minor perturbations can shift attention from truthful to untruthful states, and the autoregressive nature of text generation often prevents error correction. To address this, we propose SchröMind, a novel framework that reduces hallucinations by solving the Schrödinger bridge problem. It establishes a token-level mapping between hallucinatory and truthful activations with minimal transport cost through lightweight training, while preserving the model’s original capabilities. Extensive experiments on the POPE and MME benchmarks demonstrate the superiority of SchröMind, which achieves state-of-the-art performance while introducing only minimal computational overhead.

[150] SCA-Net: Spatial-Contextual Aggregation Network for Enhanced Small Building and Road Change Detection

Emad Gholibeigi, Abbas Koochari, Azadeh ZamaniFar

Main category: cs.CV

TL;DR: SCA-Net is an enhanced architecture for building and road change detection in bi-temporal remote sensing images, featuring multi-scale analysis, attention mechanisms, and improved training strategies that outperform existing methods.

DetailsMotivation: Current deep learning models for remote sensing change detection struggle with low sensitivity to small objects and high computational costs, limiting practical applications in urban management, environmental monitoring, and disaster assessment.

Method: Enhanced Change-Agent framework with Difference Pyramid Block for multi-scale change analysis, Adaptive Multi-scale Processing module combining shape-aware and high-resolution enhancement blocks, multi-level attention mechanisms (PPM and CSAGate), dynamic composite loss function, and four-phase training strategy.

Result: Achieves 2.64% improvement in mIoU on LEVIR-MCI, 57.9% increase in IoU for small buildings, and reduces training time by 61% compared to Change-Agent and other state-of-the-art methods.

Conclusion: SCA-Net provides an efficient, accurate, and robust solution for practical change detection applications in remote sensing imagery.

Abstract: Automated change detection in remote sensing imagery is critical for urban management, environmental monitoring, and disaster assessment. While deep learning models have advanced this field, they often struggle with challenges like low sensitivity to small objects and high computational costs. This paper presents SCA-Net, an enhanced architecture built upon the Change-Agent framework for precise building and road change detection in bi-temporal images. Our model incorporates several key innovations: a novel Difference Pyramid Block for multi-scale change analysis, an Adaptive Multi-scale Processing module combining shape-aware and high-resolution enhancement blocks, and multi-level attention mechanisms (PPM and CSAGate) for joint contextual and detail processing. Furthermore, a dynamic composite loss function and a four-phase training strategy are introduced to stabilize training and accelerate convergence. Comprehensive evaluations on the LEVIR-CD and LEVIR-MCI datasets demonstrate SCA-Net’s superior performance over Change-Agent and other state-of-the-art methods. Our approach achieves a significant 2.64% improvement in mean Intersection over Union (mIoU) on LEVIR-MCI and a remarkable 57.9% increase in IoU for small buildings, while reducing the training time by 61%. This work provides an efficient, accurate, and robust solution for practical change detection applications.

[151] DR.Experts: Differential Refinement of Distortion-Aware Experts for Blind Image Quality Assessment

Bohan Fu, Guanyi Qin, Fazhan Zhang, Zihao Huang, Mingxuan Li, Runze Hu

Main category: cs.CV

TL;DR: DR.Experts is a novel blind image quality assessment framework that incorporates distortion priors through a degradation-aware vision-language model and dynamic weighting mechanisms to better align with human perception.

DetailsMotivation: Existing blind image quality assessment models fail to effectively capture subtle distortion cues, leading to misalignment with human subjective judgments due to lack of reliable distortion priors and shallow feature-quality relationships.

Method: Uses degradation-aware vision-language model for distortion-specific priors, Distortion-Saliency Differential Module to refine priors by distinguishing from semantic attention, and Dynamic Distortion Weighting Module (mixture-of-experts style) to weight distortion-specific features based on perceptual impact.
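
A mixture-of-experts-style weighting over distortion-specific features can be sketched in a few lines; the gate form and dimensions below are assumptions, not the paper's module definition.

```python
import torch
import torch.nn as nn

class DynamicDistortionWeighting(nn.Module):
    """Sketch of MoE-style fusion: each distortion-specific feature gets
    a learned weight before pooling into a single quality score."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Linear(dim, 1)          # per-expert relevance score
        self.head = nn.Linear(dim, 1)          # quality regression head

    def forward(self, expert_feats):           # (B, n_experts, dim)
        w = torch.softmax(self.gate(expert_feats).squeeze(-1), dim=-1)
        fused = (w.unsqueeze(-1) * expert_feats).sum(dim=1)
        return self.head(fused).squeeze(-1)    # (B,) predicted quality
```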

Result: Extensive experiments on five challenging BIQA benchmarks demonstrate superiority over current methods, with excellent generalization and data efficiency.

Conclusion: DR.Experts effectively addresses the distortion prior limitation in BIQA by explicitly incorporating and weighting distortion-specific features, achieving better alignment with human perception.

Abstract: Blind Image Quality Assessment (BIQA), aiming to replicate human perception of visual quality without reference, plays a key role in vision tasks, yet existing models often fail to effectively capture subtle distortion cues, leading to a misalignment with human subjective judgments. We identify that the root cause of this limitation lies in the lack of reliable distortion priors: methods typically learn shallow relationships between unified image features and quality scores, leaving them insensitive to distortions and thus limiting their performance. To address this, we introduce DR.Experts, a novel prior-driven BIQA framework designed to explicitly incorporate distortion priors, enabling reliable quality assessment. DR.Experts begins by leveraging a degradation-aware vision-language model to obtain distortion-specific priors, which are further refined and enhanced by the proposed Distortion-Saliency Differential Module through distinguishing them from semantic attention, thereby ensuring genuine representations of distortions. The refined priors, along with semantics and a bridging representation, are then fused by a proposed mixture-of-experts-style module named the Dynamic Distortion Weighting Module. This mechanism weights each distortion-specific feature according to its perceptual impact, ensuring that the final quality prediction aligns with human perception. Extensive experiments conducted on five challenging BIQA benchmarks demonstrate the superiority of DR.Experts over current methods and showcase its excellence in terms of generalization and data efficiency.

[152] RAD: Retrieval-Augmented Monocular Metric Depth Estimation for Underrepresented Classes

Michael Baltaxe, Dan Levi, Sagie Benaim

Main category: cs.CV

TL;DR: RAD is a retrieval-augmented framework for monocular metric depth estimation that uses retrieved RGB-D neighbors as geometric proxies to improve accuracy for underrepresented classes in complex scenes.

DetailsMotivation: Accurate depth estimation for underrepresented classes in complex scenes remains challenging for monocular metric depth estimation. Current methods struggle with classes that have limited training data, leading to poor performance in real-world applications requiring physical intelligence.

Method: Proposes RAD with uncertainty-aware retrieval to identify low-confidence regions, retrieves RGB-D context samples with semantically similar content, processes input and context via dual-stream network, and fuses them using matched cross-attention that transfers geometric information only at reliable point correspondences.
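
One way to read "transfer only at reliable correspondences" is to gate the transferred depth by attention confidence. A sketch under that reading, with an assumed threshold and simplified single-vector matching:

```python
import torch
import torch.nn.functional as F

def matched_depth_transfer(query_feat, ctx_feat, ctx_depth, conf_thresh=0.5):
    """Sketch of fusion that transfers depth cues only at reliable matches.

    query_feat: (N, D) input-image token features; ctx_feat: (M, D) and
    ctx_depth: (M,) retrieved RGB-D features and metric depths.
    Reliability is approximated by attention peakedness.
    """
    scale = query_feat.size(1) ** 0.5
    attn = F.softmax(query_feat @ ctx_feat.t() / scale, dim=-1)
    conf, idx = attn.max(dim=-1)               # best match per query token
    proxy = ctx_depth[idx]                     # transferred depth proxy
    mask = conf > conf_thresh                  # keep only reliable points
    return proxy, mask
```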

Result: Significantly outperforms state-of-the-art baselines on underrepresented classes: reduces relative absolute error by 29.2% on NYU Depth v2, 13.3% on KITTI, and 7.2% on Cityscapes, while maintaining competitive performance on standard in-domain benchmarks.

Conclusion: RAD effectively addresses the challenge of accurate depth estimation for underrepresented classes by leveraging retrieved geometric proxies, demonstrating the value of retrieval-augmented approaches for improving monocular depth estimation in complex scenes.

Abstract: Monocular Metric Depth Estimation (MMDE) is essential for physically intelligent systems, yet accurate depth estimation for underrepresented classes in complex scenes remains a persistent challenge. To address this, we propose RAD, a retrieval-augmented framework that approximates the benefits of multi-view stereo by utilizing retrieved neighbors as structural geometric proxies. Our method first employs an uncertainty-aware retrieval mechanism to identify low-confidence regions in the input and retrieve RGB-D context samples containing semantically similar content. We then process both the input and retrieved context via a dual-stream network and fuse them using a matched cross-attention module, which transfers geometric information only at reliable point correspondences. Evaluations on NYU Depth v2, KITTI, and Cityscapes demonstrate that RAD significantly outperforms state-of-the-art baselines on underrepresented classes, reducing relative absolute error by 29.2% on NYU Depth v2, 13.3% on KITTI, and 7.2% on Cityscapes, while maintaining competitive performance on standard in-domain benchmarks.

[153] AUHead: Realistic Emotional Talking Head Generation via Action Units Control

Jiayi Lyu, Leigang Qu, Wenjing Zhang, Hanyu Jiang, Kai Liu, Zhenglin Zhou, Xiaobo Xia, Jian Xue, Tat-Seng Chua

Main category: cs.CV

TL;DR: AUHead: A two-stage method for controllable talking-head video generation using Action Units (AUs) disentangled from audio via large audio-language models and AU-driven diffusion models.

DetailsMotivation: Current talking-head video generation methods lack fine-grained emotion control, struggling with nuanced emotional expressions. The paper aims to address this by introducing a method that disentangles Action Units (AUs) from audio for precise emotional control in video generation.

Method: Two-stage approach: 1) Uses large audio-language models with spatial-temporal AU tokenization and “emotion-then-AU” chain-of-thought to disentangle AUs from speech. 2) AU-driven controllable diffusion model that maps AU sequences to 2D facial representations and models AU-vision interactions via cross-attention, with AU disentanglement guidance for flexible quality control.

Result: Achieves competitive performance on benchmark datasets in emotional realism, accurate lip synchronization, and visual coherence, significantly surpassing existing techniques.

Conclusion: AUHead enables fine-grained emotion control in talking-head video generation by effectively disentangling Action Units from audio and using them to drive a controllable diffusion model, resulting in improved emotional expressiveness and identity consistency.

Abstract: Realistic talking-head video generation is critical for virtual avatars, film production, and interactive systems. Current methods struggle with nuanced emotional expressions due to the lack of fine-grained emotion control. To address this issue, we introduce a novel two-stage method (AUHead) to disentangle fine-grained emotion control, i.e., Action Units (AUs), from audio and achieve controllable generation. In the first stage, we explore the AU generation abilities of large audio-language models (ALMs) via spatial-temporal AU tokenization and an “emotion-then-AU” chain-of-thought mechanism. It aims to disentangle AUs from raw speech, effectively capturing subtle emotional cues. In the second stage, we propose an AU-driven controllable diffusion model that synthesizes realistic talking-head videos conditioned on AU sequences. Specifically, we first map the AU sequences into a structured 2D facial representation to enhance spatial fidelity, and then model the AU-vision interaction within cross-attention modules. To achieve flexible AU-quality trade-off control, we introduce an AU disentanglement guidance strategy during inference, further refining the emotional expressiveness and identity consistency of the generated videos. Results on benchmark datasets demonstrate that our approach achieves competitive performance in emotional realism, accurate lip synchronization, and visual coherence, significantly surpassing existing techniques. Our implementation is available at https://github.com/laura990501/AUHead_ICLR

[154] Scalpel: Fine-Grained Alignment of Attention Activation Manifolds via Mixture Gaussian Bridges to Mitigate Multimodal Hallucination

Ziqiang Shi, Rujie Liu, Shanshan Yu, Satoshi Munakata, Koichi Shirahata

Main category: cs.CV

TL;DR: Scalpel reduces hallucinations in large vision-language models by refining attention activations toward credible regions using Gaussian mixture modeling and optimal transport mapping.

DetailsMotivation: Large vision-language models often generate outputs inconsistent with visual content (hallucinations) due to strong language priors and misaligned attention across modalities.

Method: Predicts trusted attention directions for each Transformer head during inference, uses Gaussian mixture model to capture multi-peak distributions in trust/hallucination manifolds, and applies entropic optimal transport (Schrödinger bridge) to map components precisely. Dynamically adjusts intervention strength based on component membership.
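
The GMM side of the method is sketchable with scikit-learn; note that the naive nearest-mean component pairing below stands in for the paper's entropic-OT coupling, and the mean-shift intervention is an illustration, not the trained mapping.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def component_shift(hall_acts, trust_acts, n_components=4):
    """Sketch: fit GMMs to hallucination/trust activation manifolds,
    pair components, and shift each activation toward its paired
    trusted component mean."""
    g_h = GaussianMixture(n_components).fit(hall_acts)
    g_t = GaussianMixture(n_components).fit(trust_acts)
    # Nearest trust-component mean for each hallucination component
    # (stand-in for the paper's optimal-transport pairing).
    d = np.linalg.norm(g_h.means_[:, None] - g_t.means_[None], axis=-1)
    pairing = d.argmin(axis=1)
    comp = g_h.predict(hall_acts)              # component membership
    shift = g_t.means_[pairing[comp]] - g_h.means_[comp]
    return hall_acts + shift                   # move toward trusted means
```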

Result: Extensive experiments across multiple datasets and benchmarks show Scalpel effectively mitigates hallucinations, outperforming previous methods and achieving state-of-the-art performance.

Conclusion: Scalpel is a model- and data-agnostic approach that reduces hallucinations in LVLMs without additional computation, requiring only a single decoding step.

Abstract: Rapid progress in large vision-language models (LVLMs) has achieved unprecedented performance in vision-language tasks. However, due to the strong prior of large language models (LLMs) and misaligned attention across modalities, LVLMs often generate outputs inconsistent with visual content, termed hallucination. To address this, we propose Scalpel, a method that reduces hallucination by refining attention activation distributions toward more credible regions. Scalpel predicts trusted attention directions for each head in Transformer layers during inference and adjusts activations accordingly. It employs a Gaussian mixture model to capture multi-peak distributions of attention in trust and hallucination manifolds, and uses entropic optimal transport (equivalent to the Schrödinger bridge problem) to map Gaussian components precisely. During mitigation, Scalpel dynamically adjusts intervention strength and direction based on component membership and mapping relationships between hallucination and trust activations. Extensive experiments across multiple datasets and benchmarks demonstrate that Scalpel effectively mitigates hallucinations, outperforming previous methods and achieving state-of-the-art performance. Moreover, Scalpel is model- and data-agnostic, requiring no additional computation beyond a single decoding step.

[155] Delving into Spectral Clustering with Vision-Language Representations

Bo Peng, Yuanwei Hu, Bo Liu, Ling Chen, Jie Lu, Zhen Fang

Main category: cs.CV

TL;DR: Proposes Neural Tangent Kernel Spectral Clustering that leverages cross-modal alignment in pre-trained vision-language models for multi-modal spectral clustering, using positive nouns as anchors to formulate image affinity as a combination of visual proximity and semantic overlap.

DetailsMotivation: Most spectral clustering approaches use single-modal data, leaving multi-modal information untapped. The paper aims to extend spectral clustering from single-modal to multi-modal regime by leveraging recent advances in vision-language pre-training.

Method: Uses Neural Tangent Kernel (NTK) anchored with positive nouns semantically close to images. Formulates image affinity as coupling of visual proximity and semantic overlap. Includes regularized affinity diffusion mechanism that adaptively ensembles affinity matrices from different prompts.
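
The affinity coupling can be sketched directly from normalized vision-language embeddings; the softmax noun profile below is one plausible notion of "semantic overlap", and the paper's NTK formulation is richer than this sketch.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def vl_affinity(img_feats, noun_feats):
    """Affinity as a coupling of visual proximity and semantic overlap.

    img_feats: (N, D), noun_feats: (K, D), both L2-normalized embeddings
    from a vision-language model; noun_feats plays the role of the
    'positive noun' anchors.
    """
    visual = img_feats @ img_feats.T                        # cosine proximity
    sim = img_feats @ noun_feats.T
    prof = np.exp(sim) / np.exp(sim).sum(1, keepdims=True)  # soft noun profile
    overlap = prof @ prof.T                                 # shared-noun mass
    return np.clip(visual, 0, None) * overlap               # within-cluster boost

# Illustrative usage with random stand-in features:
rng = np.random.default_rng(0)
imgs = rng.normal(size=(64, 512))
imgs /= np.linalg.norm(imgs, axis=1, keepdims=True)
nouns = rng.normal(size=(20, 512))
nouns /= np.linalg.norm(nouns, axis=1, keepdims=True)
labels = SpectralClustering(n_clusters=4, affinity="precomputed").fit_predict(
    vl_affinity(imgs, nouns))
```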

Result: Extensive experiments on 16 benchmarks (classical, large-scale, fine-grained, domain-shifted datasets) show the method consistently outperforms state-of-the-art by a large margin.

Conclusion: The method successfully extends spectral clustering to multi-modal regime using vision-language models, demonstrating superior performance across diverse datasets by leveraging cross-modal alignment.

Abstract: Spectral clustering is known as a powerful technique in unsupervised data analysis. The vast majority of approaches to spectral clustering are driven by a single modality, leaving the rich information in multi-modal representations untapped. Inspired by the recent success of vision-language pre-training, this paper enriches the landscape of spectral clustering from a single-modal to a multi-modal regime. In particular, we propose Neural Tangent Kernel Spectral Clustering, which leverages cross-modal alignment in pre-trained vision-language models. By anchoring the neural tangent kernel with positive nouns, i.e., those semantically close to the images of interest, we arrive at formulating the affinity between images as a coupling of their visual proximity and semantic overlap. We show that this formulation amplifies within-cluster connections while suppressing spurious ones across clusters, hence encouraging block-diagonal structures. In addition, we present a regularized affinity diffusion mechanism that adaptively ensembles affinity matrices induced by different prompts. Extensive experiments on 16 benchmarks – including classical, large-scale, fine-grained and domain-shifted datasets – show that our method consistently outperforms the state-of-the-art by a large margin.

[156] MieDB-100k: A Comprehensive Dataset for Medical Image Editing

Yongfan Lai, Wen Qian, Bo Liu, Hongyan Li, Hao Luo, Fan Wang, Bohan Zhuang, Shenda Hong

Main category: cs.CV

TL;DR: MieDB-100k is a large-scale, high-quality medical image editing dataset addressing data scarcity in multimodal generative models for medical applications.

DetailsMotivation: The scarcity of high-quality data is a primary bottleneck for adapting multimodal generative models to medical image editing. Existing datasets have limited diversity, neglect medical image understanding, and struggle to balance quality with scalability.

Method: Created MieDB-100k dataset with categorization into Perception, Modification and Transformation tasks. Used a data curation pipeline leveraging modality-specific expert models and rule-based synthetic methods, followed by rigorous manual inspection for clinical fidelity.

Result: Models trained with MieDB-100k consistently outperform both open-source and proprietary models while exhibiting strong generalization ability.

Conclusion: MieDB-100k serves as a cornerstone for future advancements in specialized medical image editing, addressing critical data scarcity issues in medical multimodal generative models.

Abstract: The scarcity of high-quality data remains a primary bottleneck in adapting multimodal generative models for medical image editing. Existing medical image editing datasets often suffer from limited diversity, neglect of medical image understanding, and an inability to balance quality with scalability. To address these gaps, we propose MieDB-100k, a large-scale, high-quality, and diverse dataset for text-guided medical image editing. It categorizes editing tasks into the perspectives of Perception, Modification, and Transformation, considering both understanding and generation abilities. We construct MieDB-100k via a data curation pipeline leveraging both modality-specific expert models and rule-based data synthesis methods, followed by rigorous manual inspection to ensure clinical fidelity. Extensive experiments demonstrate that models trained with MieDB-100k consistently outperform both open-source and proprietary models while exhibiting strong generalization ability. We anticipate that this dataset will serve as a cornerstone for future advancements in specialized medical image editing.

[157] Hand2World: Autoregressive Egocentric Interaction Generation via Free-Space Hand Gestures

Yuxi Wang, Wenqi Ouyang, Tianyi Wei, Yi Dong, Zhiqi Shen, Xingang Pan

Main category: cs.CV

TL;DR: Hand2World: An autoregressive framework for generating photorealistic egocentric interaction videos from single scene images using 3D hand mesh conditioning and camera geometry injection for stable, long-term synthesis.

DetailsMotivation: Egocentric interactive world models are crucial for AR and embodied AI, requiring low-latency, geometrically consistent, and stable visual generation that responds to user input. The paper addresses challenges in generating egocentric interaction videos from single scene images under free-space hand gestures, including distribution shift between training data and real gestures, motion ambiguity in monocular views, and arbitrary-length video generation needs.

Method: Hand2World uses a unified autoregressive framework with occlusion-invariant hand conditioning based on projected 3D hand meshes, allowing visibility/occlusion inference from scene context. It injects explicit camera geometry via per-pixel Plücker-ray embeddings to disentangle camera motion from hand motion and prevent background drift. The approach includes an automated monocular annotation pipeline and distills a bidirectional diffusion model into a causal generator for arbitrary-length synthesis.
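
Per-pixel Plücker-ray embeddings have a standard closed form: each pixel's ray is encoded by its unit direction d and moment o × d, where o is the camera center. A self-contained sketch for a pinhole camera (the paper's exact conditioning pipeline may differ):

```python
import torch

def plucker_rays(K, c2w, H, W):
    """Per-pixel Plücker-ray embedding (d, o x d) for a pinhole camera.

    K: (3, 3) intrinsics; c2w: (4, 4) camera-to-world pose. Each pixel
    gets a 6-D code that fully determines its viewing ray, giving the
    generator explicit camera geometry independent of hand motion.
    """
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)
    dirs_cam = pix @ torch.linalg.inv(K).t()           # camera-frame rays
    dirs = dirs_cam @ c2w[:3, :3].t()                  # rotate to world frame
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)      # unit directions d
    origin = c2w[:3, 3].expand_as(dirs)                # camera center o
    moment = torch.cross(origin, dirs, dim=-1)         # moment o x d
    return torch.cat([dirs, moment], dim=-1)           # (H, W, 6)
```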

Result: Experiments on three egocentric interaction benchmarks show substantial improvements in perceptual quality and 3D consistency while supporting camera control and long-horizon interactive generation.

Conclusion: Hand2World successfully addresses key challenges in egocentric interaction generation, providing a framework that enables photorealistic, geometrically consistent, and stable video synthesis from single scene images with hand gesture control, supporting applications in AR and embodied AI.

Abstract: Egocentric interactive world models are essential for augmented reality and embodied AI, where visual generation must respond to user input with low latency, geometric consistency, and long-term stability. We study egocentric interaction generation from a single scene image under free-space hand gestures, aiming to synthesize photorealistic videos in which hands enter the scene, interact with objects, and induce plausible world dynamics under head motion. This setting introduces fundamental challenges, including distribution shift between free-space gestures and contact-heavy training data, ambiguity between hand motion and camera motion in monocular views, and the need for arbitrary-length video generation. We present Hand2World, a unified autoregressive framework that addresses these challenges through occlusion-invariant hand conditioning based on projected 3D hand meshes, allowing visibility and occlusion to be inferred from scene context rather than encoded in the control signal. To stabilize egocentric viewpoint changes, we inject explicit camera geometry via per-pixel Plücker-ray embeddings, disentangling camera motion from hand motion and preventing background drift. We further develop a fully automated monocular annotation pipeline and distill a bidirectional diffusion model into a causal generator, enabling arbitrary-length synthesis. Experiments on three egocentric interaction benchmarks show substantial improvements in perceptual quality and 3D consistency while supporting camera control and long-horizon interactive generation.

[158] Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing

Jialun Liu, Yukuo Ma, Xiao Cao, Tian Li, Gonghu Shang, Haibin Huang, Chi Zhang, Xuelong Li, Cong Liu, Junqi Liu, Jiakui Hu, Robby T. Tan, Shiwen Zhang, Liying Yang, Xiaoyan Yang, Qizhen Weng, Xiangzhen Chang, Yuanzhi Liang, Yifan Xu, Zhiyong Huang, Zuoxin Li, Xuelong Li

Main category: cs.CV

TL;DR: Tele-Omni: A unified multimodal framework for video generation and editing that processes text, image, and video inputs through MLLM parsing and diffusion-based synthesis.

DetailsMotivation: Existing video generation methods are task-specific, rely mainly on text instructions, and lack ability to handle multimodal inputs and diverse scenarios in a unified framework. Video editing methods often require specialized pipelines for each operation, limiting scalability.

Method: Uses pretrained multimodal large language models to parse heterogeneous instructions and infer structured intents, while diffusion-based generators perform video synthesis. Introduces task-aware data processing to unify multimodal inputs into structured instruction format while preserving task-specific constraints.

Result: Tele-Omni achieves competitive performance across multiple video tasks including text-to-video, image-to-video, first-last-frame video generation, in-context video generation, and in-context video editing.

Conclusion: By decoupling instruction parsing from video synthesis with task-aware data design, Tele-Omni enables flexible multimodal control while maintaining temporal coherence and visual consistency in a unified framework.

Abstract: Recent advances in diffusion-based video generation have substantially improved visual fidelity and temporal coherence. However, most existing approaches remain task-specific and rely primarily on textual instructions, limiting their ability to handle multimodal inputs, contextual references, and diverse video generation and editing scenarios within a unified framework. Moreover, many video editing methods depend on carefully engineered pipelines tailored to individual operations, which hinders scalability and composability. In this paper, we propose Tele-Omni, a unified multimodal framework for video generation and editing that follows multimodal instructions, including text, images, and reference videos, within a single model. Tele-Omni leverages pretrained multimodal large language models to parse heterogeneous instructions and infer structured generation or editing intents, while diffusion-based generators perform high-quality video synthesis conditioned on these structured signals. To enable joint training across heterogeneous video tasks, we introduce a task-aware data processing pipeline that unifies multimodal inputs into a structured instruction format while preserving task-specific constraints. Tele-Omni supports a wide range of video-centric tasks, including text-to-video generation, image-to-video generation, first-last-frame video generation, in-context video generation, and in-context video editing. By decoupling instruction parsing from video synthesis and combining it with task-aware data design, Tele-Omni achieves flexible multimodal control while maintaining strong temporal coherence and visual consistency. Experimental results demonstrate that Tele-Omni achieves competitive performance across multiple tasks.

[159] AGMark: Attention-Guided Dynamic Watermarking for Large Vision-Language Models

Yue Li, Xin Yi, Dongsheng Shi, Yongyi Cui, Gerard de Melo, Linlin Wang

Main category: cs.CV

TL;DR: AGMark is an attention-guided dynamic watermarking framework for Large Vision-Language Models that preserves visual fidelity while embedding detectable signals through adaptive token selection based on attention weights and uncertainty awareness.

DetailsMotivation: Current LVLM watermarking methods have limitations: vision-agnostic watermarks introduce visually irrelevant tokens and disrupt visual grounding, while vision-specific watermarks use static weight estimation and ignore distribution density, failing to account for dynamic visual dependence changes during generation and potentially introducing low-quality tokens.

Method: AGMark dynamically identifies semantic-critical evidence at each decoding step using attention weights for visual relevance and context-aware coherence cues. It determines the proportion of semantic-critical tokens by jointly considering uncertainty awareness (token entropy) and evidence calibration (weight density), enabling adaptive vocabulary partitioning to avoid irrelevant tokens.
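
A green-list watermark with attention-guided exemptions can be sketched as follows; the entropy gate and the protected fraction are illustrative stand-ins for the paper's joint entropy/weight-density calibration.

```python
import torch

def agmark_logits(logits, green_mask, relevance, entropy_gate=2.0, delta=2.0):
    """Sketch of attention-guided, entropy-aware watermark biasing.

    logits: (V,) next-token logits; green_mask: (V,) bool pseudo-random
    green list; relevance: (V,) visual/contextual evidence weights.
    When entropy is low (a confident, likely vision-grounded step), a
    larger share of high-relevance tokens is exempted from the bias.
    """
    probs = logits.softmax(-1)
    entropy = -(probs * probs.clamp(min=1e-12).log()).sum()
    # Low entropy -> protect more of the evidence mass (assumed gating rule).
    keep_frac = float((1.0 - entropy / entropy_gate).clamp(0.0, 0.5))
    k = max(1, int(keep_frac * relevance.numel()))
    protected = torch.zeros_like(green_mask)
    protected[relevance.topk(k).indices] = True
    # Bias only green tokens that are not semantic-critical.
    return logits + delta * (green_mask & ~protected).float()
```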

Result: AGMark outperforms conventional methods, improving generation quality with strong gains in visual semantic fidelity in later generation stages. It maintains high detection accuracy (≥99.36% AUC) and robust attack resilience (≥88.61% AUC) without sacrificing inference efficiency.

Conclusion: AGMark establishes a new standard for reliability-preserving multimodal watermarking by embedding detectable signals while strictly preserving visual fidelity through dynamic, attention-guided adaptation to visual dependence changes during generation.

Abstract: Watermarking has emerged as a pivotal solution for content traceability and intellectual property protection in Large Vision-Language Models (LVLMs). However, vision-agnostic watermarks may introduce visually irrelevant tokens and disrupt visual grounding by enforcing indiscriminate pseudo-random biases. Additionally, current vision-specific watermarks rely on a static, one-time estimation of vision-critical weights and ignore the weight distribution density when determining the proportion of protected tokens. This design fails to account for dynamic changes in visual dependence during generation and may introduce low-quality tokens in the long tail. To address these challenges, we propose Attention-Guided Dynamic Watermarking (AGMark), a novel framework that embeds detectable signals while strictly preserving visual fidelity. At each decoding step, AGMark first dynamically identifies semantic-critical evidence based on attention weights for visual relevance, together with context-aware coherence cues, resulting in a more adaptive and well-calibrated evidence-weight distribution. It then determines the proportion of semantic-critical tokens by jointly considering uncertainty awareness (token entropy) and evidence calibration (weight density), thereby enabling adaptive vocabulary partitioning to avoid irrelevant tokens. Empirical results confirm that AGMark outperforms conventional methods, measurably improving generation quality and yielding particularly strong gains in visual semantic fidelity in the later stages of generation. The framework maintains highly competitive detection accuracy (at least 99.36% AUC) and robust attack resilience (at least 88.61% AUC) without sacrificing inference efficiency, effectively establishing a new standard for reliability-preserving multi-modal watermarking.

[160] VideoAfford: Grounding 3D Affordance from Human-Object-Interaction Videos via Multimodal Large Language Model

Hanqing Wang, Mingyu Liu, Xiaoyu Chen, Chengwei MA, Yiming Zhong, Wenti Yin, Yuhao Liu, Zhiqing Cui, Jiahao Yuan, Lu Dai, Zhiyuan Ma, Hui Xiong

Main category: cs.CV

TL;DR: VideoAfford: A multimodal LLM-based approach for 3D affordance grounding using video data to capture dynamic interaction context, achieving state-of-the-art performance with strong generalization.

DetailsMotivation: Previous 3D affordance grounding methods rely on static cues (language/images) which lack dynamic interaction context revealing temporal and causal cues needed for robotic manipulation. There's a need for video-based approaches that capture how humans actually interact with objects.

Method: 1) Collect VIDA dataset with 38K human-object-interaction videos covering 16 affordance types; 2) Propose VideoAfford framework that activates multimodal LLMs with affordance segmentation capabilities; 3) Use latent action encoder to extract dynamic interaction priors from videos; 4) Introduce spatial-aware loss for comprehensive 3D spatial knowledge.

Result: Significantly outperforms established methods, exhibits strong open-world generalization with affordance reasoning abilities. The VIDA dataset contains 38K videos, 16 affordance types, 38 object categories, and 22K point clouds.

Conclusion: Video-based approaches provide crucial dynamic interaction context for 3D affordance grounding. The proposed VideoAfford framework successfully integrates multimodal LLMs with affordance segmentation, enabling both world knowledge reasoning and fine-grained grounding in a unified system.

Abstract: 3D affordance grounding aims to highlight the actionable regions on 3D objects, which is crucial for robotic manipulation. Previous research primarily focused on learning affordance knowledge from static cues such as language and images, which struggle to provide sufficient dynamic interaction context that can reveal temporal and causal cues. To alleviate this predicament, we collect a comprehensive video-based 3D affordance dataset, VIDA, which contains 38K human-object-interaction videos covering 16 affordance types, 38 object categories, and 22K point clouds. Based on VIDA, we propose a strong baseline: VideoAfford, which activates multimodal large language models with additional affordance segmentation capabilities, enabling both world knowledge reasoning and fine-grained affordance grounding within a unified framework. To enhance action understanding capability, we leverage a latent action encoder to extract dynamic interaction priors from HOI videos. Moreover, we introduce a spatial-aware loss function to enable VideoAfford to obtain comprehensive 3D spatial knowledge. Extensive experimental evaluations demonstrate that our model significantly outperforms well-established methods and exhibits strong open-world generalization with affordance reasoning abilities. All datasets and code will be publicly released to advance research in this area.
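
The paper does not spell out the form of the spatial-aware loss; the sketch below shows one plausible reading, a distance-weighted BCE over point-cloud affordance labels, with the weighting scheme entirely our assumption:

```python
import torch
import torch.nn.functional as F

def spatial_aware_loss(pred_logits, gt_labels, points, sigma=0.05):
    """
    pred_logits: (N,) per-point affordance logits
    gt_labels:   (N,) binary float ground-truth affordance mask
    points:      (N, 3) xyz coordinates of the object point cloud
    """
    pos = points[gt_labels > 0.5]
    if len(pos) == 0:
        return F.binary_cross_entropy_with_logits(pred_logits, gt_labels)
    # Distance from every point to the nearest ground-truth affordance point.
    d = torch.cdist(points, pos).min(dim=1).values
    # Down-weight false positives that sit close to the GT region (assumption).
    w = torch.where(gt_labels > 0.5,
                    torch.ones_like(d),
                    1.0 - 0.5 * torch.exp(-d ** 2 / sigma ** 2))
    per_point = F.binary_cross_entropy_with_logits(pred_logits, gt_labels, reduction="none")
    return (w * per_point).mean()
```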

[161] Time2General: Learning Spatiotemporal Invariant Representations for Domain-Generalization Video Semantic Segmentation

Siyu Chen, Ting Han, Haoling Huang, Chaolei Wang, Chengzheng Fu, Duxin Zhu, Guorong Cai, Jinhe Su

Main category: cs.CV

TL;DR: Time2General is a domain generalized video semantic segmentation framework that achieves temporally consistent predictions across unseen domains without target labels, addressing both domain shift and temporal-sampling shift issues.

DetailsMotivation: Domain Generalized Video Semantic Segmentation (DGVSS) faces challenges with both domain shift and temporal-sampling shift, which break correspondence-based propagation and fixed-stride temporal aggregation, causing severe frame-to-frame flicker even in label-stable regions.

Method: Proposes Time2General with Stability Queries, featuring a Spatio-Temporal Memory Decoder that aggregates multi-frame context into clip-level spatio-temporal memory and decodes temporally consistent per-frame masks without explicit correspondence propagation. Also introduces Masked Temporal Consistency Loss to regularize temporal prediction discrepancies across different strides and randomizes training strides.

Result: Extensive experiments on multiple driving benchmarks show substantial improvement in cross-domain accuracy and temporal stability over prior DGSS and VSS baselines while running at up to 18 FPS.

Conclusion: Time2General effectively addresses domain generalization and temporal consistency challenges in video semantic segmentation for driving scenarios, achieving robust performance across unseen domains with consistent predictions.

Abstract: In Domain Generalized Video Semantic Segmentation (DGVSS), a model is trained on a single labeled driving domain and deployed directly on unseen domains, without target labels or test-time adaptation, while maintaining temporally consistent predictions over video streams. In practice, both domain shift and temporal-sampling shift break correspondence-based propagation and fixed-stride temporal aggregation, causing severe frame-to-frame flicker even in label-stable regions. We propose Time2General, a DGVSS framework built on Stability Queries. Time2General introduces a Spatio-Temporal Memory Decoder that aggregates multi-frame context into a clip-level spatio-temporal memory and decodes temporally consistent per-frame masks without explicit correspondence propagation. To further suppress flicker and improve robustness to varying sampling rates, a Masked Temporal Consistency Loss regularizes temporal prediction discrepancies across different strides, and training strides are randomized to expose the model to diverse temporal gaps. Extensive experiments on multiple driving benchmarks show that Time2General achieves a substantial improvement in cross-domain accuracy and temporal stability over prior DGSS and VSS baselines while running at up to 18 FPS. Code will be released after the review process.
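
A minimal sketch of a masked temporal-consistency regularizer across randomized strides, in the spirit of the loss described above; the confidence-based mask and the choice of comparing the first shared frame are our assumptions:

```python
import random
import torch

def masked_temporal_consistency(model, clip, min_stride=1, max_stride=4, tau=0.9):
    """clip: (T, C, H, W) frames; model maps a frame stack to (T', K, H, W) logits."""
    s1, s2 = random.sample(range(min_stride, max_stride + 1), 2)
    p1 = torch.softmax(model(clip[::s1]), dim=1)
    p2 = torch.softmax(model(clip[::s2]), dim=1)
    q1, q2 = p1[0], p2[0]          # frame 0 is shared by both strided clips
    # Only regularize pixels where either prediction is confident, i.e.
    # regions expected to be label-stable across sampling rates (assumption).
    mask = ((q1.max(dim=0).values > tau) | (q2.max(dim=0).values > tau)).float()
    return (mask * (q1 - q2).abs().sum(dim=0)).sum() / (mask.sum() + 1e-6)
```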

[162] TreeCUA: Efficiently Scaling GUI Automation with Tree-Structured Verifiable Evolution

Deyang Jiang, Jing Huang, Xuanle Zhao, Lei Chen, Liming Zheng, Fanfan Liu, Haibo Qiu, Peng Shi, Zhixiong Zeng

Main category: cs.CV

TL;DR: TreeCUA: A framework for scaling GUI automation using tree-structured trajectories and multi-agent collaboration to improve GUI planning for computer-use agents.

DetailsMotivation: Existing GUI automation work focuses on GUI grounding rather than the more crucial GUI planning, which requires sophisticated data collection. Current exploration follows tree structures but isn't organized efficiently, leading to redundant data collection.

Method: Proposes TreeCUA with multi-agent collaborative framework for environment exploration, action verification, trajectory summarization, and quality evaluation. Uses tree-based topology to store/replay duplicate nodes, adaptive exploration algorithm to balance depth/breadth, world knowledge guidance, and global memory backtracking. Extends to TreeCUA-DPO using tree node information for improved planning.

Result: Experimental results show significant improvements, with out-of-domain studies demonstrating strong generalization capabilities.

Conclusion: TreeCUA efficiently scales GUI automation through tree-structured verifiable evolution, reducing data costs while improving planning capabilities for computer-use agents.

Abstract: Effectively scaling GUI automation is essential for computer-use agents (CUAs); however, existing work primarily focuses on scaling GUI grounding rather than the more crucial GUI planning, which requires more sophisticated data collection. In reality, the exploration process of a CUA across apps/desktops/web pages typically follows a tree structure, with earlier functional entry points often being explored more frequently. Thus, organizing large-scale trajectories into tree structures can reduce data cost and streamline the data scaling of GUI planning. In this work, we propose TreeCUA to efficiently scale GUI automation with tree-structured verifiable evolution. We propose a multi-agent collaborative framework to explore the environment, verify actions, summarize trajectories, and evaluate quality to generate high-quality and scalable GUI trajectories. To improve efficiency, we devise a novel tree-based topology to store and replay duplicate exploration nodes, and design an adaptive exploration algorithm to balance the depth (i.e., trajectory difficulty) and breadth (i.e., trajectory diversity). Moreover, we develop world knowledge guidance and global memory backtracking to avoid low-quality generation. Finally, we naturally extend and propose the TreeCUA-DPO method from abundant tree node information, improving GUI planning capability by referring to the branch information of adjacent trajectories. Experimental results show that TreeCUA and TreeCUA-DPO offer significant improvements, and out-of-domain (OOD) studies further demonstrate strong generalization. All trajectory node information and code will be available at https://github.com/UITron-hub/TreeCUA.
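
A small sketch of the tree-structured trajectory store: exploration runs that share a prefix reuse the existing nodes instead of re-collecting them. Node fields and the deduplication key are assumptions:

```python
from dataclasses import dataclass, field

@dataclass
class TrajNode:
    state_key: str                                # e.g. a hash of the screen state
    action: str | None = None                     # action that led to this node
    children: dict = field(default_factory=dict)  # action -> TrajNode
    visits: int = 0

class TrajectoryTree:
    def __init__(self):
        self.root = TrajNode(state_key="ROOT")

    def insert(self, trajectory):
        """trajectory: list of (action, state_key) pairs from one exploration run.
        Returns how many steps were served from the tree instead of re-collected."""
        node, reused = self.root, 0
        for action, state_key in trajectory:
            if action in node.children:           # prefix already explored: replay
                node = node.children[action]
                reused += 1
            else:
                node.children[action] = TrajNode(state_key=state_key, action=action)
                node = node.children[action]
            node.visits += 1
        return reused
```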

[163] Semi-supervised Liver Segmentation and Patch-based Fibrosis Staging with Registration-aided Multi-parametric MRI

Boya Wang, Ruizhe Li, Chao Chen, Xin Chen

Main category: cs.CV

TL;DR: A multi-task deep learning framework for liver segmentation and fibrosis staging using multiparametric MRI, addressing limited annotations and domain shifts through semi-supervised learning.

DetailsMotivation: Liver fibrosis is clinically challenging, requiring precise segmentation and staging. Limited annotated medical images and complexities of multi-parametric MRI data create difficulties for accurate diagnosis.

Method: Two-phase approach: 1) LiSeg uses semi-supervised learning integrating segmentation and registration to handle limited labels and domain shifts; 2) LiFS employs patch-based classification for fibrosis staging with visualization capabilities.

Result: Method tested on independent dataset with both in-distribution and out-of-distribution cases using 3-channel and 7-channel MRI data. Code is publicly available.

Conclusion: The framework effectively handles multimodality imaging data, limited labels, and domain shifts for liver segmentation and fibrosis staging tasks.

Abstract: Liver fibrosis poses a substantial challenge in clinical practice, emphasizing the necessity for precise liver segmentation and accurate disease staging. Based on the CARE Liver 2025 Track 4 Challenge, this study introduces a multi-task deep learning framework developed for liver segmentation (LiSeg) and liver fibrosis staging (LiFS) using multiparametric MRI. The LiSeg phase addresses the challenge of limited annotated images and the complexities of multi-parametric MRI data by employing a semi-supervised learning model that integrates image segmentation and registration. By leveraging both labeled and unlabeled data, the model overcomes the difficulties introduced by domain shifts and variations across modalities. In the LiFS phase, we employed a patch-based method that allows the visualization of liver fibrosis stages based on the classification outputs. Our approach effectively handles multimodality imaging data, limited labels, and domain shifts. The proposed method has been tested by the challenge organizer on an independent test set that includes in-distribution (ID) and out-of-distribution (OOD) cases using three-channel MRIs (T1, T2, DWI) and seven-channel MRIs (T1, T2, DWI, GED1-GED4). The code is freely available. Github link: https://github.com/mileywang3061/Care-Liver
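
One standard way to couple segmentation and registration for semi-supervised training, sketched below: a predicted deformation field warps the moving scan's soft mask, and agreement with the fixed scan's mask supervises unlabeled pairs. The paper's exact coupling is not specified; this is an illustration only:

```python
import torch
import torch.nn.functional as F

def warp(mask, flow):
    """mask: (B,1,H,W) soft segmentation; flow: (B,2,H,W) displacement in pixels (x, y)."""
    B, _, H, W = mask.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(mask.device)      # (2,H,W)
    new = grid.unsqueeze(0) + flow                                   # (B,2,H,W)
    new = torch.stack((2 * new[:, 0] / (W - 1) - 1,                  # normalize for
                       2 * new[:, 1] / (H - 1) - 1), dim=-1)         # grid_sample
    return F.grid_sample(mask, new, align_corners=True)

def semi_supervised_loss(seg_fixed, seg_moving, flow, gt_fixed=None):
    """Registration consistency on every pair; supervised BCE only when labels exist."""
    loss = F.mse_loss(warp(seg_moving, flow), seg_fixed)
    if gt_fixed is not None:
        loss = loss + F.binary_cross_entropy(seg_fixed, gt_fixed)
    return loss
```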

[164] GenSeg-R1: RL-Driven Vision-Language Grounding for Fine-Grained Referring Segmentation

Sandesh Hegde, Jaison Saji Chacko, Debarshi Banerjee, Uma Mahesh

Main category: cs.CV

TL;DR: GenSeg-R1: A decoupled reason-then-segment pipeline for referring image segmentation using VLMs to generate spatial prompts and SAM 2 for segmentation, achieving state-of-the-art performance on multiple benchmarks.

DetailsMotivation: To address fine-grained referring image segmentation by decoupling reasoning and segmentation tasks, enabling VLMs to focus on scene understanding while leveraging specialized segmentation models for mask generation.

Method: Uses a two-stage pipeline: 1) Vision-language model (Qwen3-VL) receives image and query, reasons about scene, and outputs structured spatial prompts (bounding box + two interior keypoints per instance); 2) Frozen SAM 2 converts prompts into masks. Trained with Group Relative Policy Optimization (GRPO) without supervised reasoning-chain annotations.

Result: GenSeg-R1-8B achieves 0.7127 cIoU and 0.7382 mIoU on RefCOCOg validation, outperforming Qwen3-VL Instruct baselines by +15.3/+21.9 points and Seg-Zero-7B by +3.3 cIoU. GenSeg-R1-G variant achieves 76.69% target mIoU on GRefCOCO with 82.40% accuracy on negative prompts. On ReasonSeg test, GenSeg-R1-4B reaches 68.40% mIoU, surpassing competitors by 7.0-10.7 points.

Conclusion: The decoupled reason-then-segment approach effectively combines VLMs’ reasoning capabilities with specialized segmentation models, achieving state-of-the-art performance on referring image segmentation tasks while enabling no-target detection.

Abstract: We study fine-grained referring image segmentation via a decoupled reason-then-segment pipeline. A vision-language model (VLM) receives an image and a natural-language query, reasons about the scene, and emits structured spatial prompts: a bounding box plus two interior keypoints for every referred instance. A frozen promptable segmenter (SAM 2) converts these prompts into high-quality masks. Within our GenSeg-R1 framework we finetune Qwen3-VL models (4B and 8B parameters) using Group Relative Policy Optimization (GRPO), requiring no supervised reasoning-chain annotations. On RefCOCOg validation our best model (GenSeg-R1-8B) achieves 0.7127 cIoU and 0.7382 mIoU, substantially outperforming the corresponding Qwen3-VL Instruct baselines (+15.3 and +21.9 points, respectively) and surpassing Seg-Zero-7B [3] by +3.3 cIoU under identical evaluation. We further introduce GenSeg-R1-G, a variant trained on GRefCOCO [9] with a SAM 2 in-the-loop reward that directly optimizes mask quality. On GRefCOCO validation GenSeg-R1-G achieves 76.69% target mIoU with 82.40% accuracy on negative (no-target) prompts, substantially outperforming Seg-R1-7B and Seg-Zero-7B, which lack no-target detection capability. On ReasonSeg test, GenSeg-R1-4B reaches 68.40% mIoU, surpassing Seg-Zero-7B by +7.0 and Seg-R1-7B by +10.7 points.
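
A sketch of the decoupled reason-then-segment step: parse the VLM's structured spatial prompts (schema assumed here) and hand each instance to a frozen SAM 2 predictor. The import path and checkpoint id follow the public sam2 repository and should be verified against your install:

```python
import json
import numpy as np

def parse_prompts(vlm_output: str):
    """Assumed schema: [{"box": [x0, y0, x1, y1], "points": [[x, y], [x, y]]}, ...]"""
    return json.loads(vlm_output)

def segment(image, vlm_output):
    """image: (H, W, 3) uint8 array; returns one binary mask per referred instance."""
    from sam2.sam2_image_predictor import SAM2ImagePredictor
    predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")
    predictor.set_image(image)
    masks = []
    for inst in parse_prompts(vlm_output):
        m, _, _ = predictor.predict(
            box=np.array(inst["box"]),
            point_coords=np.array(inst["points"]),
            point_labels=np.ones(len(inst["points"]), dtype=int),  # interior keypoints
            multimask_output=False,
        )
        masks.append(m[0])
    return masks
```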

[165] Stroke3D: Lifting 2D strokes into rigged 3D model via latent diffusion models

Ruisi Zhao, Haoren Zheng, Zongxin Yang, Hehe Fan, Yi Yang

Main category: cs.CV

TL;DR: Stroke3D generates rigged 3D meshes from 2D drawn strokes and text prompts using a two-stage pipeline: controllable skeleton generation via Sk-VAE/Sk-DiT, and enhanced mesh synthesis via TextuRig dataset and SKA-DPO optimization.

DetailsMotivation: Existing 3D generation methods struggle with animatable geometry, while rigging techniques lack fine-grained structural control over skeleton creation. There's a need for intuitive workflow to create ready-to-animate 3D content from simple user inputs.

Method: Two-stage pipeline: 1) Controllable Skeleton Generation using Skeletal Graph VAE (Sk-VAE) to encode skeleton graph structure, and Skeletal Graph DiT (Sk-DiT) to generate skeletal embedding conditioned on text and 2D strokes; 2) Enhanced Mesh Synthesis using TextuRig dataset (textured rigged meshes from Objaverse-XL) and SKA-DPO preference optimization guided by skeleton-mesh alignment score.

Result: Stroke3D produces plausible skeletons and high-quality meshes, enabling intuitive creation of ready-to-animate 3D content from 2D strokes and text prompts.

Conclusion: Stroke3D is the first framework to generate rigged 3D meshes conditioned on user-drawn 2D strokes, providing fine-grained structural control and addressing limitations of existing 3D generation and rigging methods.

Abstract: Rigged 3D assets are fundamental to 3D deformation and animation. However, existing 3D generation methods face challenges in generating animatable geometry, while rigging techniques lack fine-grained structural control over skeleton creation. To address these limitations, we introduce Stroke3D, a novel framework that directly generates rigged meshes from user inputs: 2D drawn strokes and a descriptive text prompt. Our approach pioneers a two-stage pipeline that separates the generation into: 1) Controllable Skeleton Generation, we employ the Skeletal Graph VAE (Sk-VAE) to encode the skeleton’s graph structure into a latent space, where the Skeletal Graph DiT (Sk-DiT) generates a skeletal embedding. The generation process is conditioned on both the text for semantics and the 2D strokes for explicit structural control, with the VAE’s decoder reconstructing the final high-quality 3D skeleton; and 2) Enhanced Mesh Synthesis via TextuRig and SKA-DPO, where we then synthesize a textured mesh conditioned on the generated skeleton. For this stage, we first enhance an existing skeleton-to-mesh model by augmenting its training data with TextuRig: a dataset of textured and rigged meshes with captions, curated from Objaverse-XL. Additionally, we employ a preference optimization strategy, SKA-DPO, guided by a skeleton-mesh alignment score, to further improve geometric fidelity. Together, our framework enables a more intuitive workflow for creating ready to animate 3D content. To the best of our knowledge, our work is the first to generate rigged 3D meshes conditioned on user-drawn 2D strokes. Extensive experiments demonstrate that Stroke3D produces plausible skeletons and high-quality meshes.

[166] From Lightweight CNNs to SpikeNets: Benchmarking Accuracy-Energy Tradeoffs with Pruned Spiking SqueezeNet

Radib Bin Kabir, Tawsif Tashwar Dipto, Mehedi Ahamed, Sabbir Ahmed, Md Hasanul Kabir

Main category: cs.CV

TL;DR: First systematic benchmark of lightweight SNNs converted from compact CNNs shows up to 15.7x higher energy efficiency than CNNs while maintaining competitive accuracy, with SqueezeNet SNN variant performing best.

DetailsMotivation: SNNs are energy-efficient alternatives to CNNs for edge intelligence, but prior work focused on large models, leaving lightweight CNN-to-SNN conversion pipelines underexplored.

Method: Converted compact CNN architectures (ShuffleNet, SqueezeNet, MnasNet, MixNet) to spiking networks using LIF neurons and surrogate gradient descent. Evaluated on CIFAR-10/100 and TinyImageNet, measuring accuracy, F1-score, parameters, complexity, and energy. Applied structured pruning to remove redundant modules.

Result: SNNs achieved up to 15.7x higher energy efficiency than CNNs with competitive accuracy. SNN-SqueezeNet performed best. Pruned SNN-SqueezeNet-P improved CIFAR-10 accuracy by 6%, reduced parameters by 19%, and achieved nearly same accuracy as CNN-SqueezeNet (only 1% lower) with 88.1% energy reduction.

Conclusion: Lightweight SNNs are practical, low-power alternatives for edge deployment, providing a viable path toward high-performance, low-power edge intelligence.

Abstract: Spiking Neural Networks (SNNs) are increasingly studied as energy-efficient alternatives to Convolutional Neural Networks (CNNs), particularly for edge intelligence. However, prior work has largely emphasized large-scale models, leaving the design and evaluation of lightweight CNN-to-SNN pipelines underexplored. In this paper, we present the first systematic benchmark of lightweight SNNs obtained by converting compact CNN architectures into spiking networks, where activations are modeled with Leaky-Integrate-and-Fire (LIF) neurons and trained using surrogate gradient descent under a unified setup. We construct spiking variants of ShuffleNet, SqueezeNet, MnasNet, and MixNet, and evaluate them on CIFAR-10, CIFAR-100, and TinyImageNet, measuring accuracy, F1-score, parameter count, computational complexity, and energy consumption. Our results show that SNNs can achieve up to 15.7x higher energy efficiency than their CNN counterparts while retaining competitive accuracy. Among these, the SNN variant of SqueezeNet consistently outperforms other lightweight SNNs. To further optimize this model, we apply a structured pruning strategy that removes entire redundant modules, yielding a pruned architecture, SNN-SqueezeNet-P. This pruned model improves CIFAR-10 accuracy by 6% and reduces parameters by 19% compared to the original SNN-SqueezeNet. Crucially, it narrows the gap with CNN-SqueezeNet, achieving nearly the same accuracy (only 1% lower) but with an 88.1% reduction in energy consumption due to sparse spike-driven computations. Together, these findings establish lightweight SNNs as practical, low-power alternatives for edge deployment, highlighting a viable path toward deploying high-performance, low-power intelligence on the edge.
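
The conversion primitive behind these spiking variants is the LIF neuron trained with a surrogate gradient; a minimal PyTorch version is sketched below, with the decay, threshold, and rectangular surrogate window as illustrative defaults:

```python
import torch

class SpikeFn(torch.autograd.Function):
    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v > 0).float()                 # spike when membrane exceeds threshold

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        # Rectangular surrogate around the threshold; the true spike is non-differentiable.
        return grad_out * (v.abs() < 0.5).float()

class LIF(torch.nn.Module):
    def __init__(self, tau=0.9, v_th=1.0):
        super().__init__()
        self.tau, self.v_th = tau, v_th

    def forward(self, x_seq):
        """x_seq: (T, B, ...) input currents over T timesteps; returns the spike train."""
        v, spikes = torch.zeros_like(x_seq[0]), []
        for x in x_seq:
            v = self.tau * v + x               # leaky integration
            s = SpikeFn.apply(v - self.v_th)
            v = v * (1 - s)                    # hard reset on spike
            spikes.append(s)
        return torch.stack(spikes)
```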

[167] Allure of Craquelure: A Variational-Generative Approach to Crack Detection in Paintings

Laura Paul, Holger Rauhut, Martin Burger, Samira Kabri, Tim Roith

Main category: cs.CV

TL;DR: A hybrid approach combining deep generative models with variational methods for automated crack detection in digitized paintings, separating cracks from artistic features.

DetailsMotivation: Automated detection of craquelure (cracks) in digitized paintings is crucial for art conservation and restoration, but challenging due to visual similarity between cracks and artistic features like brush strokes or hair.

Method: Models crack detection as an inverse problem, decomposing images into crack-free painting and crack components. Uses deep generative model as prior for underlying artwork, and Mumford-Shah-type variational functional with crack prior for crack structures through joint optimization.

Result: Produces pixel-level crack localization maps in paintings, enabling detailed analysis of degradation to support conservation efforts.

Conclusion: The hybrid approach effectively addresses the challenge of distinguishing cracks from artistic features in paintings, providing valuable tools for art documentation and conservation.

Abstract: Recent advances in imaging technologies, deep learning and numerical performance have enabled non-invasive detailed analysis of artworks, supporting their documentation and conservation. In particular, automated detection of craquelure in digitized paintings is crucial for assessing degradation and guiding restoration, yet remains challenging due to the possibly complex scenery and the visual similarity between cracks and crack-like artistic features such as brush strokes or hair. We propose a hybrid approach that models crack detection as an inverse problem, decomposing an observed image into a crack-free painting and a crack component. A deep generative model is employed as a powerful prior for the underlying artwork, while crack structures are captured using a Mumford–Shah-type variational functional together with a crack prior. Joint optimization yields a pixel-level map of crack localizations in the painting.
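
A toy version of the joint optimization: simultaneous gradient descent on a generative latent z and a crack map c so that G(z) + c reproduces the observation. The sparsity and smoothness terms below merely stand in for the paper's Mumford–Shah-type functional and crack prior, and G is an assumed pretrained generator:

```python
import torch

def decompose(x, G, z_dim=128, steps=500, lam_sparse=0.1, lam_smooth=0.01, lr=1e-2):
    """x: (1,3,H,W) observed painting; G: latent -> (1,3,H,W) crack-free image (assumed)."""
    z = torch.zeros(1, z_dim, requires_grad=True)
    c = torch.zeros_like(x, requires_grad=True)               # crack component
    opt = torch.optim.Adam([z, c], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        fidelity = (G(z) + c - x).pow(2).mean()               # data term
        sparsity = c.abs().mean()                             # cracks are thin and sparse
        smooth = (c[..., 1:, :] - c[..., :-1, :]).abs().mean()  # mild regularity
        (fidelity + lam_sparse * sparsity + lam_smooth * smooth).backward()
        opt.step()
    return G(z).detach(), c.detach()
```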

[168] Toward Fine-Grained Facial Control in 3D Talking Head Generation

Shaoyang Xie, Xiaofeng Cong, Baosheng Yu, Zhipeng Gui, Jie Gui, Yuan Yan Tang, James Tin-Yau Kwok

Main category: cs.CV

TL;DR: FG-3DGS: A novel framework for audio-driven talking head generation using 3D Gaussian Splatting with frequency-aware disentanglement for precise facial motion control and lip synchronization.

DetailsMotivation: Current 3D Gaussian Splatting methods for talking head generation struggle with precise control over fine-grained facial movements, particularly lip-synchronization inaccuracies and facial jitter, which contribute to the uncanny valley effect.

Method: Proposes frequency-aware disentanglement strategy: low-frequency regions (cheeks, nose, forehead) modeled with standard MLP; high-frequency regions (eyes, mouth) captured separately with dedicated network guided by facial area masks. Uses Gaussian deltas for motion dynamics applied to static Gaussians, rendered via rasterizer. Includes high-frequency-refined post-rendering alignment mechanism learned from audio-video pairs.

Result: Outperforms recent state-of-the-art approaches in producing high-fidelity, lip-synced talking head videos on widely used datasets.

Conclusion: FG-3DGS enables temporally consistent and high-fidelity talking head generation with improved lip synchronization and reduced facial jitter through frequency-aware modeling and post-rendering alignment.

Abstract: Audio-driven talking head generation is a core component of digital avatars, and 3D Gaussian Splatting has shown strong performance in real-time rendering of high-fidelity talking heads. However, achieving precise control over fine-grained facial movements remains a significant challenge, particularly due to lip-synchronization inaccuracies and facial jitter, both of which can contribute to the uncanny valley effect. To address these challenges, we propose Fine-Grained 3D Gaussian Splatting (FG-3DGS), a novel framework that enables temporally consistent and high-fidelity talking head generation. Our method introduces a frequency-aware disentanglement strategy to explicitly model facial regions based on their motion characteristics. Low-frequency regions, such as the cheeks, nose, and forehead, are jointly modeled using a standard MLP, while high-frequency regions, including the eyes and mouth, are captured separately using a dedicated network guided by facial area masks. The predicted motion dynamics, represented as Gaussian deltas, are applied to the static Gaussians to generate the final head frames, which are rendered via a rasterizer using frame-specific camera parameters. Additionally, a high-frequency-refined post-rendering alignment mechanism, learned from large-scale audio-video pairs by a pretrained model, is incorporated to enhance per-frame generation and achieve more accurate lip synchronization. Extensive experiments on widely used datasets for talking head generation demonstrate that our method outperforms recent state-of-the-art approaches in producing high-fidelity, lip-synced talking head videos.
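
A sketch of the frequency-aware split: a small MLP predicts Gaussian deltas for low-frequency regions while a higher-capacity network handles the masked high-frequency regions. Feature dimensions and the audio interface are assumptions; for clarity, both branches are evaluated everywhere and selected per Gaussian:

```python
import torch
import torch.nn as nn

class FreqAwareDeformer(nn.Module):
    def __init__(self, audio_dim=64, gauss_dim=10):   # gauss_dim: position/rotation/scale offsets
        super().__init__()
        self.low = nn.Sequential(nn.Linear(audio_dim + 3, 128), nn.ReLU(),
                                 nn.Linear(128, gauss_dim))
        self.high = nn.Sequential(nn.Linear(audio_dim + 3, 256), nn.ReLU(),
                                  nn.Linear(256, 256), nn.ReLU(),
                                  nn.Linear(256, gauss_dim))   # extra capacity for eyes/mouth

    def forward(self, xyz, audio, high_mask):
        """xyz: (N,3) static Gaussian centers; audio: (audio_dim,); high_mask: (N,) bool."""
        feat = torch.cat([xyz, audio.expand(len(xyz), -1)], dim=-1)
        # The facial-area mask selects which branch deforms each Gaussian.
        delta = torch.where(high_mask[:, None], self.high(feat), self.low(feat))
        return delta          # added to the static Gaussians before rasterization
```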

[169] Robust Vision Systems for Connected and Autonomous Vehicles: Security Challenges and Attack Vectors

Sandeep Gupta, Roberto Passerone

Main category: cs.CV

TL;DR: Analysis of vision system robustness in autonomous vehicles, focusing on security vulnerabilities and attack vectors against CAV vision systems.

DetailsMotivation: The paper is motivated by the critical need for robust vision systems in Connected and Autonomous Vehicles (CAVs) to achieve Level-5 autonomy, as safe navigation depends on accurate detection of objects, lanes, and traffic signs. There's a security concern about potential vulnerabilities in these vision systems.

Method: The authors analyze key sensors and vision components for CAV navigation to derive a reference architecture for CAV vision systems (CAVVS). They then identify potential attack surfaces within this architecture and elaborate on attack vectors targeting each surface, evaluating their implications for confidentiality, integrity, and availability (CIA triad).

Result: The study provides a comprehensive understanding of attack vector dynamics in CAV vision systems, identifying specific vulnerabilities and their security implications. This establishes a basis for formulating robust security measures to protect autonomous vehicle vision systems.

Conclusion: Robust security measures are crucial for CAV vision systems to maintain confidentiality, integrity, and availability, which are essential for achieving safe Level-5 autonomous driving capabilities.

Abstract: This article investigates the robustness of vision systems in Connected and Autonomous Vehicles (CAVs), which is critical for developing Level-5 autonomous driving capabilities. Safe and reliable CAV navigation undeniably depends on robust vision systems that enable accurate detection of objects, lane markings, and traffic signage. We analyze the key sensors and vision components essential for CAV navigation to derive a reference architecture for CAV vision system (CAVVS). This reference architecture provides a basis for identifying potential attack surfaces of CAVVS. Subsequently, we elaborate on identified attack vectors targeting each attack surface, rigorously evaluating their implications for confidentiality, integrity, and availability (CIA). Our study provides a comprehensive understanding of attack vector dynamics in vision systems, which is crucial for formulating robust security measures that can uphold the principles of the CIA triad.

[170] Self-Supervised Learning as Discrete Communication

Kawtar Zaher, Ilyass Moummad, Olivier Buisson, Alexis Joly

Main category: cs.CV

TL;DR: Self-supervised learning framed as discrete communication between teacher/student networks using binary codes, improving representation structure and performance over continuous methods.

DetailsMotivation: Most SSL methods learn continuous visual representations by aligning views, offering limited control over information structure across dimensions. The authors want to create more structured, interpretable representations through discrete communication.

Method: Frames SSL as discrete communication where teacher transmits semantic information through fixed-capacity binary channel to student. Student predicts multi-label binary messages from teacher. Uses element-wise binary cross-entropy for discrete agreement and coding-rate regularization for efficient channel use. Periodically reinitializes projection head to encourage embeddings predictive across multiple discrete encodings.

Result: Extensive experiments show consistent improvements over continuous agreement baselines on image classification, retrieval, dense visual prediction, and domain shift adaptation. Learned binary codes form compact, informative discrete language capturing semantic factors reusable across classes.

Conclusion: Discrete communication framework for SSL produces more structured representations than continuous alignment methods, with binary codes serving as interpretable semantic language that improves performance across vision tasks.

Abstract: Most self-supervised learning (SSL) methods learn continuous visual representations by aligning different views of the same input, offering limited control over how information is structured across representation dimensions. In this work, we frame visual self-supervised learning as a discrete communication process between a teacher and a student network, where semantic information is transmitted through a fixed-capacity binary channel. Rather than aligning continuous features, the student predicts multi-label binary messages produced by the teacher. Discrete agreement is enforced through an element-wise binary cross-entropy objective, while a coding-rate regularization term encourages effective utilization of the constrained channel, promoting structured representations. We further show that periodically reinitializing the projection head strengthens this effect by encouraging embeddings that remain predictive across multiple discrete encodings. Extensive experiments demonstrate consistent improvements over continuous agreement baselines on image classification, retrieval, and dense visual prediction tasks, as well as under domain shift through self-supervised adaptation. Beyond backbone representations, we analyze the learned binary codes and show that they form a compact and informative discrete language, capturing semantic factors reusable across classes.
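
A minimal sketch of the discrete-agreement objective: the student matches the teacher's binary message with element-wise BCE, plus a coding-rate term encouraging full use of the channel. The log-det rate estimate and its constants are a standard stand-in, not necessarily the paper's exact form:

```python
import torch
import torch.nn.functional as F

def discrete_agreement_loss(student_logits, teacher_logits, eps=0.5, lam=0.01):
    """student_logits, teacher_logits: (B, K) outputs, K = channel capacity in bits."""
    with torch.no_grad():
        message = (teacher_logits > 0).float()        # teacher's binary message
    bce = F.binary_cross_entropy_with_logits(student_logits, message)

    z = torch.tanh(student_logits)                    # relaxed codes in (-1, 1)
    B, K = z.shape
    cov = z.T @ z / B
    rate = 0.5 * torch.logdet(torch.eye(K, device=z.device) + (K / eps ** 2) * cov)
    return bce - lam * rate                           # minus sign: maximize channel use
```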

[171] Where Do Images Come From? Analyzing Captions to Geographically Profile Datasets

Abhipsa Basu, Yugam Bahl, Kirti Bhagat, Preethi Seshadri, R. Venkatesh Babu, Danish Pruthi

Main category: cs.CV

TL;DR: Analysis of geographic bias in multimodal datasets reveals severe under-representation of South American and African countries, with strong correlation between GDP and representation, affecting text-to-image model outputs.

DetailsMotivation: Text-to-image models often fail to generate geographically representative images, raising concerns about the representativeness of their training data and motivating investigation into the geographic origins of training examples.

Method: Geographically profile large-scale multimodal datasets by mapping image-caption pairs to countries based on location information extracted from captions using LLMs. Study English captions from three datasets (Re-LAION, DataComp1B, Conceptual Captions) across 20 common entities, and analyze non-English subsets from Re-LAION.

Result: US, UK, and Canada account for 48.0% of samples, while South American and African countries are severely under-represented (1.8% and 3.8% respectively). Strong correlation between GDP and representation (ρ=0.82). Non-English subsets skew toward countries where languages are predominantly spoken. Higher representation doesn’t translate to greater diversity. Stable Diffusion generations appear realistic but have limited coverage compared to real-world images.

Conclusion: Multimodal datasets exhibit significant geographic bias favoring wealthy Western countries, which affects text-to-image model outputs and highlights the need for more geographically diverse training data.

Abstract: Recent studies show that text-to-image models often fail to generate geographically representative images, raising concerns about the representativeness of their training data and motivating the question: which parts of the world do these training examples come from? We geographically profile large-scale multimodal datasets by mapping image-caption pairs to countries based on location information extracted from captions using LLMs. Studying English captions from three widely used datasets (Re-LAION, DataComp1B, and Conceptual Captions) across 20 common entities (e.g., house, flag), we find that the United States, the United Kingdom, and Canada account for 48.0% of samples, while South American and African countries are severely under-represented with only 1.8% and 3.8% of images, respectively. We observe a strong correlation between a country’s GDP and its representation in the data (ρ = 0.82). Examining non-English subsets for 4 languages from the Re-LAION dataset, we find that representation skews heavily toward countries where these languages are predominantly spoken. Additionally, we find that higher representation does not necessarily translate to greater visual or semantic diversity. Finally, analyzing country-specific images generated by Stable Diffusion v1.3 trained on Re-LAION, we show that while generations appear realistic, they are severely limited in their coverage compared to real-world images.

[172] SciFlow-Bench: Evaluating Structure-Aware Scientific Diagram Generation via Inverse Parsing

Tong Zhang, Honglin Lin, Zhou Liu, Chong Chen, Wentao Zhang

Main category: cs.CV

TL;DR: SciFlow-Bench is a structure-first benchmark for evaluating scientific diagram generation that focuses on structural correctness rather than visual similarity, using a round-trip protocol to parse generated images back into graphs for comparison.

DetailsMotivation: Current text-to-image models often produce visually plausible but structurally incorrect scientific diagrams, and existing benchmarks either use image-centric metrics or evaluate intermediate representations rather than final pixel outputs, leaving a gap in evaluating structural correctness of generated diagrams.

Method: Built from real scientific PDFs, SciFlow-Bench pairs source figures with ground-truth graphs and uses a closed-loop, round-trip protocol that inverse-parses generated diagram images back into structured graphs for comparison. This is enabled by a hierarchical multi-agent system coordinating planning, perception, and structural reasoning.

Result: Experiments show that preserving structural correctness remains a fundamental challenge for text-to-image models, particularly for diagrams with complex topology, highlighting the need for structure-aware evaluation.

Conclusion: The paper introduces a novel benchmark that emphasizes structural correctness over visual similarity for scientific diagram generation, revealing significant limitations in current models and advocating for structure-aware evaluation approaches.

Abstract: Scientific diagrams convey explicit structural information, yet modern text-to-image models often produce visually plausible but structurally incorrect results. Existing benchmarks either rely on image-centric or subjective metrics insensitive to structure, or evaluate intermediate symbolic representations rather than final rendered images, leaving pixel-based diagram generation underexplored. We introduce SciFlow-Bench, a structure-first benchmark for evaluating scientific diagram generation directly from pixel-level outputs. Built from real scientific PDFs, SciFlow-Bench pairs each source framework figure with a canonical ground-truth graph and evaluates models as black-box image generators under a closed-loop, round-trip protocol that inverse-parses generated diagram images back into structured graphs for comparison. This design enforces evaluation by structural recoverability rather than visual similarity alone, and is enabled by a hierarchical multi-agent system that coordinates planning, perception, and structural reasoning. Experiments show that preserving structural correctness remains a fundamental challenge, particularly for diagrams with complex topology, underscoring the need for structure-aware evaluation.
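
Once a generated diagram has been inverse-parsed into a graph (by the benchmark's multi-agent system, not shown), structural recoverability reduces to graph comparison; a simple node/edge F1 scorer under an assumed normalized-label matching is sketched below:

```python
def f1(pred: set, gold: set) -> float:
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

def structure_score(pred_graph, gold_graph):
    """Graphs as (nodes: set[str], edges: set[tuple[str, str]]), labels pre-normalized."""
    pred_nodes, pred_edges = pred_graph
    gold_nodes, gold_edges = gold_graph
    return 0.5 * f1(pred_nodes, gold_nodes) + 0.5 * f1(pred_edges, gold_edges)
```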

[173] CompSplat: Compression-aware 3D Gaussian Splatting for Real-world Video

Hojun Song, Heejung Choi, Aro Kim, Chae-yeong Song, Gahyeon Kim, Soo Ye Kim, Jaehyup Lee, Sang-hyo Park

Main category: cs.CV

TL;DR: CompSplat: A compression-aware training framework for novel view synthesis from real-world videos that addresses compression artifacts and geometric inconsistencies in long sequences.

DetailsMotivation: Real-world videos for novel view synthesis face challenges from long sequences with irregular camera trajectories, unknown poses, and lossy compression that introduces inconsistencies degrading geometry and rendering quality. Current approaches don't adequately handle diverse compression patterns in long videos.

Method: Proposes CompSplat framework with compression-aware frame weighting and adaptive pruning strategy to explicitly model frame-wise compression characteristics, mitigating inter-frame inconsistency and accumulated geometric errors, especially under heavy compression.

Result: Extensive experiments on Tanks and Temples, Free, and Hike benchmarks show CompSplat achieves state-of-the-art rendering quality and pose accuracy, significantly surpassing recent NVS approaches under severe compression conditions.

Conclusion: CompSplat effectively addresses compression artifacts in long video sequences for novel view synthesis, improving robustness and geometric consistency through compression-aware modeling.

Abstract: High-quality novel view synthesis (NVS) from real-world videos is crucial for applications such as cultural heritage preservation, digital twins, and immersive media. However, real-world videos typically contain long sequences with irregular camera trajectories and unknown poses, leading to pose drift, feature misalignment, and geometric distortion during reconstruction. Moreover, lossy compression amplifies these issues by introducing inconsistencies that gradually degrade geometry and rendering quality. While recent studies have addressed either long-sequence NVS or unposed reconstruction, compression-aware approaches still focus on specific artifacts or limited scenarios, leaving diverse compression patterns in long videos insufficiently explored. In this paper, we propose CompSplat, a compression-aware training framework that explicitly models frame-wise compression characteristics to mitigate inter-frame inconsistency and accumulated geometric errors. CompSplat incorporates compression-aware frame weighting and an adaptive pruning strategy to enhance robustness and geometric consistency, particularly under heavy compression. Extensive experiments on challenging benchmarks, including Tanks and Temples, Free, and Hike, demonstrate that CompSplat achieves state-of-the-art rendering quality and pose accuracy, significantly surpassing most recent state-of-the-art NVS approaches under severe compression conditions.
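
A sketch of compression-aware frame weighting: frames with stronger estimated blocking artifacts contribute less to the photometric loss. The 8-pixel block-boundary gradient ratio and the exponential weighting are our stand-ins for the paper's unspecified estimator:

```python
import torch

def blockiness(img, block=8):
    """img: (3,H,W) in [0,1]. Gradient energy on 8-px block boundaries vs. overall."""
    dx = (img[:, :, 1:] - img[:, :, :-1]).abs().mean(dim=0)   # (H, W-1)
    on_boundary = dx[:, block - 1::block].mean()              # seams between blocks
    return (on_boundary / (dx.mean() + 1e-8)).item()

def frame_weights(frames):
    """Per-frame loss weights in (0, 1]; less-compressed frames count more."""
    b = torch.tensor([blockiness(f) for f in frames])
    w = torch.exp(-(b - b.min()))          # exponential down-weighting (assumption)
    return w / w.max()
```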

[174] SAKED: Mitigating Hallucination in Large Vision-Language Models via Stability-Aware Knowledge Enhanced Decoding

Zhaoxu Li, Chenqi Kong, Peijun Bao, Song Xia, Yi Tu, Yi Yu, Xinghao Jiang, Xudong Jiang

Main category: cs.CV

TL;DR: SAKED is a training-free method that mitigates hallucinations in Large Vision-Language Models by quantifying knowledge stability across layers and using the most reliable internal knowledge for faithful token generation.

DetailsMotivation: Hallucinations in LVLMs pose security and reliability risks. Inspired by human uncertainty patterns, the authors investigate how instability in a model's internal knowledge contributes to hallucinations.

Method: Proposes Stability-Aware Knowledge-Enhanced Decoding (SAKED) which introduces a layer-wise Knowledge Stability Score (KSS) to quantify knowledge stability. Contrasts most stability-aware and stability-agnostic layers to suppress decoding noise and dynamically leverage reliable internal knowledge.

Result: SAKED achieves state-of-the-art performance for hallucination mitigation across various models, tasks, and benchmarks. The method is training-free and can be seamlessly integrated into different architectures.

Conclusion: The paper successfully addresses LVLM hallucinations by analyzing knowledge instability patterns and proposing an effective, training-free decoding strategy that improves model reliability.

Abstract: Hallucinations in Large Vision-Language Models (LVLMs) pose significant security and reliability risks in real-world applications. Inspired by the observation that humans are more error-prone when uncertain or hesitant, we investigate how instability in a model's internal knowledge contributes to LVLM hallucinations. We conduct extensive empirical analyses from three perspectives, namely attention heads, model layers, and decoding tokens, and identify three key hallucination patterns: (i) visual activation drift across attention heads, (ii) pronounced knowledge fluctuations across layers, and (iii) visual focus distraction between neighboring output tokens. Building on these findings, we propose Stability-Aware Knowledge-Enhanced Decoding (SAKED), which introduces a layer-wise Knowledge Stability Score (KSS) to quantify knowledge stability throughout the model. By contrasting the most stability-aware and stability-agnostic layers, SAKED suppresses decoding noise and dynamically leverages the most reliable internal knowledge for faithful token generation. Moreover, SAKED is training-free and can be seamlessly integrated into different architectures. Extensive experiments demonstrate that SAKED achieves state-of-the-art performance for hallucination mitigation on various models, tasks, and benchmarks.
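
A sketch of stability-aware decoding: score each layer's agreement with its neighbors (one possible form of the Knowledge Stability Score) and contrast the most stable layer's early-exit logits against the least stable one. The cosine-based KSS and the contrast formula are assumptions:

```python
import torch

def kss(hidden_states):
    """hidden_states: list of (T, D) per-layer states. Stable = consistent with neighbors."""
    scores = []
    for i in range(1, len(hidden_states) - 1):
        prev_sim = torch.cosine_similarity(hidden_states[i], hidden_states[i - 1], dim=-1)
        next_sim = torch.cosine_similarity(hidden_states[i], hidden_states[i + 1], dim=-1)
        scores.append((prev_sim + next_sim).mean() / 2)
    return torch.stack(scores)             # one score per interior layer

def stability_contrast(layer_logits, hidden_states, alpha=0.5):
    """layer_logits: (L, V) early-exit logits obtained via the shared LM head."""
    s = kss(hidden_states)
    stable, unstable = int(s.argmax()) + 1, int(s.argmin()) + 1   # offset: kss skips layer 0
    # Amplify the stable layer's evidence and subtract the unstable layer's noise.
    return (1 + alpha) * layer_logits[stable] - alpha * layer_logits[unstable]
```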

[175] ARK: A Dual-Axis Multimodal Retrieval Benchmark along Reasoning and Knowledge

Yijie Lin, Guofeng Ding, Haochen Zhou, Haobin Li, Mouxing Yang, Xi Peng

Main category: cs.CV

TL;DR: ARK: A multimodal retrieval benchmark focusing on professional knowledge domains and complex reasoning skills across 16 visual data types, with targeted hard negatives to prevent shortcut matching.

DetailsMotivation: Existing multimodal retrieval benchmarks focus on daily-life images and lack diagnostics for professional knowledge and complex reasoning, creating a gap in evaluating sophisticated multimodal understanding.

Method: Introduces ARK benchmark with two complementary perspectives: (1) knowledge domains (5 domains with 17 subtypes) characterizing content expertise, and (2) reasoning skills (6 categories) characterizing inference types. Evaluates retrieval with unimodal and multimodal queries across 16 visual data types, using targeted hard negatives to prevent shortcut matching.

Result: Evaluation of 23 representative text-based and multimodal retrievers shows pronounced gap between knowledge-intensive and reasoning-intensive retrieval, with fine-grained visual and spatial reasoning as persistent bottlenecks. Simple enhancements like re-ranking and rewriting yield improvements but substantial headroom remains.

Conclusion: ARK addresses limitations of existing benchmarks by focusing on professional knowledge and complex reasoning, revealing significant challenges in multimodal retrieval that current models struggle with, particularly in fine-grained visual and spatial reasoning tasks.

Abstract: Existing multimodal retrieval benchmarks largely emphasize semantic matching on daily-life images and offer limited diagnostics of professional knowledge and complex reasoning. To address this gap, we introduce ARK, a benchmark designed to analyze multimodal retrieval from two complementary perspectives: (i) knowledge domains (five domains with 17 subtypes), which characterize the content and expertise retrieval relies on, and (ii) reasoning skills (six categories), which characterize the type of inference over multimodal evidence required to identify the correct candidate. Specifically, ARK evaluates retrieval with both unimodal and multimodal queries and candidates, covering 16 heterogeneous visual data types. To avoid shortcut matching during evaluation, most queries are paired with targeted hard negatives that require multi-step reasoning. We evaluate 23 representative text-based and multimodal retrievers on ARK and observe a pronounced gap between knowledge-intensive and reasoning-intensive retrieval, with fine-grained visual and spatial reasoning emerging as persistent bottlenecks. We further show that simple enhancements such as re-ranking and rewriting yield consistent improvements, but substantial headroom remains.

[176] Kelix Technique Report

Boyang Ding, Chenglong Chu, Dunju Zang, Han Li, Jiangxia Cao, Kun Gai, Muhao Wei, Ruiming Tang, Shiyao Wang, Siyang Mao, Xinchen Luo, Yahui Liu, Zhixin Ling, Zhuoran Yang, Ziming Li, Chengru Song, Guorui Zhou, Guowang Zhang, Hao Peng, Hao Wang, Jiaxin Deng, Jin Ouyang, Jinghao Zhang, Lejian Ren, Qianqian Wang, Qigen Hu, Tao Wang, Xingmei Wang, Yiping Yang, Zixing Zhang, Ziqi Wang

Main category: cs.CV

TL;DR: Kelix is a fully discrete autoregressive unified model that closes the understanding gap between discrete and continuous visual representations for multimodal LLMs.

DetailsMotivation: Current vision-language models use hybrid interfaces (discrete text tokens + continuous ViT features), which creates biases toward understanding over generation and prevents full leverage of self-supervised learning on non-text data. Existing discrete visual tokenization methods lose information due to limited code capacity, resulting in weaker understanding than continuous-feature models.

Method: Presents Kelix, a fully discrete autoregressive unified model that uses improved discrete visual tokenization to maintain information fidelity while enabling unified autoregressive modeling across modalities.

Result: Kelix closes the understanding gap between discrete and continuous visual representations, achieving comparable or better performance than hybrid models while enabling unified understanding and generation.

Conclusion: Fully discrete autoregressive modeling is viable for multimodal LLMs when using improved discrete tokenization that preserves information, enabling unified comprehension and generation under self-supervision.

Abstract: Autoregressive large language models (LLMs) scale well by expressing diverse tasks as sequences of discrete natural-language tokens and training with next-token prediction, which unifies comprehension and generation under self-supervision. Extending this paradigm to multimodal data requires a shared, discrete representation across modalities. However, most vision-language models (VLMs) still rely on a hybrid interface: discrete text tokens paired with continuous Vision Transformer (ViT) features. Because supervision is largely text-driven, these models are often biased toward understanding and cannot fully leverage large-scale self-supervised learning on non-text data. Recent work has explored discrete visual tokenization to enable fully autoregressive multimodal modeling, showing promising progress toward unified understanding and generation. Yet existing discrete vision tokens frequently lose information due to limited code capacity, resulting in noticeably weaker understanding than continuous-feature VLMs. We present Kelix, a fully discrete autoregressive unified model that closes the understanding gap between discrete and continuous visual representations.

[177] Reason-IAD: Knowledge-Guided Dynamic Latent Reasoning for Explainable Industrial Anomaly Detection

Peng Chen, Chao Huang, Yunkang Cao, Chengliang Liu, Wenqiang Wang, Mingbo Yang, Li Shen, Wenqi Ren, Xiaochun Cao

Main category: cs.CV

TL;DR: Reason-IAD: A knowledge-guided dynamic latent reasoning framework for industrial anomaly detection that incorporates category-specific textual knowledge and uses entropy-driven iterative reasoning in latent space with selective visual patch injection.

DetailsMotivation: Existing multimodal large language models struggle with industrial anomaly detection because they're pretrained on general-domain data and fail to capture category-specific anomalies, limiting both detection accuracy and interpretability.

Method: 1) Retrieval-augmented knowledge module incorporating category-specific textual descriptions; 2) Entropy-driven latent reasoning mechanism using optimizable latent think tokens for iterative exploration; 3) Dynamic visual injection strategy that selectively incorporates informative image patches into the latent sequence.

Result: Extensive experiments show Reason-IAD consistently outperforms state-of-the-art methods in industrial anomaly detection.

Conclusion: Reason-IAD provides an effective framework for explainable industrial anomaly detection by combining domain-specific knowledge guidance with dynamic latent reasoning and selective visual attention.

Abstract: Industrial anomaly detection demands precise reasoning over fine-grained defect patterns. However, existing multimodal large language models (MLLMs), pretrained on general-domain data, often struggle to capture category-specific anomalies, thereby limiting both detection accuracy and interpretability. To address these limitations, we propose Reason-IAD, a knowledge-guided dynamic latent reasoning framework for explainable industrial anomaly detection. Reason-IAD comprises two core components. First, a retrieval-augmented knowledge module incorporates category-specific textual descriptions into the model input, enabling context-aware reasoning over domain-specific defects. Second, an entropy-driven latent reasoning mechanism conducts iterative exploration within a compact latent space using optimizable latent think tokens, guided by an entropy-based reward that encourages confident and stable predictions. Furthermore, a dynamic visual injection strategy selectively incorporates the most informative image patches into the latent sequence, directing the reasoning process toward regions critical for anomaly detection. Extensive experimental results demonstrate that Reason-IAD consistently outperforms state-of-the-art methods. The code will be publicly available at https://github.com/chenpeng052/Reason-IAD.
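
Two small pieces of the mechanism described above, sketched under assumptions: an entropy-based reward that favors confident predictions, and top-k selection of informative image patches for injection into the latent sequence:

```python
import torch

def entropy_reward(answer_logits):
    """Reward confident, stable predictions: negative entropy of the answer distribution."""
    p = torch.softmax(answer_logits, dim=-1)
    return (p * torch.log(p + 1e-9)).sum()

def inject_patches(patch_feats, relevance, k=16):
    """patch_feats: (P, D) image-patch embeddings; relevance: (P,) scores, e.g. attention."""
    idx = torch.topk(relevance, k).indices
    return patch_feats[idx]                # appended to the latent reasoning sequence
```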

[178] Code2World: A GUI World Model via Renderable Code Generation

Yuhao Zheng, Li’an Zhong, Yi Wang, Rui Dai, Kaikui Liu, Xiangxiang Chu, Linyuan Lv, Philip Torr, Kevin Qinghong Lin

Main category: cs.CV

TL;DR: Code2World: A vision-language coder that predicts next GUI states via renderable HTML code generation, addressing limitations of text/pixel-based approaches for autonomous GUI agents.

DetailsMotivation: Existing text- and pixel-based approaches for GUI world models struggle to achieve both high visual fidelity and fine-grained structural controllability simultaneously. Autonomous GUI agents need better simulation capabilities for human-like foresight in interface interactions.

Method: 1) Construct AndroidCode dataset by translating GUI trajectories into high-fidelity HTML with visual-feedback revision (80K+ screen-action pairs). 2) Adapt VLMs via SFT for format layout following, then apply Render-Aware Reinforcement Learning using rendered outcome as reward signal for visual semantic fidelity and action consistency.

Result: Code2World-8B achieves top-performing next UI prediction, rivaling GPT-5 and Gemini-3-Pro-Image. Significantly enhances downstream navigation success rates, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld navigation.

Conclusion: Code2World demonstrates that renderable code generation effectively addresses the visual fidelity vs. structural controllability trade-off in GUI world modeling, enabling better autonomous GUI agents through improved simulation capabilities.

Abstract: Autonomous GUI agents interact with environments by perceiving interfaces and executing actions. As a virtual sandbox, the GUI World model empowers agents with human-like foresight by enabling action-conditioned prediction. However, existing text- and pixel-based approaches struggle to simultaneously achieve high visual fidelity and fine-grained structural controllability. To this end, we propose Code2World, a vision-language coder that simulates the next visual state via renderable code generation. Specifically, to address the data scarcity problem, we construct AndroidCode by translating GUI trajectories into high-fidelity HTML and refining synthesized code through a visual-feedback revision mechanism, yielding a corpus of over 80K high-quality screen-action pairs. To adapt existing VLMs into code prediction, we first perform SFT as a cold start for format layout following, then further apply Render-Aware Reinforcement Learning which uses rendered outcome as the reward signal by enforcing visual semantic fidelity and action consistency. Extensive experiments demonstrate that Code2World-8B achieves the top-performing next UI prediction, rivaling the competitive GPT-5 and Gemini-3-Pro-Image. Notably, Code2World significantly enhances downstream navigation success rates in a flexible manner, boosting Gemini-2.5-Flash by +9.5% on AndroidWorld navigation. The code is available at https://github.com/AMAP-ML/Code2World.
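
A sketch of the render-aware reward signal: the generated HTML is rendered and compared against the ground-truth next screenshot. The render_fn hook (e.g., a headless browser) and the similarity measure are assumptions, not the paper's reward:

```python
import numpy as np

def render_reward(html: str, gt_image: np.ndarray, render_fn) -> float:
    """render_fn: HTML -> (H, W, 3) uint8 screenshot, or None on failure (assumed hook)."""
    pred = render_fn(html)
    if pred is None:                        # unrenderable code gets the lowest reward
        return -1.0
    h = min(pred.shape[0], gt_image.shape[0])
    w = min(pred.shape[1], gt_image.shape[1])
    mse = np.mean((pred[:h, :w].astype(float) - gt_image[:h, :w].astype(float)) ** 2)
    return float(np.exp(-mse / 1000.0))     # higher = visually closer render
```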

[179] Free-GVC: Towards Training-Free Extreme Generative Video Compression with Temporal Coherence

Xiaoyue Ling, Chuqin Zhou, Chunyi Li, Yunuo Chen, Yuan Tian, Guo Lu, Wenjun Zhang

Main category: cs.CV

TL;DR: Free-GVC: A training-free generative video compression framework using latent trajectory compression guided by video diffusion prior, achieving superior perceptual quality and temporal coherence at ultra-low bitrates.

DetailsMotivation: Existing generative video compression methods have limited exploitation of temporal correlations, causing noticeable flicker and degraded temporal coherence at ultra-low bitrates. There's a need for better video compression that maintains perceptual quality and temporal consistency.

Method: Proposes Free-GVC, a training-free framework that reformulates video coding as latent trajectory compression guided by video diffusion prior. Operates at GOP level with Adaptive Quality Control module for optimal diffusion step prediction and Inter-GOP Alignment module for temporal coherence through frame overlap and latent fusion.

Result: Achieves 93.29% BD-Rate reduction in DISTS over DCVC-RT neural codec. User study confirms superior perceptual quality and temporal coherence at ultra-low bitrates.

Conclusion: Free-GVC effectively addresses flicker and temporal coherence issues in ultra-low bitrate video compression through diffusion-guided latent trajectory compression and inter-GOP alignment mechanisms.

Abstract: Building on recent advances in video generation, generative video compression has emerged as a new paradigm for achieving visually pleasing reconstructions. However, existing methods exhibit limited exploitation of temporal correlations, causing noticeable flicker and degraded temporal coherence at ultra-low bitrates. In this paper, we propose Free-GVC, a training-free generative video compression framework that reformulates video coding as latent trajectory compression guided by a video diffusion prior. Our method operates at the group-of-pictures (GOP) level, encoding video segments into a compact latent space and progressively compressing them along the diffusion trajectory. To ensure perceptually consistent reconstruction across GOPs, we introduce an Adaptive Quality Control module that dynamically constructs an online rate-perception surrogate model to predict the optimal diffusion step for each GOP. In addition, an Inter-GOP Alignment module establishes frame overlap and performs latent fusion between adjacent groups, thereby mitigating flicker and enhancing temporal coherence. Experiments show that Free-GVC achieves an average of 93.29% BD-Rate reduction in DISTS over the latest neural codec DCVC-RT, and a user study further confirms its superior perceptual quality and temporal coherence at ultra-low bitrates.
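
A minimal sketch of inter-GOP latent fusion: adjacent groups share a few overlapping frames whose latents are cross-faded before decoding. The linear blend schedule is our assumption:

```python
import torch

def fuse_gops(lat_prev, lat_next, overlap):
    """lat_prev, lat_next: (T, C, h, w) latents of adjacent GOPs sharing `overlap` frames."""
    alpha = torch.linspace(1.0, 0.0, overlap).view(-1, 1, 1, 1)   # fade prev -> next
    fused = alpha * lat_prev[-overlap:] + (1 - alpha) * lat_next[:overlap]
    return torch.cat([lat_prev[:-overlap], fused, lat_next[overlap:]], dim=0)
```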

[180] BabyMamba-HAR: Lightweight Selective State Space Models for Efficient Human Activity Recognition on Resource Constrained Devices

Mridankan Mandal

Main category: cs.CV

TL;DR: BabyMamba-HAR introduces two lightweight Mamba-inspired architectures for human activity recognition on resource-constrained devices, achieving competitive accuracy with significantly reduced computational costs.

DetailsMotivation: Human activity recognition on wearable/mobile devices faces memory and computational constraints while needing to maintain accuracy across heterogeneous sensor configurations. Selective state space models offer linear-time sequence processing but their design for TinyML applications remains unexplored.

Method: Two novel architectures: (1) CI-BabyMamba-HAR with channel-independent stem processing each sensor channel through shared weights but instance-independent transformations, and (2) Crossover-BiDir-BabyMamba-HAR with early fusion stem achieving channel count independent complexity. Both use weight-tied bidirectional scanning and lightweight temporal attention pooling.

Result: Crossover-BiDir-BabyMamba-HAR achieves 86.52% average macro F1-score with ~27K parameters and 2.21M MACs, matching TinyHAR (86.16%) while requiring 11x fewer MACs on high channel datasets. Bidirectional scanning improves F1-score by up to 8.42%, and gated temporal attention provides up to 8.94% gain over mean pooling.

Conclusion: The paper establishes practical design principles for deploying selective state space models as efficient TinyML backbones for human activity recognition, demonstrating that lightweight Mamba architectures can achieve competitive performance with significantly reduced computational requirements.

Abstract: Human activity recognition (HAR) on wearable and mobile devices is constrained by memory footprint and computational budget, yet competitive accuracy must be maintained across heterogeneous sensor configurations. Selective state space models (SSMs) offer linear-time sequence processing with input-dependent gating, presenting a compelling alternative to quadratic-complexity attention mechanisms. However, the design space for deploying SSMs in the TinyML regime remains largely unexplored. In this paper, BabyMamba-HAR is introduced, a framework comprising two novel lightweight Mamba-inspired architectures optimized for resource-constrained HAR: (1) CI-BabyMamba-HAR, using a channel-independent stem that processes each sensor channel through shared-weight but instance-independent transformations to prevent cross-channel noise propagation, and (2) Crossover-BiDir-BabyMamba-HAR, using an early-fusion stem that achieves channel-count-independent computational complexity. Both variants incorporate weight-tied bidirectional scanning and lightweight temporal attention pooling. Through evaluation across eight diverse benchmarks, it is demonstrated that Crossover-BiDir-BabyMamba-HAR achieves 86.52% average macro F1-score with approximately 27K parameters and 2.21M MACs, matching TinyHAR (86.16%) while requiring 11x fewer MACs on high-channel datasets. Systematic ablation studies reveal that bidirectional scanning contributes up to 8.42% F1-score improvement, and gated temporal attention provides up to 8.94% F1-score gain over mean pooling. These findings establish practical design principles for deploying selective state space models as efficient TinyML backbones for HAR.

[181] MVISTA-4D: View-Consistent 4D World Model with Test-Time Action Inference for Robotic Manipulation

Jiaxu Wang, Yicheng Jiang, Tianlun He, Jingkai Sun, Qiang Zhang, Junhao He, Jiahang Cao, Zesen Gan, Mingyuan Sun, Qiming Shao, Xiangyu Yue

Main category: cs.CV

TL;DR: A 4D world model for robotic manipulation that generates geometrically consistent multi-view RGBD scenes from single-view input, enabling complete 4D scene dynamics prediction and action optimization.

DetailsMotivation: Existing world-model approaches for robotic manipulation are limited to either purely image-based forecasting or reasoning over partial 3D geometry, lacking the ability to predict complete 4D scene dynamics needed for effective manipulation.

Method: Proposes an embodied 4D world model with cross-view and cross-modality feature fusion for consistent RGBD generation from single-view input. Uses test-time action optimization through backpropagation in the generative model and a residual inverse dynamics model to convert predicted futures into executable actions.
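
The test-time action optimization amounts to gradient descent on a trajectory-level latent through a differentiable world model. The sketch below is hedged: `world_model`, the latent size, and the MSE objective are placeholders for whatever the paper actually uses.

```python
import torch
import torch.nn.functional as F

def infer_action_latent(world_model, obs, target_future, steps=50, lr=1e-2):
    """Test-time optimization: find the trajectory latent whose imagined
    rollout best matches the predicted future (all names hypothetical)."""
    z = torch.zeros(1, 16, requires_grad=True)  # latent size assumed
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        loss = F.mse_loss(world_model(obs, z), target_future)
        opt.zero_grad()
        loss.backward()   # backprop through the generative model
        opt.step()
    return z.detach()     # fed to the residual inverse-dynamics model
```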

Result: Experiments on three datasets demonstrate strong performance on both 4D scene generation and downstream manipulation tasks, with ablations providing insights into key design choices.

Conclusion: The proposed 4D world model enables geometrically consistent scene generation and effective action planning for robotic manipulation, addressing limitations of existing approaches.

Abstract: World-model-based imagine-then-act has become a promising paradigm for robotic manipulation, yet existing approaches typically support either purely image-based forecasting or reasoning over partial 3D geometry, limiting their ability to predict complete 4D scene dynamics. This work proposes a novel embodied 4D world model that enables geometrically consistent, arbitrary-view RGBD generation: given only a single-view RGBD observation as input, the model imagines the remaining viewpoints, which can then be back-projected and fused to assemble a more complete 3D structure across time. To efficiently learn the multi-view, cross-modality generation, we explicitly design cross-view and cross-modality feature fusion that jointly encourage consistency between RGB and depth and enforce geometric alignment across views. Beyond prediction, converting generated futures into actions is often handled by inverse dynamics, which is ill-posed because multiple actions can explain the same transition. We address this with a test-time action optimization strategy that backpropagates through the generative model to infer a trajectory-level latent best matching the predicted future, and a residual inverse dynamics model that turns this trajectory prior into accurate executable actions. Experiments on three datasets demonstrate strong performance on both 4D scene generation and downstream manipulation, and ablations provide practical insights into the key design choices.

[182] AdaTSQ: Pushing the Pareto Frontier of Diffusion Transformers via Temporal-Sensitivity Quantization

Shaoqiu Zhang, Zizhong Ding, Kaicheng Yang, Junyi Wu, Xianglong Yan, Xi Li, Bingnan Duan, Jianping Fang, Yulun Zhang

Main category: cs.CV

TL;DR: AdaTSQ is a post-training quantization framework for Diffusion Transformers that optimizes efficiency-quality trade-offs by exploiting temporal sensitivity in diffusion processes through dynamic bit-width allocation and Fisher-guided calibration.

DetailsMotivation: Diffusion Transformers (DiTs) have become state-of-the-art for image/video generation but suffer from massive computational cost and memory footprint that hinder edge deployment. Existing PTQ methods designed for LLMs perform poorly on DiTs because they ignore the unique temporal dynamics of diffusion processes.

Method: 1) Pareto-aware timestep-dynamic bit-width allocation: Models quantization policy search as a constrained pathfinding problem using beam search guided by end-to-end reconstruction error to assign layer-wise bit-widths across timesteps. 2) Fisher-guided temporal calibration: Leverages temporal Fisher information to prioritize calibration data from highly sensitive timesteps, integrated with Hessian-based weight optimization.
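
The pathfinding view of bit-width allocation lends itself to a compact beam search. The sketch below keeps the lowest-error partial policies under an average-bit budget, simplified to one width per timestep (the paper allocates per layer and timestep); `eval_error`, the candidate widths, and the constraint form are assumptions standing in for the end-to-end reconstruction error and constraints.

```python
def beam_search_bitwidths(num_steps, widths, eval_error, budget, beam=4):
    """Assign one bit-width per timestep via beam search.

    eval_error(t, b): stand-in for the reconstruction error of
    quantizing timestep t at b bits (hypothetical callable).
    """
    beams = [((), 0.0)]                      # (policy so far, error so far)
    for t in range(num_steps):
        expanded = []
        for policy, err in beams:
            for b in widths:                 # e.g. (4, 6, 8)
                new = policy + (b,)
                if sum(new) / len(new) <= budget:  # average-bit constraint
                    expanded.append((new, err + eval_error(t, b)))
        beams = sorted(expanded, key=lambda p: p[1])[:beam]
    return beams[0][0]                       # lowest-error full policy
```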

Result: Extensive experiments on four advanced DiTs (Flux-Dev, Flux-Schnell, Z-Image, Wan2.1) show AdaTSQ significantly outperforms state-of-the-art methods like SVDQuant and ViDiT-Q in efficiency-quality trade-offs.

Conclusion: AdaTSQ effectively addresses the unique challenges of quantizing Diffusion Transformers by exploiting temporal sensitivity, pushing the Pareto frontier of efficiency and quality for edge deployment of DiTs.

Abstract: Diffusion Transformers (DiTs) have emerged as the state-of-the-art backbone for high-fidelity image and video generation. However, their massive computational cost and memory footprint hinder deployment on edge devices. While post-training quantization (PTQ) has proven effective for large language models (LLMs), directly applying existing methods to DiTs yields suboptimal results due to the neglect of the unique temporal dynamics inherent in diffusion processes. In this paper, we propose AdaTSQ, a novel PTQ framework that pushes the Pareto frontier of efficiency and quality by exploiting the temporal sensitivity of DiTs. First, we propose a Pareto-aware timestep-dynamic bit-width allocation strategy. We model the quantization policy search as a constrained pathfinding problem. We utilize a beam search algorithm guided by end-to-end reconstruction error to dynamically assign layer-wise bit-widths across different timesteps. Second, we propose a Fisher-guided temporal calibration mechanism. It leverages temporal Fisher information to prioritize calibration data from highly sensitive timesteps, seamlessly integrating with Hessian-based weight optimization. Extensive experiments on four advanced DiTs (e.g., Flux-Dev, Flux-Schnell, Z-Image, and Wan2.1) demonstrate that AdaTSQ significantly outperforms state-of-the-art methods like SVDQuant and ViDiT-Q. Our code will be released at https://github.com/Qiushao-E/AdaTSQ.

[183] SARS: A Novel Face and Body Shape and Appearance Aware 3D Reconstruction System extends Morphable Models

Gulraiz Khan, Kenneth Y. Wertheim, Kevin Pimbblet, Waqas Ahmed

Main category: cs.CV

TL;DR: SARS is a modular 3D reconstruction system that extracts body and face information from single images to create detailed 3D human models, addressing limitations of previous 3DMMs that ignored semantic facial features.

DetailsMotivation: Previous 3D Morphable Models (3DMMs) focused only on global face structure and geometry while ignoring important semantic facial features like age, gender, and facial landmarks. There was a need for a system that could accommodate these high-level facial characteristics in 3D human reconstruction.

Method: SARS is a modular pipeline that extracts both body and face information from a single image. It combines identity and expression blendshapes with a basic face mesh, controlling variability through diverse parameters including shape, texture, illumination, and camera parameters.

Result: The system properly rebuilds 3D models of the human full body from single images, incorporating both structural geometry and semantic facial features that were previously ignored.

Conclusion: SARS represents an advancement in 3D human reconstruction by addressing the limitations of previous 3DMMs through a modular approach that captures both geometric structure and semantic facial characteristics.

Abstract: 3D Morphable Models (3DMMs) take 2D images as inputs and recreate the structure and physical appearance of 3D objects, especially human faces and bodies. A 3DMM combines identity and expression blendshapes with a basic face mesh to create a detailed 3D model. The variability in 3D Morphable Models can be controlled by tuning diverse parameters, which are high-level image descriptors such as shape, texture, illumination, and camera parameters. Previous research in 3D human reconstruction concentrated solely on global face structure or geometry, ignoring semantic facial features such as age, gender, and facial landmarks characterizing facial boundaries, curves, dips, and wrinkles. To accommodate changes in these high-level facial characteristics, this work introduces a shape- and appearance-aware 3D reconstruction system (which we name SARS), a modular pipeline that extracts body and face information from a single image to properly rebuild a 3D model of the full human body.

[184] A benchmark for video-based laparoscopic skill analysis and assessment

Isabel Funke, Sebastian Bodenstedt, Felix von Bechtolsheim, Florian Oehme, Michael Maruschke, Stefanie Herrlich, Jürgen Weitz, Marius Distler, Sören Torge Mees, Stefanie Speidel

Main category: cs.CV

TL;DR: Introduces LASANA dataset for laparoscopic skill assessment with 1270 stereo videos, skill ratings, and error labels to address limited training data for deep learning models.

DetailsMotivation: Laparoscopic surgery requires extensive training, and while deep learning shows promise for automatic video-based skill assessment, development is hindered by limited annotated datasets.

Method: Created LASANA dataset with 1270 stereo video recordings of four basic laparoscopic training tasks, annotated with structured skill ratings from three independent raters and binary error labels, with predefined data splits for benchmarking.

Result: Provides a comprehensive dataset reflecting natural skill variation from a training course, along with baseline deep learning model results for future comparisons.

Conclusion: LASANA dataset addresses the data scarcity problem in surgical skill assessment research and enables benchmarking of video-based deep learning approaches for skill evaluation and error recognition.

Abstract: Laparoscopic surgery is a complex surgical technique that requires extensive training. Recent advances in deep learning have shown promise in supporting this training by enabling automatic video-based assessment of surgical skills. However, the development and evaluation of deep learning models is currently hindered by the limited size of available annotated datasets. To address this gap, we introduce the Laparoscopic Skill Analysis and Assessment (LASANA) dataset, comprising 1270 stereo video recordings of four basic laparoscopic training tasks. Each recording is annotated with a structured skill rating, aggregated from three independent raters, as well as binary labels indicating the presence or absence of task-specific errors. The majority of recordings originate from a laparoscopic training course, thereby reflecting a natural variation in the skill of participants. To facilitate benchmarking of both existing and novel approaches for video-based skill assessment and error recognition, we provide predefined data splits for each task. Furthermore, we present baseline results from a deep learning model as a reference point for future comparisons.

[185] Monocular Normal Estimation via Shading Sequence Estimation

Zongrui Li, Xinhua Ma, Minghui Hu, Yunqing Zhao, Yingchen Yu, Qian Zheng, Chang Liu, Xudong Jiang, Song Bai

Main category: cs.CV

TL;DR: RoSE reformulates monocular normal estimation as shading sequence estimation using image-to-video generative models, achieving state-of-the-art performance by better capturing geometric details through shading variations.

DetailsMotivation: Existing monocular normal estimation methods suffer from 3D misalignment - while normal maps appear correct, reconstructed surfaces fail to align with geometric details. This stems from the difficulty in distinguishing varying geometry through subtle color variations in normal maps.

Method: Proposes RoSE which reformulates normal estimation as shading sequence estimation. Uses image-to-video generative models to predict shading sequences under varying lighting, then converts to normal maps via ordinary least-squares. Trained on MultiShade synthetic dataset with diverse shapes, materials, and lighting.
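
The shading-to-normal conversion is a per-pixel ordinary least-squares problem: under a Lambertian assumption with known light directions, the shading sequence at each pixel satisfies s ≈ L n. A minimal numpy sketch follows; the Lambertian model and the availability of light directions are assumptions about the setup, not details taken from the paper.

```python
import numpy as np

def normals_from_shading(shadings, lights):
    """Recover per-pixel unit normals from a shading sequence.

    shadings: (K, H, W) shading intensities under K light conditions
    lights:   (K, 3) light directions (assumed known or estimated)
    Assumes Lambertian shading s = L @ n, with albedo folded into |n|.
    """
    K, H, W = shadings.shape
    S = shadings.reshape(K, -1)                      # (K, H*W)
    # Ordinary least squares: N = argmin || L N - S ||^2
    N, *_ = np.linalg.lstsq(lights, S, rcond=None)   # (3, H*W)
    N = N / (np.linalg.norm(N, axis=0, keepdims=True) + 1e-8)
    return N.reshape(3, H, W)
```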

Result: RoSE achieves state-of-the-art performance on real-world benchmark datasets for object-based monocular normal estimation, demonstrating improved geometric alignment.

Conclusion: Reformulating normal estimation as shading sequence estimation addresses 3D misalignment issues, with shading sequences being more sensitive to geometric information than direct normal map prediction.

Abstract: Monocular normal estimation aims to estimate the normal map from a single RGB image of an object under arbitrary lights. Existing methods rely on deep models to directly predict normal maps. However, they often suffer from 3D misalignment: while the estimated normal maps may appear to have a correct appearance, the reconstructed surfaces often fail to align with the geometric details. We argue that this misalignment stems from the current paradigm: the model struggles to distinguish and reconstruct varying geometry represented in normal maps, as the differences in underlying geometry are reflected only through relatively subtle color variations. To address this issue, we propose a new paradigm that reformulates normal estimation as shading sequence estimation, where shading sequences are more sensitive to various geometric information. Building on this paradigm, we present RoSE, a method that leverages image-to-video generative models to predict shading sequences. The predicted shading sequences are then converted into normal maps by solving a simple ordinary least-squares problem. To enhance robustness and better handle complex objects, RoSE is trained on a synthetic dataset, MultiShade, with diverse shapes, materials, and light conditions. Experiments demonstrate that RoSE achieves state-of-the-art performance on real-world benchmark datasets for object-based monocular normal estimation.

[186] GeoFormer: A Swin Transformer-Based Framework for Scene-Level Building Height and Footprint Estimation from Sentinel Imagery

Han Jinzhen, JinByeong Lee, JiSung Kim, MinKyung Cho, DaHee Kim, HongSik Yun

Main category: cs.CV

TL;DR: GeoFormer is a Swin Transformer framework that jointly estimates building height and footprint from Sentinel-1/2 imagery and DEM data, achieving state-of-the-art accuracy across 54 cities with strong cross-continent generalization.

DetailsMotivation: Accurate 3D urban data is crucial for climate modeling, disaster risk assessment, and urban planning, but remains scarce due to proprietary sensors and poor cross-city generalization of existing methods.

Method: Uses Swin Transformer framework with geo-blocked splitting strategy for spatial independence; fuses Sentinel-1 SAR, Sentinel-2 optical, and DEM data; jointly estimates building height and footprint on 100m grid.
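
A geo-blocked split can be sketched as assigning every sample to a coarse lat/lon block and sending whole blocks to either train or test, so that no block straddles both sets. The block size and test fraction below are illustrative choices, not the paper's values.

```python
import numpy as np

def geo_blocked_split(lats, lons, block_deg=1.0, test_frac=0.2, seed=0):
    """Return boolean train/test masks with block-level separation."""
    blocks = np.stack([np.floor(lats / block_deg),
                       np.floor(lons / block_deg)], axis=1)
    uniq, inv = np.unique(blocks, axis=0, return_inverse=True)
    rng = np.random.default_rng(seed)
    test_blocks = rng.choice(len(uniq), int(test_frac * len(uniq)),
                             replace=False)
    is_test = np.isin(inv, test_blocks)   # every sample in a test block
    return ~is_test, is_test
```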

Result: Achieves BH RMSE of 3.19m and BF RMSE of 0.05 across 54 diverse cities, improving 7.5% and 15.3% over CNN baselines; maintains under 3.5m BH RMSE in cross-continent transfer; DEM is crucial for height estimation.

Conclusion: GeoFormer provides accurate, generalizable 3D urban mapping using only open-source data, with multi-source fusion (optical+SAR+DEM) yielding best results; all code and global products are publicly released.

Abstract: Accurate three-dimensional urban data are critical for climate modelling, disaster risk assessment, and urban planning, yet remain scarce due to reliance on proprietary sensors or poor cross-city generalisation. We propose GeoFormer, an open-source Swin Transformer framework that jointly estimates building height (BH) and footprint (BF) on a 100 m grid using only Sentinel-1/2 imagery and open DEM data. A geo-blocked splitting strategy ensures strict spatial independence between training and test sets. Evaluated over 54 diverse cities, GeoFormer achieves a BH RMSE of 3.19 m and a BF RMSE of 0.05, improving 7.5% and 15.3% over the strongest CNN baseline, while maintaining under 3.5 m BH RMSE in cross-continent transfer. Ablation studies confirm that DEM is indispensable for height estimation and that optical reflectance dominates over SAR, though multi-source fusion yields the best overall accuracy. All code, weights, and global products are publicly released.

[187] Unbalanced optimal transport for robust longitudinal lesion evolution with registration-aware and appearance-guided priors

Melika Qahqaie, Dominik Neumann, Tobias Heimann, Andreas Maier, Veronika A. Zimmer

Main category: cs.CV

TL;DR: A registration-aware lesion matching method using unbalanced optimal transport for tracking lesion evolution in longitudinal CT scans, handling appearance/disappearance, merging/splitting without retraining.

DetailsMotivation: Evaluating lesion evolution in cancer patients' longitudinal CT scans is crucial for treatment assessment, but establishing reliable lesion correspondence across time is challenging due to lesions appearing, disappearing, merging, or splitting, which standard geometric matchers struggle with.

Method: Proposes a registration-aware matcher based on unbalanced optimal transport (UOT) that accommodates unequal lesion mass and adapts priors to patient-level tumor-load changes. The transport cost blends size-normalized geometry, local registration trust from deformation-field Jacobian, and optional patch-level appearance consistency. The transport plan is sparsified by relative pruning.
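
For intuition, unbalanced OT with KL-relaxed marginals admits a simple Sinkhorn-style iteration; the sketch below is the standard entropic UOT scaling algorithm, not the authors' exact solver or cost, and shows how unequal lesion mass enters through the softened scaling exponent.

```python
import numpy as np

def unbalanced_sinkhorn(a, b, C, eps=0.05, rho=1.0, iters=200):
    """Entropic unbalanced OT with KL marginal penalties.

    a, b: lesion masses at the two timepoints; C: blended cost matrix.
    fi = rho / (rho + eps) < 1 softens the marginal constraints,
    so mass may appear or disappear instead of being forced to match.
    """
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    fi = rho / (rho + eps)
    for _ in range(iters):
        u = (a / (K @ v)) ** fi
        v = (b / (K.T @ u)) ** fi
    P = u[:, None] * K * v[None, :]
    return P  # prune small entries to read off matches, splits, merges
```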

Result: On longitudinal CT data, the approach achieves consistently higher edge-detection precision and recall, improved lesion-state recall, and superior lesion-graph component F1 scores compared to distance-only baselines.

Conclusion: The proposed UOT-based matcher effectively handles complex lesion evolution scenarios (appearance/disappearance, merging/splitting) without retraining or heuristic rules, outperforming geometric-only approaches for longitudinal CT analysis.

Abstract: Evaluating lesion evolution in longitudinal CT scans of cancer patients is essential for assessing treatment response, yet establishing reliable lesion correspondence across time remains challenging. Standard bipartite matchers, which rely on geometric proximity, struggle when lesions appear, disappear, merge, or split. We propose a registration-aware matcher based on unbalanced optimal transport (UOT) that accommodates unequal lesion mass and adapts priors to patient-level tumor-load changes. Our transport cost blends (i) size-normalized geometry, (ii) local registration trust from the deformation-field Jacobian, and (iii) optional patch-level appearance consistency. The resulting transport plan is sparsified by relative pruning, yielding one-to-one matches as well as new, disappearing, merging, and splitting lesions without retraining or heuristic rules. On longitudinal CT data, our approach achieves consistently higher edge-detection precision and recall, improved lesion-state recall, and superior lesion-graph component F1 scores versus distance-only baselines.

[188] VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization

Yikun Liu, Yuan Liu, Shangzhe Di, Haicheng Wang, Zhongyin Zhao, Le Tian, Xiao Zhou, Jie Zhou, Jiangchao Yao, Yanfeng Wang, Weidi Xie

Main category: cs.CV

TL;DR: VersaViT: A method to improve MLLM vision encoders for dense prediction tasks through multi-task collaborative post-training, creating a versatile vision backbone for both language reasoning and pixel-level understanding.

DetailsMotivation: MLLMs show strong high-level semantic alignment but their vision encoders perform poorly on dense prediction tasks like semantic segmentation and depth estimation. The authors want to create a versatile vision backbone that works well for both language-mediated reasoning and pixel-level understanding.

Method: Propose VersaViT, a multi-task collaborative post-training framework that optimizes the vision backbone using lightweight task heads with multi-granularity supervision. This improves dense feature representations while maintaining language alignment capabilities.

Result: Extensive experiments across various downstream tasks demonstrate VersaViT’s effectiveness, creating a versatile vision backbone suitable for both language-mediated reasoning and pixel-level understanding tasks.

Conclusion: The proposed VersaViT framework successfully addresses deficiencies in MLLM vision encoders for dense prediction tasks, yielding a well-rounded vision transformer that serves as a versatile backbone for multimodal understanding.

Abstract: Multimodal Large Language Models (MLLMs) have recently achieved remarkable success in visual-language understanding, demonstrating superior high-level semantic alignment within their vision encoders. An important question thus arises: Can these encoders serve as versatile vision backbones, capable of reliably performing classic vision-centric tasks as well? To address the question, we make the following contributions: (i) we identify that the vision encoders within MLLMs exhibit deficiencies in their dense feature representations, as evidenced by their suboptimal performance on dense prediction tasks (e.g., semantic segmentation, depth estimation); (ii) we propose VersaViT, a well-rounded vision transformer that instantiates a novel multi-task framework for collaborative post-training. This framework facilitates the optimization of the vision backbone via lightweight task heads with multi-granularity supervision; (iii) extensive experiments across various downstream tasks demonstrate the effectiveness of our method, yielding a versatile vision backbone suited for both language-mediated reasoning and pixel-level understanding.

[189] Bladder Vessel Segmentation using a Hybrid Attention-Convolution Framework

Franziska Krauß, Matthias Ege, Zoltan Lovasz, Albrecht Bartz-Schmidt, Igor Tsaur, Oliver Sawodny, Carina Veil

Main category: cs.CV

TL;DR: Hybrid Attention-Convolution architecture for bladder vessel segmentation in endoscopic videos, combining Transformers for global topology and CNNs for fine details, with physics-aware pretraining to handle sparse labels and artifacts.

DetailsMotivation: Bladder cancer surveillance requires tracking tumor sites across interventions, but the deformable bladder lacks stable landmarks. Blood vessels offer a patient-specific "vascular fingerprint" for navigation, but automated segmentation is challenged by imperfect endoscopic data including sparse labels, artifacts, continuous deformation, and mucosal folds that mimic vessels.

Method: Hybrid Attention-Convolution (HAC) architecture combining Transformers to capture global vessel topology with CNNs to learn residual refinement maps for thin-vessel details. Transformer trained on optimized ground truth excluding short/terminal branches. Physics-aware pretraining using clinically grounded augmentations on unlabeled data to address data scarcity.

Result: Achieves high accuracy (0.94), superior precision (0.61) and clDice (0.66) on BlaVeS dataset compared to state-of-the-art medical segmentation models. Successfully suppresses false positives from mucosal folds that dynamically appear and vanish during surgery.

Conclusion: HAC provides reliable structural stability required for clinical navigation in bladder cancer surveillance by effectively segmenting vascular fingerprints despite challenging endoscopic conditions.

Abstract: Urinary bladder cancer surveillance requires tracking tumor sites across repeated interventions, yet the deformable and hollow bladder lacks stable landmarks for orientation. While blood vessels visible during endoscopy offer a patient-specific “vascular fingerprint” for navigation, automated segmentation is challenged by imperfect endoscopic data, including sparse labels, artifacts like bubbles or variable lighting, continuous deformation, and mucosal folds that mimic vessels. State-of-the-art vessel segmentation methods often fail to address these domain-specific complexities. We introduce a Hybrid Attention-Convolution (HAC) architecture that combines Transformers, which capture a global vessel topology prior, with a CNN that learns a residual refinement map to precisely recover thin-vessel details. To prioritize structural connectivity, the Transformer is trained on optimized ground truth data that exclude short and terminal branches. Furthermore, to address data scarcity, we employ physics-aware pretraining, a self-supervised strategy that applies clinically grounded augmentations to unlabeled data. Evaluated on the BlaVeS dataset, consisting of endoscopic video frames, our approach achieves high accuracy (0.94) and superior precision (0.61) and clDice (0.66) compared to state-of-the-art medical segmentation models. Crucially, our method successfully suppresses false positives from mucosal folds that dynamically appear and vanish as the bladder fills and empties during surgery. Hence, HAC provides the reliable structural stability required for clinical navigation.

[190] Learning to Detect Baked Goods with Limited Supervision

Thomas H. Schmitt, Maximilian Bundscherer, Tobias Bocklet

Main category: cs.CV

TL;DR: Automated leftover product monitoring for German bakeries using object detection with limited supervision, achieving strong performance despite scarce annotations.

DetailsMotivation: Automating leftover product monitoring in bakeries to reduce labor costs and improve accuracy, addressing the broader challenge of deploying computer vision in specialized industries with scarce annotated datasets.

Method: Two training workflows: 1) Weakly supervised training combining OWLv2 and Grounding DINO localization with image-level supervision, 2) Fine-tuning on video frames using Segment Anything 2 for pseudo-label propagation to improve viewpoint robustness. Uses YOLOv11 for detection.

Result: Model achieves mAP of 0.91 with only image-level supervision. Fine-tuning with pseudo-labels improves performance by 19.3% under non-ideal conditions. Combined approach surpasses fully-supervised baseline under non-ideal deployment conditions.

Conclusion: Effective object detection for specialized bakery monitoring can be achieved with limited supervision, demonstrating practical solutions for industries with scarce annotated data.

Abstract: Monitoring leftover products provides valuable insights that can be used to optimize future production. This is especially important for German bakeries because freshly baked goods have a very short shelf life. Automating this process can reduce labor costs, improve accuracy, and streamline operations. We propose automating this process using an object detection model to identify baked goods from images. However, the large diversity of German baked goods makes fully supervised training prohibitively expensive and limits scalability. Although open-vocabulary detectors (e.g., OWLv2, Grounding DINO) offer flexibility, we demonstrate that they are insufficient for our task. While motivated by bakeries, our work addresses the broader challenges of deploying computer vision in industries where tasks are specialized and annotated datasets are scarce. We compile dataset splits with varying supervision levels, covering 19 classes of baked goods. We propose two training workflows to train an object detection model with limited supervision. First, we combine OWLv2 and Grounding DINO localization with image-level supervision to train the model in a weakly supervised manner. Second, we improve viewpoint robustness by fine-tuning on video frames annotated using Segment Anything 2 as a pseudo-label propagation model. Using these workflows, we train YOLOv11 for our detection task due to its favorable speed-accuracy tradeoff. Relying solely on image-level supervision, the model achieves a mean Average Precision (mAP) of 0.91. Fine-tuning with pseudo-labels raises model performance by 19.3% under non-ideal deployment conditions. Combining these workflows trains a model that surpasses our fully-supervised baseline model under non-ideal deployment conditions, despite relying only on image-level supervision.

[191] Coupled Inference in Diffusion Models for Semantic Decomposition

Calvin Yeung, Ali Zakeri, Zhuowen Zou, Mohsen Imani

Main category: cs.CV

TL;DR: A framework for semantic decomposition using coupled inference in diffusion models that outperforms resonator networks on synthetic tasks.

DetailsMotivation: Visual scenes can be described as compositions of latent factors, requiring effective decomposition for recognition, reasoning, and editing. While resonator networks (coupled Hopfield networks) were proposed for this, recent connections between Hopfield networks and diffusion models motivate a diffusion-based approach.

Method: Frames semantic decomposition as an inverse problem and couples diffusion processes using reconstruction-driven guidance that encourages factor estimates to compose to match the bound vector. Introduces a novel iterative sampling scheme and shows attention-based resonator networks are a special case.

Result: Empirically demonstrates that the coupled inference framework outperforms resonator networks across a range of synthetic semantic decomposition tasks.

Conclusion: Proposes a novel diffusion-based framework for semantic decomposition that generalizes resonator networks and shows superior performance on decomposition tasks.

Abstract: Many visual scenes can be described as compositions of latent factors. Effective recognition, reasoning, and editing often require not only forming such compositional representations, but also solving the decomposition problem. One popular choice for constructing these representations is through the binding operation. Resonator networks, which can be understood as coupled Hopfield networks, were proposed as a way to perform decomposition on such bound representations. Recent works have shown notable similarities between Hopfield networks and diffusion models. Motivated by these observations, we introduce a framework for semantic decomposition using coupled inference in diffusion models. Our method frames semantic decomposition as an inverse problem and couples the diffusion processes using a reconstruction-driven guidance term that encourages the composition of factor estimates to match the bound vector. We also introduce a novel iterative sampling scheme that improves the performance of our model. Finally, we show that attention-based resonator networks are a special case of our framework. Empirically, we demonstrate that our coupled inference framework outperforms resonator networks across a range of synthetic semantic decomposition tasks.

[192] Efficient Special Stain Classification

Oskar Thaeter, Christian Grashei, Anette Haas, Elisa Schmoeckel, Han Li, Peter J. Schüffler

Main category: cs.CV

TL;DR: Automated classification of histopathology stains using whole slide images, comparing Multi-Instance Learning vs lightweight thumbnail-based approaches for quality control in digital pathology.

DetailsMotivation: Pathologists use various special stains beyond standard H&E for diagnosis, requiring accurate metadata maintenance for clinical archives and computational pathology datasets. Automated stain classification is needed for quality control.

Method: Two approaches compared: 1) Multi-Instance Learning (MIL) pipeline using patch-level features, and 2) proposed lightweight thumbnail-based approach using downsampled whole slide images. Evaluated on 14 most common special stains plus standard and frozen-section H&E.
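
The thumbnail route amounts to a standard small-image classifier over heavily downsampled WSIs, which is where the two-orders-of-magnitude throughput gain comes from. A minimal sketch follows; the class count matches the 16-class setup, but the backbone choice is an assumption.

```python
import torch
import torchvision

def build_thumbnail_classifier(num_classes=16):
    """Classify the stain from a downsampled WSI thumbnail with a
    small off-the-shelf CNN (backbone is an illustrative choice)."""
    model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
    model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
    return model
```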

Result: MIL achieved highest performance on internal data (macro F1: 0.941 for 16 classes), while thumbnail approach remained competitive (0.897). On external TCGA data, thumbnail model generalized better (weighted F1: 0.843 vs 0.807). Thumbnail approach increased throughput by two orders of magnitude (5.635 vs 0.018 slides/s).

Conclusion: Thumbnail-based classification provides a scalable and robust solution for routine visual quality control in digital pathology workflows, balancing performance with computational efficiency.

Abstract: Stains are essential in histopathology to visualize specific tissue characteristics, with Haematoxylin and Eosin (H&E) serving as the clinical standard. However, pathologists frequently utilize a variety of special stains for the diagnosis of specific morphologies. Maintaining accurate metadata for these slides is critical for quality control in clinical archives and for the integrity of computational pathology datasets. In this work, we compare two approaches for automated classification of stains using whole slide images, covering the 14 most commonly used special stains in our institute alongside standard and frozen-section H&E. We evaluate a Multi-Instance Learning (MIL) pipeline and a proposed lightweight thumbnail-based approach. On internal test data, MIL achieved the highest performance (macro F1: 0.941 for 16 classes; 0.969 for 14 merged classes), while the thumbnail approach remained competitive (0.897 and 0.953, respectively). On external TCGA data, the thumbnail model generalized best (weighted F1: 0.843 vs. 0.807 for MIL). The thumbnail approach also increased throughput by two orders of magnitude (5.635 vs. 0.018 slides/s for MIL with all patches). We conclude that thumbnail-based classification provides a scalable and robust solution for routine visual quality control in digital pathology workflows.

[193] Faster-GS: Analyzing and Improving Gaussian Splatting Optimization

Florian Hahlbohm, Linus Franke, Martin Eisemann, Marcus Magnor

Main category: cs.CV

TL;DR: Faster-GS: A consolidated and optimized 3D Gaussian Splatting system that achieves 5× faster training while maintaining visual quality, with extensions to 4D non-rigid scene reconstruction.

DetailsMotivation: Current 3DGS research suffers from fragmented implementations where algorithmic improvements are entangled with implementation-level optimizations, making fair comparisons difficult. There's a need to consolidate the most effective strategies and explore underexplored aspects of the framework.

Method: Consolidates and evaluates the most effective strategies from prior 3DGS research, adds novel optimizations, and investigates underexplored aspects including numerical stability, Gaussian truncation, and gradient approximation. Extends optimizations to 4D Gaussian reconstruction for non-rigid scenes.

Result: Achieves up to 5× faster training while maintaining visual quality across comprehensive benchmarks. Establishes a new cost-effective and resource-efficient baseline for 3DGS optimization. Successfully applies optimizations to 4D Gaussian reconstruction for efficient non-rigid scene optimization.

Conclusion: Faster-GS provides a rigorously optimized algorithm that significantly accelerates 3D Gaussian Splatting training while preserving quality, offering a solid baseline for future research and enabling efficient 4D scene reconstruction.

Abstract: Recent advances in 3D Gaussian Splatting (3DGS) have focused on accelerating optimization while preserving reconstruction quality. However, many proposed methods entangle implementation-level improvements with fundamental algorithmic modifications or trade performance for fidelity, leading to a fragmented research landscape that complicates fair comparison. In this work, we consolidate and evaluate the most effective and broadly applicable strategies from prior 3DGS research and augment them with several novel optimizations. We further investigate underexplored aspects of the framework, including numerical stability, Gaussian truncation, and gradient approximation. The resulting system, Faster-GS, provides a rigorously optimized algorithm that we evaluate across a comprehensive suite of benchmarks. Our experiments demonstrate that Faster-GS achieves up to 5× faster training while maintaining visual quality, establishing a new cost-effective and resource-efficient baseline for 3DGS optimization. Furthermore, we demonstrate that these optimizations can be applied to 4D Gaussian reconstruction, leading to efficient non-rigid scene optimization.

[194] Perception with Guarantees: Certified Pose Estimation via Reachability Analysis

Tobias Ladner, Yasser Shoukry, Matthias Althoff

Main category: cs.CV

TL;DR: Certified 3D pose estimation from camera images using formal verification methods to guarantee safety in worst-case scenarios

DetailsMotivation: Safety-critical agents in cyber-physical systems need precise pose localization for formal safety guarantees, but existing methods (lidar, cameras, GPS) may provide insufficient accuracy or be untrustworthy for worst-case safety analysis

Method: Uses formal bounding of pose computed through reachability analysis and formal neural network verification, leveraging camera images and known target geometry

Result: Approach efficiently and accurately localizes agents in both synthetic and real-world experiments with certified bounds

Conclusion: Provides certified pose estimation for safety-critical applications using only camera images and geometry, enabling formal safety guarantees without relying on external services

Abstract: Agents in cyber-physical systems are increasingly entrusted with safety-critical tasks. Ensuring safety of these agents often requires localizing the pose for subsequent actions. Pose estimates can, e.g., be obtained from various combinations of lidar sensors, cameras, and external services such as GPS. Crucially, in safety-critical domains, a rough estimate is insufficient to formally determine safety, i.e., guaranteeing safety even in the worst-case scenario, and external services might additionally not be trustworthy. We address this problem by presenting a certified pose estimation in 3D solely from a camera image and a well-known target geometry. This is realized by formally bounding the pose, which is computed by leveraging recent results from reachability analysis and formal neural network verification. Our experiments demonstrate that our approach efficiently and accurately localizes agents in both synthetic and real-world experiments.

[195] Fake-HR1: Rethinking reasoning of vision language model for synthetic image detection

Changjiang Jiang, Xinkuan Sha, Fengchang Yu, Jingjing Liu, Jian Liu, Mingqi Fang, Chenfeng Zhang, Wei Lu

Main category: cs.CV

TL;DR: Fake-HR1 is a hybrid-reasoning model for synthetic image detection that adaptively decides when to use Chain-of-Thought reasoning to balance accuracy and efficiency.

DetailsMotivation: While Chain-of-Thought reasoning improves synthetic image detection, excessive reasoning causes resource overhead and latency, especially for obvious forgeries. The paper aims to create an adaptive system that only uses reasoning when necessary.

Method: Two-stage training: 1) Hybrid Fine-Tuning for cold-start initialization, 2) Online reinforcement learning with Hybrid-Reasoning Grouped Policy Optimization to learn when to select appropriate reasoning modes adaptively based on image characteristics.

Result: Fake-HR1 adaptively performs reasoning across different query types, surpassing existing LLMs in both reasoning ability and generative detection performance while significantly improving response efficiency.

Conclusion: The proposed adaptive hybrid-reasoning approach effectively balances detection accuracy and computational efficiency for synthetic image detection tasks.

Abstract: Recent studies have demonstrated that incorporating Chain-of-Thought (CoT) reasoning into the detection process can enhance a model’s ability to detect synthetic images. However, excessively lengthy reasoning incurs substantial resource overhead, including token consumption and latency, which is particularly redundant when handling obviously generated forgeries. To address this issue, we propose Fake-HR1, a large-scale hybrid-reasoning model that, to the best of our knowledge, is the first to adaptively determine whether reasoning is necessary based on the characteristics of the generative detection task. To achieve this, we design a two-stage training framework: we first perform Hybrid Fine-Tuning (HFT) for cold-start initialization, followed by online reinforcement learning with Hybrid-Reasoning Grouped Policy Optimization (HGRPO) to implicitly learn when to select an appropriate reasoning mode. Experimental results show that Fake-HR1 adaptively performs reasoning across different types of queries, surpassing existing LLMs in both reasoning ability and generative detection performance, while significantly improving response efficiency.

Gaurang Sharma, Harri Polonen, Juha Pajula, Jutta Suksi, Jussi Tohka

Main category: cs.CV

TL;DR: Skull-stripped brain MRIs can be linked across databases using simple image similarity methods, posing privacy risks despite regulatory safeguards

DetailsMotivation: To demonstrate that even after skull stripping and regulatory compliance, brain MRIs contain unique signatures that can be used to re-identify individuals across databases, challenging current privacy frameworks

Method: Used standard preprocessing followed by image similarity computation to match skull-stripped T1-weighted MRIs across different time intervals, scanner types, resolutions, and acquisition protocols

Result: Achieved nearly perfect linkage accuracy in matching data samples across various conditions, showing brain MRIs can be reliably linked even with cognitive decline

Conclusion: Brain MRI privacy risks are more significant than currently recognized, requiring updated policies for medical data sharing that account for these re-identification vulnerabilities

Abstract: Head Magnetic Resonance Imaging (MRI) is routinely collected and shared for research under strict regulatory frameworks. These frameworks require removing potential identifiers before sharing. However, even after skull stripping, the brain parenchyma contains unique signatures that can match other MRIs from the same participants across databases, posing a privacy risk if additional data features are available. Current regulatory frameworks often mandate evaluating such risks based on the assessment of a certain level of reasonableness. Prior studies have already suggested that a brain MRI could enable participant linkage, but they have relied on training-based or computationally intensive methods. Here, we demonstrate that linking an individual’s skull-stripped T1-weighted MRI, which may lead to re-identification if other identifiers are available, is possible using standard preprocessing followed by image similarity computation. Nearly perfect linkage accuracy was achieved in matching data samples across various time intervals, scanner types, spatial resolutions, and acquisition protocols, despite potential cognitive decline, simulating MRI matching across databases. These results aim to contribute meaningfully to the development of thoughtful, forward-looking policies in medical data sharing.

[197] Conformal Prediction Sets for Instance Segmentation

Kerri Lu, Dan M. Kluger, Stephen Bates, Sherrie Wang

Main category: cs.CV

TL;DR: Conformal prediction algorithm for instance segmentation that generates adaptive confidence sets with provable coverage guarantees for instance mask predictions.

DetailsMotivation: Current instance segmentation models lack proper uncertainty quantification - their outputs are not calibrated, and there's no guarantee that predicted masks are close to ground truth. This paper addresses the need for principled uncertainty quantification in instance segmentation tasks.

Method: Introduces a conformal prediction algorithm that generates adaptive confidence sets for instance segmentation. Given an image and pixel coordinate query, the algorithm produces confidence sets of instance predictions with provable guarantees for the probability that at least one prediction has high IoU with the true mask. Provides both asymptotic and finite sample guarantee versions.
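
The coverage guarantee rests on the generic split-conformal recipe: calibrate a quantile of nonconformity scores on held-out data, then include every candidate within that threshold. The sketch below shows this generic recipe with a placeholder `score_fn` (e.g., one minus a mask-quality estimate); the paper's actual scores and its IoU-based guarantee are more specific than this.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha=0.1):
    """Split-conformal calibration: with probability >= 1 - alpha,
    a fresh score falls at or below the returned threshold."""
    n = len(cal_scores)
    q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(cal_scores, q, method="higher")

def prediction_set(candidates, score_fn, qhat):
    """Keep every candidate instance prediction whose nonconformity
    score is within the calibrated threshold."""
    return [c for c in candidates if score_fn(c) <= qhat]
```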

Result: The algorithm’s prediction sets vary in size based on query difficulty and achieve target coverage, outperforming existing baselines like Learn Then Test, Conformal Risk Control, and morphological dilation-based methods. Applied successfully to agricultural field delineation, cell segmentation, and vehicle detection.

Conclusion: The proposed conformal prediction approach provides principled uncertainty quantification for instance segmentation with provable guarantees, addressing a key limitation of current models while maintaining practical applicability across diverse domains.

Abstract: Current instance segmentation models achieve high performance on average predictions, but lack principled uncertainty quantification: their outputs are not calibrated, and there is no guarantee that a predicted mask is close to the ground truth. To address this limitation, we introduce a conformal prediction algorithm to generate adaptive confidence sets for instance segmentation. Given an image and a pixel coordinate query, our algorithm generates a confidence set of instance predictions for that pixel, with a provable guarantee for the probability that at least one of the predictions has high Intersection-Over-Union (IoU) with the true object instance mask. We apply our algorithm to instance segmentation examples in agricultural field delineation, cell segmentation, and vehicle detection. Empirically, we find that our prediction sets vary in size based on query difficulty and attain the target coverage, outperforming existing baselines such as Learn Then Test, Conformal Risk Control, and morphological dilation-based methods. We provide versions of the algorithm with asymptotic and finite sample guarantees.

[198] Spatio-Temporal Attention for Consistent Video Semantic Segmentation in Automated Driving

Serin Varghese, Kevin Ross, Fabian Hueger, Kira Maag

Main category: cs.CV

TL;DR: STA extends transformer attention to incorporate multi-frame context for video semantic segmentation, improving temporal consistency and accuracy over single-frame baselines.

DetailsMotivation: Existing semantic segmentation models process video frames independently, failing to leverage temporal consistency which could improve accuracy and stability in dynamic scenes.

Method: Proposes Spatio-Temporal Attention (STA) mechanism that extends transformer attention blocks to incorporate multi-frame context, modifying standard self-attention to process spatio-temporal feature sequences while maintaining computational efficiency.
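
At its simplest, spatio-temporal attention merges the time axis into the token axis so that each query attends across all frames in the window. The sketch below is this naive form; the paper's STA presumably adds the efficiency-oriented restrictions it describes, which are omitted here.

```python
import torch
import torch.nn as nn

class SpatioTemporalAttention(nn.Module):
    """Extend per-frame self-attention to attend jointly over
    tokens from a short window of T frames."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, T, N, C) -- T frames, N spatial tokens per frame
        B, T, N, C = x.shape
        seq = x.reshape(B, T * N, C)       # merge time into the token axis
        out, _ = self.attn(seq, seq, seq)  # joint spatio-temporal attention
        return out.reshape(B, T, N, C)
```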

Result: Substantial improvements of 9.20 percentage points in temporal consistency metrics and up to 1.76 percentage points in mean intersection over union on Cityscapes and BDD100k datasets compared to single-frame baselines.

Conclusion: STA is an effective architectural enhancement for video-based semantic segmentation applications that demonstrates broad applicability across diverse transformer architectures.

Abstract: Deep neural networks, especially transformer-based architectures, have achieved remarkable success in semantic segmentation for environmental perception. However, existing models process video frames independently, thus failing to leverage temporal consistency, which could significantly improve both accuracy and stability in dynamic scenes. In this work, we propose a Spatio-Temporal Attention (STA) mechanism that extends transformer attention blocks to incorporate multi-frame context, enabling robust temporal feature representations for video semantic segmentation. Our approach modifies standard self-attention to process spatio-temporal feature sequences while maintaining computational efficiency and requiring minimal changes to existing architectures. STA demonstrates broad applicability across diverse transformer architectures and remains effective across both lightweight and larger-scale models. A comprehensive evaluation on the Cityscapes and BDD100k datasets shows substantial improvements of 9.20 percentage points in temporal consistency metrics and up to 1.76 percentage points in mean intersection over union compared to single-frame baselines. These results demonstrate STA as an effective architectural enhancement for video-based semantic segmentation applications.

[199] Can Image Splicing and Copy-Move Forgery Be Detected by the Same Model? Forensim: An Attention-Based State-Space Approach

Soumyaroop Nandi, Prem Natarajan

Main category: cs.CV

TL;DR: Forensim is an attention-based state-space framework for image forgery detection that jointly localizes both manipulated (target) and source regions, addressing limitations of traditional approaches that only detect forged areas without understanding duplication patterns.

DetailsMotivation: Traditional image forgery detection methods focus only on detecting manipulated regions using artifact cues, which can be misleading in scenarios like protest imagery where understanding duplication patterns (both source and target regions) is crucial for proper interpretation. Current approaches fail to capture the full context of forgeries.

Method: Forensim uses a visual state-space model with normalized attention maps to identify internal similarities in images. It incorporates a region-based block attention module to distinguish manipulated regions, enabling joint localization of source and target areas. The framework supports both splicing and copy-move forgeries within a unified architecture and allows end-to-end training.

Result: Forensim achieves state-of-the-art performance on standard benchmarks for image forgery detection. The authors also release CMFD-Anything, a new dataset addressing limitations of existing copy-move forgery datasets.

Conclusion: Forensim provides a comprehensive solution for image forgery detection by jointly localizing source and target regions, offering better context understanding than traditional methods. The unified architecture handles multiple forgery types and the new dataset addresses existing limitations in the field.

Abstract: We introduce Forensim, an attention-based state-space framework for image forgery detection that jointly localizes both manipulated (target) and source regions. Unlike traditional approaches that rely solely on artifact cues to detect spliced or forged areas, Forensim is designed to capture duplication patterns crucial for understanding context. In scenarios such as protest imagery, detecting only the forged region, for example a duplicated act of violence inserted into a peaceful crowd, can mislead interpretation, highlighting the need for joint source-target localization. Forensim outputs three-class masks (pristine, source, target) and supports detection of both splicing and copy-move forgeries within a unified architecture. We propose a visual state-space model that leverages normalized attention maps to identify internal similarities, paired with a region-based block attention module to distinguish manipulated regions. This design enables end-to-end training and precise localization. Forensim achieves state-of-the-art performance on standard benchmarks. We also release CMFD-Anything, a new dataset addressing limitations of existing copy-move forgery datasets.

[200] Causality in Video Diffusers is Separable from Denoising

Xingjian Bai, Guande He, Zhengqi Li, Eli Shechtman, Xun Huang, Zongze Wu

Main category: cs.CV

TL;DR: SCD decouples causal temporal reasoning from iterative denoising in video diffusion models, improving efficiency while maintaining quality.

DetailsMotivation: Current causal diffusion models entangle temporal reasoning with iterative denoising, applying causal attention across all layers at every denoising step, leading to computational redundancy and inefficiency.

Method: Separable Causal Diffusion (SCD) architecture with two components: (1) causal transformer encoder for once-per-frame temporal reasoning, and (2) lightweight diffusion decoder for multi-step frame-wise rendering.
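
The separation can be written as a two-module skeleton: a causally masked encoder that runs once over the frame sequence, and a small decoder reused at every denoising step. Module shapes and the conditioning below are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SeparableCausalDiffusion(nn.Module):
    """Skeleton of the decoupling: temporal reasoning once per frame,
    lightweight per-frame denoising at every diffusion step."""
    def __init__(self, dim, layers=6):
        super().__init__()
        block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.causal_encoder = nn.TransformerEncoder(block, num_layers=layers)
        self.denoiser = nn.Sequential(                 # reused every step
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def encode(self, frames):                          # frames: (B, T, dim)
        mask = nn.Transformer.generate_square_subsequent_mask(frames.size(1))
        return self.causal_encoder(frames, mask=mask)  # run once per frame

    def denoise(self, noisy, ctx, t_embed):            # called per step
        return self.denoiser(torch.cat([noisy + t_embed, ctx], dim=-1))
```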

Result: SCD significantly improves throughput and per-frame latency while matching or surpassing generation quality of strong causal diffusion baselines on both synthetic and real benchmarks.

Conclusion: Causal reasoning in video diffusion models is separable from the denoising process, enabling more efficient architectures without sacrificing generation quality.

Abstract: Causality – referring to temporal, uni-directional cause-effect relationships between components – underlies many complex generative processes, including videos, language, and robot trajectories. Current causal diffusion models entangle temporal reasoning with iterative denoising, applying causal attention across all layers, at every denoising step, and over the entire context. In this paper, we show that the causal reasoning in these models is separable from the multi-step denoising process. Through systematic probing of autoregressive video diffusers, we uncover two key regularities: (1) early layers produce highly similar features across denoising steps, indicating redundant computation along the diffusion trajectory; and (2) deeper layers exhibit sparse cross-frame attention and primarily perform intra-frame rendering. Motivated by these findings, we introduce Separable Causal Diffusion (SCD), a new architecture that explicitly decouples once-per-frame temporal reasoning, via a causal transformer encoder, from multi-step frame-wise rendering, via a lightweight diffusion decoder. Extensive experiments on both pretraining and post-training tasks across synthetic and real benchmarks show that SCD significantly improves throughput and per-frame latency while matching or surpassing the generation quality of strong causal diffusion baselines.

[201] 4RC: 4D Reconstruction via Conditional Querying Anytime and Anywhere

Yihang Luo, Shangchen Zhou, Yushi Lan, Xingang Pan, Chen Change Loy

Main category: cs.CV

TL;DR: 4RC is a unified feed-forward framework for 4D reconstruction from monocular videos that jointly captures dense scene geometry and motion dynamics using a novel encode-once, query-anywhere paradigm.

DetailsMotivation: Existing approaches for 4D reconstruction typically decouple motion from geometry or produce limited 4D attributes like sparse trajectories or two-view scene flow, lacking a holistic representation of both dense geometry and motion dynamics.

Method: 4RC introduces an encode-once, query-anywhere paradigm where a transformer backbone encodes the entire video into a compact spatio-temporal latent space. A conditional decoder can then query 3D geometry and motion for any frame at any timestamp. Per-view 4D attributes are represented in a minimally factorized form by decomposing into base geometry and time-dependent relative motion.

Result: Extensive experiments show that 4RC outperforms prior and concurrent methods across a wide range of 4D reconstruction tasks.

Conclusion: 4RC provides a unified framework for holistic 4D reconstruction that jointly captures dense scene geometry and motion dynamics from monocular videos.

Abstract: We present 4RC, a unified feed-forward framework for 4D reconstruction from monocular videos. Unlike existing approaches that typically decouple motion from geometry or produce limited 4D attributes such as sparse trajectories or two-view scene flow, 4RC learns a holistic 4D representation that jointly captures dense scene geometry and motion dynamics. At its core, 4RC introduces a novel encode-once, query-anywhere and anytime paradigm: a transformer backbone encodes the entire video into a compact spatio-temporal latent space, from which a conditional decoder can efficiently query 3D geometry and motion for any query frame at any target timestamp. To facilitate learning, we represent per-view 4D attributes in a minimally factorized form by decomposing them into base geometry and time-dependent relative motion. Extensive experiments demonstrate that 4RC outperforms prior and concurrent methods across a wide range of 4D reconstruction tasks.

[202] VideoWorld 2: Learning Transferable Knowledge from Real-world Videos

Zhongwei Ren, Yunchao Wei, Xiao Yu, Guixun Luo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin

Main category: cs.CV

TL;DR: VideoWorld 2 learns transferable world knowledge from raw real-world videos using a dynamic-enhanced Latent Dynamics Model that decouples action dynamics from visual appearance, enabling improved task performance on real-world handcraft making and robotics manipulation tasks.

DetailsMotivation: To enable intelligent agents to learn transferable knowledge directly from unlabeled real-world video data and apply it in new environments, overcoming limitations of prior video generation and latent-dynamics models that struggle with real-world tasks.

Method: Introduces dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: uses pretrained video diffusion model for visual appearance modeling while dLDM learns compact, meaningful task-related latent codes, which are then modeled autoregressively for task policies and long-horizon reasoning.
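
A rough sketch of the dLDM split as described: a frozen model stands in for the pretrained video diffuser handling appearance, while a small recurrent encoder produces compact dynamics codes that a separate head models autoregressively. Every module here is an illustrative stand-in.

```python
# Sketch: decouple appearance (frozen) from task dynamics (learned codes).
import torch
import torch.nn as nn

appearance = nn.Conv2d(3, 3, 3, padding=1)     # stand-in for the frozen
for p in appearance.parameters():              # pretrained video diffuser
    p.requires_grad = False

dynamics_enc = nn.GRU(input_size=64, hidden_size=32, batch_first=True)
policy = nn.Linear(32, 32)                     # autoregressive latent model

frames = torch.randn(2, 3, 8, 8)
recon = appearance(frames)                     # appearance path, frozen
frame_feats = torch.randn(2, 10, 64)           # per-frame features
codes, _ = dynamics_enc(frame_feats)           # compact task-related dynamics
next_code = policy(codes[:, -1])               # predict the next action code
```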

Result: Achieves up to 70% improvement in task success rate on real-world handcraft making tasks, produces coherent long execution videos, and shows effective manipulation knowledge acquisition from Open-X dataset that substantially improves task performance on CALVIN robotics benchmark.

Conclusion: Demonstrates the potential of learning transferable world knowledge directly from raw videos, with promising applications in real-world tasks and robotics, and will open-source all code, data, and models.

Abstract: Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents. This work presents VideoWorld 2, which extends VideoWorld and offers the first investigation into learning transferable knowledge directly from raw real-world videos. At its core, VideoWorld 2 introduces a dynamic-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles visual appearance modeling, enabling the dLDM to learn latent codes that focus on compact and meaningful task-related dynamics. These latent codes are then modeled autoregressively to learn task policies and support long-horizon reasoning. We evaluate VideoWorld 2 on challenging real-world handcraft making tasks, where prior video generation and latent-dynamics models struggle to operate reliably. Remarkably, VideoWorld 2 achieves up to 70% improvement in task success rate and produces coherent long execution videos. In robotics, we show that VideoWorld 2 can acquire effective manipulation knowledge from the Open-X dataset, which substantially improves task performance on CALVIN. This study reveals the potential of learning transferable world knowledge directly from raw videos, with all code, data, and models to be open-sourced for further research.

[203] Olaf-World: Orienting Latent Actions for Video World Modeling

Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou

Main category: cs.CV

TL;DR: SeqΔ-REPA learns latent actions from unlabeled video by aligning integrated latent actions with temporal feature differences, enabling better action transfer across contexts.

DetailsMotivation: Scaling action-controllable world models is limited by scarce action labels. Latent action learning from unlabeled video often fails to transfer across contexts due to entanglement of scene-specific cues and lack of shared coordinate system.

Method: Introduces SeqΔ-REPA, a sequence-level control-effect alignment objective that anchors integrated latent actions to temporal feature differences from a frozen self-supervised video encoder. Presents Olaf-World pipeline for pretraining action-conditioned video world models from large-scale passive video.
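
A minimal loss sketch of my reading of the sequence-level alignment: latent actions integrated over a clip are regressed onto the feature difference of a frozen encoder. The projection, cosine objective, and shapes are assumptions, not the released objective.

```python
# Sketch: align integrated latent actions with the observable semantic effect
# (feature difference) measured by a frozen self-supervised video encoder.
import torch
import torch.nn.functional as F

def seq_delta_repa_loss(latent_actions, frozen_feats, proj):
    """latent_actions: (B, T-1, d_a) inferred between consecutive frames.
    frozen_feats:   (B, T, d_f) from a frozen self-supervised encoder.
    proj:           maps integrated actions into the frozen feature space."""
    integrated = latent_actions.sum(dim=1)             # integrate over the clip
    effect = frozen_feats[:, -1] - frozen_feats[:, 0]  # observable effect
    return 1 - F.cosine_similarity(proj(integrated), effect, dim=-1).mean()

proj = torch.nn.Linear(32, 256)
loss = seq_delta_repa_loss(torch.randn(4, 7, 32), torch.randn(4, 8, 256), proj)
```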

Result: Extensive experiments show the method learns more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.

Conclusion: The approach successfully addresses the cross-context transfer problem in latent action learning by using observable semantic effects as shared reference, enabling better scaling of action-controllable world models.

Abstract: Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce Seq$\Delta$-REPA, a sequence-level control-effect alignment objective that anchors integrated latent actions to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.

[204] ConsID-Gen: View-Consistent and Identity-Preserving Image-to-Video Generation

Mingyang Wu, Ashirbad Mishra, Soumik Dey, Shuo Xing, Naveen Ravipati, Hansi Wu, Binbin Li, Zhengzhong Tu

Main category: cs.CV

TL;DR: ConsID-Gen: A view-assisted image-to-video generation framework that uses auxiliary views and dual-stream encoding to preserve object identity and geometric consistency when animating static images with text instructions.

DetailsMotivation: Existing image-to-video (I2V) models suffer from appearance drift and geometric distortion when animating static images, especially when viewpoints change. This is due to sparse single-view 2D observations and weak cross-modal alignment between text and visual inputs.

Method: 1) Created ConsIDVid dataset with high-quality, temporally aligned videos and ConsIDVid-Bench for multi-view consistency evaluation. 2) Proposed ConsID-Gen framework that augments input image with unposed auxiliary views, uses dual-stream visual-geometric encoder for semantic and structural cues, and text-visual connector for unified conditioning of Diffusion Transformer backbone.
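
A compact sketch of the view-assisted conditioning path, under stated assumptions: features from unposed auxiliary views are fused with the first-frame stream before conditioning the backbone. Module shapes and the mean-pooling of views are illustrative choices.

```python
# Sketch: dual-stream fusion of first-frame and auxiliary-view features.
import torch
import torch.nn as nn

visual_stream = nn.Linear(128, 256)      # semantic cues from the first frame
geom_stream = nn.Linear(128, 256)        # structural cues from auxiliary views
connector = nn.Linear(256 + 256, 256)    # stand-in for the text-visual connector

first_frame = torch.randn(1, 128)
aux_views = torch.randn(1, 4, 128)       # four unposed auxiliary views
fused = torch.cat([visual_stream(first_frame),
                   geom_stream(aux_views.mean(dim=1))], dim=-1)
conditioning = connector(fused)          # fed to the Diffusion Transformer
```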

Result: ConsID-Gen outperforms leading video generation models (Wan2.1, HunyuanVideo) on ConsIDVid-Bench metrics, achieving superior identity fidelity and temporal coherence in challenging real-world scenarios.

Conclusion: The proposed view-assisted approach with auxiliary views and dual-stream encoding effectively addresses identity preservation challenges in image-to-video generation, enabling better multi-view consistency and object identity fidelity.

Abstract: Image-to-Video generation (I2V) animates a static image into a temporally coherent video sequence following textual instructions, yet preserving fine-grained object identity under changing viewpoints remains a persistent challenge. Unlike text-to-video models, existing I2V pipelines often suffer from appearance drift and geometric distortion, artifacts we attribute to the sparsity of single-view 2D observations and weak cross-modal alignment. Here we address this problem from both data and model perspectives. First, we curate ConsIDVid, a large-scale object-centric dataset built with a scalable pipeline for high-quality, temporally aligned videos, and establish ConsIDVid-Bench, where we present a novel benchmarking and evaluation framework for multi-view consistency using metrics sensitive to subtle geometric and appearance deviations. We further propose ConsID-Gen, a view-assisted I2V generation framework that augments the first frame with unposed auxiliary views and fuses semantic and structural cues via a dual-stream visual-geometric encoder as well as a text-visual connector, yielding unified conditioning for a Diffusion Transformer backbone. Experiments across ConsIDVid-Bench demonstrate that ConsID-Gen consistently outperforms competing methods on multiple metrics, with the best overall performance surpassing leading video generation models like Wan2.1 and HunyuanVideo, delivering superior identity fidelity and temporal coherence under challenging real-world scenarios. We will release our model and dataset at https://myangwu.github.io/ConsID-Gen.

[205] Quantum Multiple Rotation Averaging

Shuteng Wang, Natacha Kuete Meli, Michael Möller, Vladislav Golyanik

Main category: cs.CV

TL;DR: IQARS is a quantum annealing approach for multiple rotation averaging that reformulates the problem as local quadratic sub-problems for quantum hardware, achieving better accuracy than classical methods despite current hardware limitations.

DetailsMotivation: Classical methods for multiple rotation averaging (MRA) like L1-IRLS and Shonan have limitations including susceptibility to local minima and reliance on convex relaxations that don't preserve the exact manifold geometry, leading to reduced accuracy in high-noise scenarios.

Method: IQARS reformulates MRA as a sequence of local quadratic non-convex sub-problems that can be executed on quantum annealers after binarization. It removes dependence on convex relaxations, better preserves non-Euclidean rotation manifold geometry, and leverages quantum tunneling and parallelism for solution space exploration.
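
To make the binarized sub-problem concrete, here is a toy QUBO with an exhaustive solver standing in for the quantum annealer; the matrix values and discretization are invented for illustration, not the paper's exact formulation.

```python
# Sketch: minimize a binarized local quadratic sub-problem, as an annealer would.
import itertools
import numpy as np

def solve_local_qubo(Q):
    """Minimize x^T Q x over binary x by brute force (annealer stand-in)."""
    n = Q.shape[0]
    best_x, best_e = None, np.inf
    for bits in itertools.product([0, 1], repeat=n):
        x = np.array(bits)
        e = x @ Q @ x
        if e < best_e:
            best_x, best_e = x, e
    return best_x, best_e

# Tiny QUBO: pick an angle correction encoded in 3 bits for one rotation node.
Q = np.array([[ 1.0, -0.5,  0.2],
              [-0.5,  2.0, -0.3],
              [ 0.2, -0.3,  0.5]])
x, energy = solve_local_qubo(Q)
print(x, energy)
```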

Result: On D-Wave annealers, IQARS achieves approximately 12% higher accuracy than Shonan (the best-performing classical method evaluated), despite current quantum annealers being in their nascent phase with limited scale and constrained performance.

Conclusion: IQARS demonstrates the potential of quantum annealing for 3D vision optimization problems, showing improved accuracy over classical methods even with current hardware limitations, suggesting promising future applications as quantum hardware matures.

Abstract: Multiple rotation averaging (MRA) is a fundamental optimization problem in 3D vision and robotics that aims to recover globally consistent absolute rotations from noisy relative measurements. Established classical methods, such as L1-IRLS and Shonan, face limitations including local minima susceptibility and reliance on convex relaxations that fail to preserve the exact manifold geometry, leading to reduced accuracy in high-noise scenarios. We introduce IQARS (Iterative Quantum Annealing for Rotation Synchronization), the first algorithm that reformulates MRA as a sequence of local quadratic non-convex sub-problems executable on quantum annealers after binarization, to leverage inherent hardware advantages. IQARS removes convex relaxation dependence and better preserves non-Euclidean rotation manifold geometry while leveraging quantum tunneling and parallelism for efficient solution space exploration. We evaluate IQARS’s performance on synthetic and real-world datasets. While current annealers remain in their nascent phase and only support solving problems of limited scale with constrained performance, we observed that IQARS on D-Wave annealers can already achieve ca. 12% higher accuracy than Shonan, i.e., the best-performing classical method evaluated empirically.

[206] SAGE: Scalable Agentic 3D Scene Generation for Embodied AI

Hongchi Xia, Xuan Li, Zhaoshuo Li, Qianli Ma, Jiashu Xu, Ming-Yu Liu, Yin Cui, Tsung-Yi Lin, Wei-Chiu Ma, Shenlong Wang, Shuran Song, Fangyin Wei

Main category: cs.CV

TL;DR: SAGE is an agentic framework that automatically generates simulation-ready 3D environments for embodied AI tasks through iterative reasoning and tool selection.

DetailsMotivation: Real-world data collection for embodied agents is costly and unsafe, while existing scene-generation systems produce artifacts and physically invalid scenes, creating a need for scalable, realistic simulator-ready environments.

Method: Agentic framework that couples multiple generators for layout and object composition with critics evaluating semantic plausibility, visual realism, and physical stability. Uses iterative reasoning and adaptive tool selection to self-refine scenes until meeting user intent and physical validity.
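
The generator-critic refinement loop can be summarized with a toy sketch like the one below; the critic scores, threshold, and scene dictionary are illustrative stand-ins, not SAGE's interfaces.

```python
# Sketch: regenerate scenes until the critics' scores pass a threshold.
import random

def generate_scene(task):
    return {"task": task, "objects": ["bowl", "table"],
            "plausibility": random.random(),
            "realism": random.random(),
            "stability": random.random()}

def critique(scene):
    # Critics score semantic plausibility, visual realism, physical stability.
    return min(scene["plausibility"], scene["realism"], scene["stability"])

def sage_loop(task, max_iters=10, threshold=0.8):
    scene = generate_scene(task)
    for _ in range(max_iters):
        if critique(scene) >= threshold:   # meets user intent + validity
            return scene
        scene = generate_scene(task)       # self-refine / regenerate
    return scene

scene = sage_loop("pick up a bowl and place it on the table")
```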

Result: Generated environments are realistic, diverse, and directly deployable in modern simulators. Policies trained on this data show clear scaling trends and generalize to unseen objects and layouts.

Conclusion: SAGE demonstrates promise for simulation-driven scaling in embodied AI, enabling scalable environment generation for policy training without real-world data collection.

Abstract: Real-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes. We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., “pick up a bowl and place it on the table”), understands the intent and automatically generates simulation-ready environments at scale. The agent couples multiple generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability. Through iterative reasoning and adaptive tool selection, it self-refines the scenes until meeting user intent and physical validity. The resulting environments are realistic, diverse, and directly deployable in modern simulators for policy training. Policies trained purely on this data exhibit clear scaling trends and generalize to unseen objects and layouts, demonstrating the promise of simulation-driven scaling for embodied AI. Code, demos, and the SAGE-10k dataset can be found on the project page here: https://nvlabs.github.io/sage.

[207] Deep Learning-Based Object Pose Estimation: A Comprehensive Survey

Jian Liu, Wei Sun, Hui Yang, Zhiwen Zeng, Chongpei Liu, Jin Zheng, Xingyu Liu, Hossein Rahmani, Nicu Sebe, Ajmal Mian

Main category: cs.CV

TL;DR: Survey paper on deep learning-based object pose estimation covering instance-level, category-level, and unseen object pose estimation across multiple data modalities and applications.

DetailsMotivation: To provide a comprehensive survey of recent advances in deep learning-based object pose estimation, addressing the lack of recent surveys covering all problem formulations, challenges, and future directions in this rapidly evolving field.

Method: Systematic literature review covering three main problem formulations (instance-level, category-level, unseen object pose estimation), multiple input data modalities, degrees-of-freedom, object properties, training paradigms, inference modes, and evaluation metrics.

Result: Comprehensive survey that organizes the field, reports state-of-the-art performance on benchmark datasets, identifies key challenges, reviews current trends, and provides guidance for method selection and future research directions.

Conclusion: The survey fills an important gap in the literature by providing a holistic understanding of deep learning-based object pose estimation, facilitating method selection, and identifying promising research directions for addressing current challenges.

Abstract: Object pose estimation is a fundamental computer vision problem with broad applications in augmented reality and robotics. Over the past decade, deep learning models, due to their superior accuracy and robustness, have increasingly supplanted conventional algorithms reliant on engineered point pair features. Nevertheless, several challenges persist in contemporary methods, including their dependency on labeled training data, model compactness, robustness under challenging conditions, and their ability to generalize to novel unseen objects. A recent survey discussing the progress made on different aspects of this area, its outstanding challenges, and promising future directions is missing. To fill this gap, we discuss the recent advances in deep learning-based object pose estimation, covering all three formulations of the problem, i.e., instance-level, category-level, and unseen object pose estimation. Our survey also covers multiple input data modalities, degrees-of-freedom of output poses, object properties, and downstream tasks, providing the readers with a holistic understanding of this field. Additionally, it discusses training paradigms of different domains, inference modes, application areas, evaluation metrics, and benchmark datasets, as well as reports the performance of current state-of-the-art methods on these benchmarks, thereby facilitating the readers in selecting the most suitable method for their application. Finally, the survey identifies key challenges, reviews the prevailing trends along with their pros and cons, and identifies promising directions for future research. We also keep tracking the latest works at https://github.com/CNJianLiu/Awesome-Object-Pose-Estimation.

[208] MpoxSLDNet: A Novel CNN Model for Detecting Monkeypox Lesions and Performance Comparison with Pre-trained Models

Fatema Jannat Dihan, Saydul Akbar Murad

Main category: cs.CV

TL;DR: A lightweight CNN model called MpoxSLDNet is proposed for detecting monkeypox skin lesions from digital images, achieving high accuracy with significantly reduced storage requirements compared to traditional pre-trained models.

DetailsMotivation: Early detection of monkeypox lesions is crucial but challenging due to similarity with other skin diseases. Existing deep learning models require high storage space, making them impractical for resource-constrained healthcare settings.

Method: Proposes MpoxSLDNet, a CNN model designed for monkeypox lesion detection using a dataset of 1428 monkeypox lesion images and 1764 non-monkeypox lesion images. The model is optimized for storage efficiency while maintaining detection performance.
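
The paper's summary does not publish the exact topology, so the block below is only a guess at what a storage-efficient lesion classifier of this kind looks like; layer sizes are invented.

```python
# Sketch of a small, storage-efficient binary lesion classifier.
import torch
import torch.nn as nn

class SmallLesionNet(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = SmallLesionNet()
print(sum(p.numel() for p in model.parameters()))  # far smaller than VGG16
logits = model(torch.rand(4, 3, 224, 224))
```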

Result: MpoxSLDNet achieved 94.56% validation accuracy, outperforming VGG16 (86.25%), DenseNet121 (84.38%), and ResNet50 (67.19%). The model also requires significantly less storage space than traditional models.

Conclusion: MpoxSLDNet provides a practical solution for early monkeypox detection in resource-limited settings by balancing high accuracy with low storage requirements, though dataset limitations may affect generalization.

Abstract: Monkeypox virus (MPXV) is a zoonotic virus that poses a significant threat to public health, particularly in remote parts of Central and West Africa. Early detection of monkeypox lesions is crucial for effective treatment. However, due to its similarity with other skin diseases, monkeypox lesion detection is a challenging task. To detect monkeypox, many researchers used various deep-learning models such as MobileNetV2, VGG16, ResNet50, InceptionV3, DenseNet121, EfficientNetB3, and Xception. However, these models often require high storage space due to their large size. This study aims to address these challenges by introducing a CNN model named MpoxSLDNet (Monkeypox Skin Lesion Detector Network) to facilitate early detection and categorization of Monkeypox lesions and Non-Monkeypox lesions in digital images. Our model represents a significant advancement in the field of monkeypox lesion detection by offering superior performance metrics, including precision, recall, F1-score, accuracy, and AUC, compared to traditional pre-trained models such as VGG16, ResNet50, and DenseNet121. The key novelty of our approach lies in MpoxSLDNet’s ability to achieve high detection accuracy while requiring significantly less storage space than existing models. By addressing the challenge of high storage requirements, MpoxSLDNet presents a practical solution for early detection and categorization of monkeypox lesions in resource-constrained healthcare settings. In this study, we have used the “Monkeypox Skin Lesion Dataset” comprising 1428 skin images of monkeypox lesions and 1764 skin images of Non-Monkeypox lesions. The dataset’s limitations could potentially impact the model’s ability to generalize to unseen cases. However, the MpoxSLDNet model achieved a validation accuracy of 94.56%, compared to 86.25%, 84.38%, and 67.19% for VGG16, DenseNet121, and ResNet50, respectively.

[209] RS-Agent: Automating Remote Sensing Tasks through Intelligent Agent

Wenjia Xu, Zijian Yu, Boyang Mu, Zhiwei Wei, Yuanben Zhang, Guangzuo Li, Jiuniu Wang, Mugen Peng

Main category: cs.CV

TL;DR: RS-Agent: An AI agent framework for remote sensing applications that integrates LLMs with specialized tools and domain knowledge to perform complex remote sensing tasks beyond basic MLLM capabilities.

DetailsMotivation: Current MLLMs are limited to basic instruction-following and descriptive tasks, but real-world remote sensing applications require specialized tools and domain knowledge for complex analysis tasks.

Method: RS-Agent integrates four components: Central Controller (LLM-based), dynamic toolkit, Solution Space (task-specific expert guidance), and Knowledge Space (domain-level reasoning). Introduces Task-Aware Retrieval for expert-guided tool selection and DualRAG for weighted dual-path knowledge retrieval.
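
As one reading of the weighted, dual-path retrieval, the sketch below scores documents from two corpora and merges them with a fixed weight; the weighting scheme and corpora are illustrative assumptions, not the published DualRAG.

```python
# Sketch: weighted dual-path retrieval over Solution and Knowledge spaces.
import numpy as np

def dual_rag(query_vec, solution_space, knowledge_space, alpha=0.6, k=3):
    """Cosine-score two corpora and merge with a fixed weight alpha."""
    def scores(corpus):
        return corpus @ query_vec / (np.linalg.norm(corpus, axis=1)
                                     * np.linalg.norm(query_vec) + 1e-8)
    s_sol = alpha * scores(solution_space)         # task-specific guidance
    s_knw = (1 - alpha) * scores(knowledge_space)  # domain-level reasoning
    merged = np.concatenate([s_sol, s_knw])
    # Indices into the concatenated pool (first solution, then knowledge).
    return np.argsort(merged)[::-1][:k]

hits = dual_rag(np.random.rand(64), np.random.rand(10, 64), np.random.rand(20, 64))
```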

Result: Achieves over 95% task planning accuracy and superior performance across 9 datasets and 18 remote sensing tasks including scene classification, object counting, and remote sensing VQA, outperforming state-of-the-art MLLMs.

Conclusion: RS-Agent provides a robust, extensible framework for intelligent automation in remote sensing analysis, demonstrating that specialized AI agents can overcome limitations of general-purpose MLLMs in domain-specific applications.

Abstract: The unprecedented advancements in Multimodal Large Language Models (MLLMs) have demonstrated strong potential in interacting with humans through both language and visual inputs to perform downstream tasks such as visual question answering and scene understanding. However, these models are constrained to basic instruction-following or descriptive tasks, facing challenges in complex real-world remote sensing applications that require specialized tools and knowledge. To address these limitations, we propose RS-Agent, an AI agent designed to interact with human users and autonomously leverage specialized models to address the demands of real-world remote sensing applications. RS-Agent integrates four key components: a Central Controller based on large language models, a dynamic toolkit for tool execution, a Solution Space for task-specific expert guidance, and a Knowledge Space for domain-level reasoning, enabling it to interpret user queries and orchestrate tools for accurate remote sensing tasks. We introduce two novel mechanisms: Task-Aware Retrieval, which improves tool selection accuracy through expert-guided planning, and DualRAG, a retrieval-augmented generation method that enhances knowledge relevance through weighted, dual-path retrieval. RS-Agent supports flexible integration of new tools and is compatible with both open-source and proprietary LLMs. Extensive experiments across 9 datasets and 18 remote sensing tasks demonstrate that RS-Agent significantly outperforms state-of-the-art MLLMs, achieving over 95% task planning accuracy and delivering superior performance in tasks such as scene classification, object counting, and remote sensing visual question answering. Our work presents RS-Agent as a robust and extensible framework for advancing intelligent automation in remote sensing analysis.

[210] Story-Iter: A Training-free Iterative Paradigm for Long Story Visualization

Jiawei Mao, Xiaoke Huang, Yunfei Xie, Yuanqi Chang, Mude Hui, Bingjie Xu, Zeyu Zheng, Zirui Wang, Cihang Xie, Yuyin Zhou

Main category: cs.CV

TL;DR: Story-Iter is a training-free iterative paradigm for long-story generation that uses a novel external iterative approach with a plug-and-play global reference cross-attention module to ensure semantic consistency across up to 100 frames.

DetailsMotivation: Existing methods for story visualization rely on fixed reference images and lack mechanisms for maintaining semantic consistency in long sequences. There's a need for approaches that can handle long-story generation (up to 100 frames) while ensuring both semantic consistency and fine-grained interactions across the entire sequence.

Method: Proposes Story-Iter with two key components: 1) An external iterative paradigm that extends beyond diffusion model’s internal denoising steps, continuously refining each generated image by incorporating all reference images from previous rounds, and 2) A training-free global reference cross-attention (GRCA) module that models all reference frames with global embeddings to ensure semantic consistency in long sequences.
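
A compact sketch of the external iteration with a global reference cross-attention; the modules, pooling, and update rule are illustrative, not the released implementation.

```python
# Sketch: each round, every frame attends over global embeddings of ALL
# frames from the previous round before being re-generated.
import torch
import torch.nn as nn

grca = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

def refine_round(frames, generate):
    """frames: (N, L, 64) embeddings of all frames from the previous round."""
    refs = frames.mean(dim=1, keepdim=True).transpose(0, 1)  # (1, N, 64) refs
    new_frames = []
    for i in range(frames.size(0)):
        q = frames[i:i + 1]                  # (1, L, 64) current frame
        ctx, _ = grca(q, refs, refs)         # attend over all global references
        new_frames.append(generate(q + ctx)) # re-generate with global context
    return torch.cat(new_frames, dim=0)

frames = torch.randn(5, 16, 64)              # 5 story frames
for _ in range(3):                           # external refinement rounds
    frames = refine_round(frames, generate=lambda x: x)
```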

Result: Extensive experiments on official story visualization datasets and a new long story benchmark demonstrate state-of-the-art performance in long-story visualization (up to 100 frames), excelling in both semantic consistency and fine-grained interactions.

Conclusion: Story-Iter provides an effective training-free solution for long-story generation that maintains semantic consistency across extended sequences through iterative refinement and global reference modeling, addressing limitations of existing fixed-reference approaches.

Abstract: This paper introduces Story-Iter, a new training-free iterative paradigm to enhance long-story generation. Unlike existing methods that rely on fixed reference images to construct a complete story, our approach features a novel external iterative paradigm, extending beyond the internal iterative denoising steps of diffusion models, to continuously refine each generated image by incorporating all reference images from the previous round. To achieve this, we propose a plug-and-play, training-free global reference cross-attention (GRCA) module, modeling all reference frames with global embeddings, ensuring semantic consistency in long sequences. By progressively incorporating holistic visual context and text constraints, our iterative paradigm enables precise generation with fine-grained interactions, optimizing the story visualization step-by-step. Extensive experiments on the official story visualization dataset and our long story benchmark demonstrate Story-Iter’s state-of-the-art performance in long-story visualization (up to 100 frames), excelling in both semantic consistency and fine-grained interactions.

[211] Constant Rate Scheduling: A General Framework for Optimizing Diffusion Noise Schedule via Distributional Change

Shuntaro Okada, Kenji Doi, Ryota Yoshihashi, Hirokatsu Kataoka, Tomohiro Tanaka

Main category: cs.CV

TL;DR: A framework for optimizing noise schedules in diffusion models that enforces constant rate of distributional change throughout the diffusion process using user-defined discrepancy measures.

DetailsMotivation: Current diffusion models use heuristic noise schedules that may not be optimal. The paper aims to develop a principled framework for optimizing noise schedules in both training and sampling phases to improve diffusion model performance.

Method: Proposes a framework that enforces constant rate of change in the probability distribution of diffused data throughout diffusion process. Introduces three user-defined discrepancy measures to quantify distributional change that can be flexibly selected or combined based on domain and model architecture.
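
Numerically, the constant-rate idea amounts to inverting the cumulative discrepancy curve so that each step advances it by an equal amount. The sketch below uses a made-up placeholder measure; only the inversion-by-interpolation step reflects the described framework.

```python
# Sketch: place timesteps so each step advances the discrepancy D equally.
import numpy as np

def constant_rate_schedule(D, n_steps, t_grid):
    """D: cumulative discrepancy at each t in t_grid (monotone increasing)."""
    targets = np.linspace(D[0], D[-1], n_steps + 1)
    return np.interp(targets, D, t_grid)     # invert D(t) by interpolation

t_grid = np.linspace(0.0, 1.0, 1000)
D = np.sqrt(t_grid)                          # placeholder discrepancy curve
ts = constant_rate_schedule(D, n_steps=10, t_grid=t_grid)
# Consecutive ts now yield equal increments of D, i.e. a constant rate of
# distributional change along the diffusion trajectory.
```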

Result: The method consistently improves performance of both pixel-space and latent-space diffusion models across various datasets, samplers, and function evaluations (5-250 steps). Achieves state-of-the-art FID score of 2.03 on LSUN Horse 256×256 when applied to both training and sampling schedules.

Conclusion: Provides a general-purpose scheduling framework for diffusion models that empirically improves performance across diverse settings, establishing a principled approach to noise schedule optimization.

Abstract: We propose a general framework for optimizing noise schedules in diffusion models, applicable to both training and sampling. Our method enforces a constant rate of change in the probability distribution of diffused data throughout the diffusion process, where the rate of change is quantified using a user-defined discrepancy measure. We introduce three such measures, which can be flexibly selected or combined depending on the domain and model architecture. While our framework is inspired by theoretical insights, we do not aim to provide a complete theoretical justification of how distributional change affects sample quality. Instead, we focus on establishing a general-purpose scheduling framework and validating its empirical effectiveness. Through extensive experiments, we demonstrate that our approach consistently improves the performance of both pixel-space and latent-space diffusion models, across various datasets, samplers, and a wide range of numbers of function evaluations from 5 to 250. In particular, when applied to both training and sampling schedules, our method achieves a state-of-the-art FID score of 2.03 on LSUN Horse 256$\times$256, without compromising mode coverage.

[212] Graph-Based Multimodal and Multi-view Alignment for Keystep Recognition

Julia Lee Romero, Kyle Min, Subarna Tripathi, Morteza Karimzadeh

Main category: cs.CV

TL;DR: Graph-based framework for fine-grained keystep recognition in egocentric videos using multimodal features and exocentric video alignment

DetailsMotivation: Egocentric videos present challenges for keystep recognition due to dynamic backgrounds, frequent motion, and occlusions. Existing methods struggle with these issues and need better ways to leverage long-term dependencies and multimodal information.

Method: Proposes a flexible graph-learning framework where each video clip is a node. During training, includes exocentric video clips as additional nodes. Examines various connection strategies between nodes and treats keystep recognition as node classification. Also explores multimodal features including narrations, depth, and object class labels in a heterogeneous graph.
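
A minimal sketch of posing keystep recognition as node classification: each clip embedding is a node, with edges between temporal neighbours and ego-exo pairs. The connection rule and one-step propagation below are one illustrative choice, not the paper's exact strategy.

```python
# Sketch: clips as graph nodes, keystep recognition as node classification.
import torch
import torch.nn as nn

def build_adjacency(n_ego, n_exo):
    n = n_ego + n_exo
    A = torch.eye(n)
    for i in range(n_ego - 1):               # temporal edges between ego clips
        A[i, i + 1] = A[i + 1, i] = 1.0
    for j in range(min(n_ego, n_exo)):       # align ego clip with exo clip
        A[j, n_ego + j] = A[n_ego + j, j] = 1.0
    return A / A.sum(dim=1, keepdim=True)    # row-normalize

classifier = nn.Linear(32, 5)                # 5 keystep classes, toy sizes
X = torch.randn(8 + 8, 32)                   # 8 ego + 8 exo clip embeddings
A = build_adjacency(8, 8)
logits = classifier(A @ X)                   # one propagation step, then classify
```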

Result: Outperforms existing methods by more than 12 points in accuracy on the Ego-Exo4D dataset. Constructed graphs are sparse and computationally efficient. Multimodal features contribute to improved performance.

Conclusion: The graph-based framework effectively addresses challenges in egocentric keystep recognition by leveraging long-term dependencies and multimodal alignment, achieving state-of-the-art performance with computational efficiency.

Abstract: Egocentric videos capture scenes from a wearer’s viewpoint, resulting in dynamic backgrounds, frequent motion, and occlusions, posing challenges to accurate keystep recognition. We propose a flexible graph-learning framework for fine-grained keystep recognition that is able to effectively leverage long-term dependencies in egocentric videos, and leverage alignment between egocentric and exocentric videos during training for improved inference on egocentric videos. Our approach consists of constructing a graph where each video clip of the egocentric video corresponds to a node. During training, we consider each clip of each exocentric video (if available) as additional nodes. We examine several strategies to define connections across these nodes and pose keystep recognition as a node classification task on the constructed graphs. We perform extensive experiments on the Ego-Exo4D dataset and show that our proposed flexible graph-based framework notably outperforms existing methods by more than 12 points in accuracy. Furthermore, the constructed graphs are sparse and compute-efficient. We also present a study examining how several multimodal features, including narrations, depth, and object class labels, can be harnessed on a heterogeneous graph, and discuss their corresponding contributions to keystep recognition performance.

[213] Dual-IPO: Dual-Iterative Preference Optimization for Text-to-Video Generation

Xiaomeng Yang, Mengping Yang, Jia Gong, Luozheng Qin, Zhiyu Tan, Hao Li

Main category: cs.CV

TL;DR: Dual-IPO: An iterative framework that co-optimizes reward models and video generation models to improve video synthesis quality and human preference alignment without manual annotations.

DetailsMotivation: Current video generation models produce realistic videos but often fail to align with users' authentic demands and preferences. There's a need for systematic optimization that improves synthesis quality (subject consistency, motion smoothness, aesthetic quality) while ensuring human preference alignment without requiring tedious manual annotations.

Method: Dual-Iterative Optimization (Dual-IPO) framework with sequential optimization of both reward models and video generation models. Reward models are improved via CoT-guided reasoning, voting-based self-consistency, and preference certainty estimation. Video foundation models are then optimized using reward model feedback signals. The two models complement each other through multi-round iterations.
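
A toy alternating loop matching the described paradigm, with stand-ins for every component: the certainty filter stands in for CoT-guided reasoning and self-consistency voting, and best-of-N sampling stands in for generator optimization.

```python
# Sketch: alternate between refining the reward model's training signal and
# optimizing the generator against it. All components are illustrative.
import random

def dual_ipo(generator, reward, rounds=3):
    for _ in range(rounds):
        # Step 1: improve the reward model on confident preference pairs.
        pairs = [(generator(), generator()) for _ in range(8)]
        confident = [p for p in pairs if abs(reward(p[0]) - reward(p[1])) > 0.1]
        # Step 2: optimize the generator with the reward model's feedback.
        best = max((generator() for _ in range(8)), key=reward)
        print(f"kept {len(confident)} confident pairs, best score {reward(best):.2f}")

dual_ipo(generator=lambda: random.random(), reward=lambda v: v)
```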

Result: Comprehensive experiments show Dual-IPO effectively improves video generation quality across various architectures and sizes. A model with only 2B parameters can surpass a 5B model. Analysis confirms the rationale of systematic design and efficacy of each component.

Conclusion: Dual-IPO provides an effective iterative paradigm for improving video generation quality and human preference alignment without manual annotations, demonstrating scalability across different model architectures and sizes.

Abstract: Recent advances in video generation have enabled thrilling experiences in producing realistic videos driven by scalable diffusion transformers. However, they usually fail to produce satisfactory outputs that are aligned to users’ authentic demands and preferences. In this work, we introduce Dual-Iterative Optimization (Dual-IPO), an iterative paradigm that sequentially optimizes both the reward model and the video generation model for improved synthesis quality and human preference alignment. For the reward model, our framework ensures reliable and robust reward signals via CoT-guided reasoning, voting-based self-consistency, and preference certainty estimation. Given this, we optimize video foundation models with guidance from the reward model’s feedback signals, thus improving the synthesis quality in subject consistency, motion smoothness, and aesthetic quality, etc. The reward model and video generation model complement each other and are progressively improved in the multi-round iteration, without requiring tedious manual preference annotations. Comprehensive experiments demonstrate that the proposed Dual-IPO can effectively and consistently improve the video generation quality of base models with various architectures and sizes, and even help a model with only 2B parameters surpass a 5B one. Moreover, our analysis experiments and ablation studies confirm the rationale of our systematic design and the efficacy of each component.

[214] Wandering around: A bioinspired approach to visual attention through object motion sensitivity

Giulia D’Angelo, Victoria Clerico, Chiara Bartolozzi, Matej Hoffmann, P. Michael Furlong, Alexander Hadjiivanov

Main category: cs.CV

TL;DR: Bioinspired spiking neural network attention system using event-based cameras for real-time object motion segmentation and selective attention in dynamic environments

DetailsMotivation: To develop efficient, low-latency vision systems inspired by biological attention mechanisms that can operate in real-time with reduced computational demands compared to traditional static feedforward architectures

Method: Spiking Convolutional Neural Network attention system using Dynamic Vision Sensor (event-based camera) integrated with Speck neuromorphic hardware on Pan-Tilt unit, generating events via fixational eye movements to identify Regions of Interest and saccade toward them

Result: Achieved 82.2% mean IoU and 96% mean SSIM in multi-object motion segmentation, 88.8% accuracy in office scenarios, 89.8% in low-light conditions, with 0.12s response time in real-time demonstrations

Conclusion: The learning-free, bioinspired system provides robust real-time performance for robotic applications, serving as foundation for more complex architectures in dynamic environments

Abstract: Active vision enables dynamic visual perception, offering an alternative to static feedforward architectures in computer vision, which rely on large datasets and high computational resources. Biological selective attention mechanisms allow agents to focus on salient Regions of Interest (ROIs), reducing computational demand while maintaining real-time responsiveness. Event-based cameras, inspired by the mammalian retina, enhance this capability by capturing asynchronous scene changes, enabling efficient low-latency processing. To distinguish moving objects while the event-based camera is in motion, the agent requires an object motion segmentation mechanism to accurately detect targets and center them in the visual field (fovea). Integrating event-based sensors with neuromorphic algorithms represents a paradigm shift, using Spiking Neural Networks to parallelize computation and adapt to dynamic environments. This work presents a Spiking Convolutional Neural Network bioinspired attention system for selective attention through object motion sensitivity. The system generates events via fixational eye movements using a Dynamic Vision Sensor integrated into the Speck neuromorphic hardware, mounted on a Pan-Tilt unit, to identify the ROI and saccade toward it. The system, characterized using ideal gratings and benchmarked against the Event Camera Motion Segmentation Dataset, reaches a mean IoU of 82.2% and a mean SSIM of 96% in multi-object motion segmentation. The detection of salient objects reaches 88.8% accuracy in office scenarios and 89.8% in low-light conditions on the Event-Assisted Low-Light Video Object Segmentation Dataset. A real-time demonstrator shows the system’s 0.12 s response to dynamic scenes. Its learning-free design ensures robustness across perceptual scenes, making it a reliable foundation for real-time robotic applications and a basis for more complex architectures.

[215] Self-Supervised Learning Based on Transformed Image Reconstruction for Equivariance-Coherent Feature Representation

Qin Wang, Alessio Quercia, Benjamin Bruns, Abigail Morrison, Hanno Scharr, Kai Krajsek

Main category: cs.CV

TL;DR: A novel self-supervised learning method that learns equivariance-coherent representations through intermediate transformation reconstruction, preserving transformation information while maintaining competitive performance on invariant tasks.

DetailsMotivation: Current SSL methods learn invariant representations by discarding transformation information, but some computer vision tasks actually require this information. Recent equivariant approaches impose restrictive assumptions that limit flexibility and generalization.

Method: Proposes equivariance-coherence as a weaker transformation relation definition. Introduces an SSL auxiliary task that reconstructs images at intermediate points along transformation paths (e.g., reconstructing 10° and 20° rotations when training on 30° rotations). Decomposes features into invariant and equivariant parts trained with standard SSL losses and reconstruction losses respectively.
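
A self-contained toy version of the intermediate-reconstruction auxiliary task, assuming a convolutional encoder/decoder pair and an appended progress channel; everything except the "reconstruct the 10° and 20° states of a 30° rotation" logic is an illustrative choice.

```python
# Sketch: reconstruct intermediate rotation states from the equivariant part.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms.functional as TF

enc = nn.Conv2d(3, 16, 3, padding=1)              # toy encoder
dec = nn.Conv2d(8 + 1, 3, 3, padding=1)           # toy decoder (eqv + progress)

def intermediate_reconstruction_loss(img, angle=30.0, steps=3):
    feats = enc(TF.rotate(img, angle))
    inv, eqv = feats.chunk(2, dim=1)              # invariant / equivariant parts
    loss = 0.0
    for k in range(1, steps):                     # the 10° and 20° states
        a = angle * k / steps
        t = torch.full_like(eqv[:, :1], a / angle)  # normalized progress channel
        recon = dec(torch.cat([eqv, t], dim=1))
        loss = loss + F.mse_loss(recon, TF.rotate(img, a))
    return loss / (steps - 1)

loss = intermediate_reconstruction_loss(torch.rand(2, 3, 32, 32))
```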

Result: Substantial improvements on synthetic equivariance benchmarks while maintaining competitive performance on downstream tasks requiring invariant representations. Seamlessly integrates with existing SSL methods (iBOT, DINOv2) and enhances performance across diverse tasks including segmentation, detection, depth estimation, and video dense prediction.

Conclusion: The framework provides a practical way to augment SSL methods with equivariant capabilities while preserving invariant performance, offering a balanced approach between invariance and equivariance in representation learning.

Abstract: Self-supervised learning (SSL) methods have achieved remarkable success in learning image representations that allow invariances, but they thereby discard transformation information that some computer vision tasks actually require. While recent approaches attempt to address this limitation by learning equivariant features using linear operators in feature space, they impose restrictive assumptions that constrain flexibility and generalization. We introduce a weaker definition for the transformation relation between image and feature space, denoted equivariance-coherence. We propose a novel SSL auxiliary task that learns equivariance-coherent representations through intermediate transformation reconstruction, which can be integrated with existing joint embedding SSL methods. Our key idea is to reconstruct images at intermediate points along transformation paths, e.g., when training on 30-degree rotations, we reconstruct the 10-degree and 20-degree rotation states. Reconstructing intermediate states requires the transformation information used in augmentations, rather than suppressing it, and therefore fosters features containing the augmented transformation information. Our method decomposes feature vectors into invariant and equivariant parts, training them with standard SSL losses and reconstruction losses, respectively. We demonstrate substantial improvements on synthetic equivariance benchmarks while maintaining competitive performance on downstream tasks requiring invariant representations. The approach seamlessly integrates with existing SSL methods (iBOT, DINOv2) and consistently enhances performance across diverse tasks, including segmentation, detection, depth estimation, and video dense prediction. Our framework provides a practical way to augment SSL methods with equivariant capabilities while preserving invariant performance.

[216] Unconditional Priors Matter! Improving Conditional Generation of Fine-Tuned Diffusion Models

Prin Phunyaphibarn, Phillip Y. Lee, Jaihoon Kim, Minhyuk Sung

Main category: cs.CV

TL;DR: CFG training with single network for both conditional/unconditional noise prediction causes poor unconditional priors that degrade conditional generation; replacing unconditional noise with predictions from a base model or different diffusion model improves quality.

DetailsMotivation: Current CFG-based training uses a single network to learn both conditional and unconditional noise prediction with small dropout rate, but this joint learning with limited bandwidth results in poor unconditional priors that actually degrade conditional generation quality.

Method: Propose replacing the unconditional noise predictions in CFG with those from: 1) the base model that was fine-tuned to create the conditional model, or 2) a different diffusion model entirely. This decouples unconditional noise estimation from the conditional training.
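
The proposed change fits in one line of the standard CFG update: the unconditional branch simply comes from a different network. The tensors below are shape-only placeholders, not real model outputs.

```python
# Standard CFG, but with the unconditional noise taken from the base model
# (or another diffusion model) instead of the fine-tuned network itself.
import torch

def cfg_with_base_prior(eps_cond, eps_uncond_base, w=7.5):
    return eps_uncond_base + w * (eps_cond - eps_uncond_base)

eps_cond = torch.randn(1, 4, 64, 64)         # fine-tuned model, with condition
eps_uncond_base = torch.randn(1, 4, 64, 64)  # base model, unconditional
eps = cfg_with_base_prior(eps_cond, eps_uncond_base)
```

No retraining is needed: only the source of the unconditional prediction at sampling time changes.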

Result: Significant improvement in conditional generation quality across various CFG-based models including Zero-1-to-3, Versatile Diffusion, DiT, DynamiCrafter, and InstructPix2Pix for both image and video generation tasks.

Conclusion: The unconditional noise component in CFG training is critical for conditional generation quality, and using better unconditional noise estimates from base models or other diffusion models can substantially improve conditional generation without retraining.

Abstract: Classifier-Free Guidance (CFG) is a fundamental technique in training conditional diffusion models. The common practice for CFG-based training is to use a single network to learn both conditional and unconditional noise prediction, with a small dropout rate for conditioning. However, we observe that the joint learning of unconditional noise with limited bandwidth in training results in poor priors for the unconditional case. More importantly, these poor unconditional noise predictions become a serious reason for degrading the quality of conditional generation. Inspired by the fact that most CFG-based conditional models are trained by fine-tuning a base model with better unconditional generation, we first show that simply replacing the unconditional noise in CFG with that predicted by the base model can significantly improve conditional generation. Furthermore, we show that a diffusion model other than the one the fine-tuned model was trained on can be used for unconditional noise replacement. We experimentally verify our claim with a range of CFG-based conditional models for both image and video generation, including Zero-1-to-3, Versatile Diffusion, DiT, DynamiCrafter, and InstructPix2Pix.

[217] Benchmarking 3D Human Pose Estimation Models under Occlusions

Filipa Lino, Carlos Santiago, Manuel Marques

Main category: cs.CV

TL;DR: Benchmark study evaluating robustness of 9 state-of-the-art 3D Human Pose Estimation models under realistic occlusion conditions, revealing significant performance degradation and identifying vulnerable joints.

DetailsMotivation: Occlusions pose a major challenge for accurate 3D human pose reconstruction in real-world scenarios, but current models' robustness under realistic occlusion conditions hasn't been systematically benchmarked.

Method: Evaluated 9 SOTA 2D-to-3D HPE models (convolutional, transformer-based, graph-based, diffusion-based) using BlendMimic3D dataset with ground-truth annotations. Simulated occlusion by adding noise to 2D keypoints based on real detector behavior, conducted global and per-joint sensitivity analyses without retraining models.
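
An illustrative version of the occlusion protocol: selected 2D keypoints are perturbed with detector-like noise before lifting to 3D. The Gaussian model and noise scale are assumptions; the benchmark calibrates its noise on real detector behavior.

```python
# Sketch: corrupt the 2D keypoints of "occluded" joints before 3D lifting.
import numpy as np

def occlude_keypoints(kp2d, occluded_idx, noise_std=15.0, seed=0):
    """kp2d: (J, 2) 2D keypoints in pixels; occluded_idx: joints to corrupt."""
    rng = np.random.default_rng(seed)
    out = kp2d.copy()
    out[occluded_idx] += rng.normal(0.0, noise_std, size=(len(occluded_idx), 2))
    return out

kp2d = np.random.rand(17, 2) * 1000
noisy = occlude_keypoints(kp2d, occluded_idx=[9, 10, 15, 16])  # wrists + feet
```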

Result: All models showed notable performance degradation under occlusion, with diffusion-based models underperforming despite their stochastic nature. Per-joint analysis revealed consistent vulnerability in distal joints (wrists, feet) across all model architectures.

Conclusion: Current 3D HPE models have critical limitations in handling occlusions, highlighting the need for improved real-world robustness. The benchmark provides insights for future model development.

Abstract: Human Pose Estimation (HPE) involves detecting and localizing keypoints on the human body from visual data. In 3D HPE, occlusions, where parts of the body are not visible in the image, pose a significant challenge for accurate pose reconstruction. This paper presents a benchmark on the robustness of 3D HPE models under realistic occlusion conditions, involving combinations of occluded keypoints commonly observed in real-world scenarios. We evaluate nine state-of-the-art 2D-to-3D HPE models, spanning convolutional, transformer-based, graph-based, and diffusion-based architectures, using the BlendMimic3D dataset, a synthetic dataset with ground-truth 2D/3D annotations and occlusion labels. All models were originally trained on Human3.6M and tested here without retraining to assess their generalization. We introduce a protocol that simulates occlusion by adding noise into 2D keypoints based on real detector behavior, and conduct both global and per-joint sensitivity analyses. Our findings reveal that all models exhibit notable performance degradation under occlusion, with diffusion-based models underperforming despite their stochastic nature. Additionally, a per-joint occlusion analysis identifies consistent vulnerability in distal joints (e.g., wrists, feet) across models. Overall, this work highlights critical limitations of current 3D HPE models in handling occlusions, and provides insights for improving real-world robustness.

[218] Survey of Video Diffusion Models: Foundations, Implementations, and Applications

Yimu Wang, Xuye Liu, Wei Pang, Li Ma, Shuai Yuan, Paul Debevec, Ning Yu

Main category: cs.CV

TL;DR: A comprehensive survey on diffusion-based video generation covering technical foundations, methodologies, applications, and current challenges in the field.

DetailsMotivation: To provide a systematic and updated review of diffusion-based video generation, addressing gaps in existing surveys by offering broader coverage including evaluation metrics, industry solutions, and training techniques.

Method: Survey methodology involving systematic taxonomy of diffusion-based video generation approaches, analysis of architectural innovations, optimization strategies, and investigation of applications across low-level vision tasks.

Result: A comprehensive resource covering evolution, technical foundations, practical applications, and synergies with related domains like video representation learning and question answering.

Conclusion: This survey serves as a foundational resource for researchers and practitioners, providing insights into both theoretical frameworks and practical implementations in the rapidly evolving field of diffusion-based video generation.

Abstract: Recent advances in diffusion models have revolutionized video generation, offering superior temporal consistency and visual quality compared to traditional approaches based on generative adversarial networks. While this emerging field shows tremendous promise in applications, it faces significant challenges in motion consistency, computational efficiency, and ethical considerations. This survey provides a comprehensive review of diffusion-based video generation, examining its evolution, technical foundations, and practical applications. We present a systematic taxonomy of current methodologies, analyze architectural innovations and optimization strategies, and investigate applications across low-level vision tasks such as denoising and super-resolution. Additionally, we explore the synergies between diffusion-based video generation and related domains, including video representation learning, question answering, and retrieval. Compared to the existing surveys (Lei et al., 2024a;b; Melnik et al., 2024; Cao et al., 2023; Xing et al., 2024c), which focus on specific aspects of video generation, such as human video synthesis (Lei et al., 2024a) or long-form content generation (Lei et al., 2024b), our work provides a broader, more updated, and more fine-grained perspective on diffusion-based approaches with a special section for evaluation metrics, industry solutions, and training engineering techniques in video generation. This survey serves as a foundational resource for researchers and practitioners working at the intersection of diffusion models and video generation, providing insights into both the theoretical frameworks and practical implementations that drive this rapidly evolving field. A structured list of related works involved in this survey is also available at https://github.com/Eyeline-Research/Survey-Video-Diffusion.

[219] Federated EndoViT: Pretraining Vision Transformers via Federated Learning on Endoscopic Image Collections

Max Kirchner, Alexander C. Jenke, Sebastian Bodenstedt, Fiona R. Kolbinger, Oliver L. Saldanha, Jakob N. Kather, Martin Wagner, Stefanie Speidel

Main category: cs.CV

TL;DR: Federated EndoViT enables privacy-preserving collaborative training of surgical foundation models using federated learning with adaptive optimization, achieving performance comparable to centralized training.

DetailsMotivation: Data privacy regulations prevent multi-institutional data aggregation needed for robust surgical foundation models. Federated learning offers a privacy-preserving solution for collaborative training across institutions.

Method: Introduces Federated EndoViT (FL-EndoViT) using Masked Autoencoder (MAE) pretraining in decentralized surgical settings. Integrates adaptive Sharpness-Aware Minimization (FedSAM) to handle severe data heterogeneity. Pretrained on Endo700k dataset and evaluated against centralized baseline on segmentation, action recognition, and phase recognition tasks.
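
A toy sketch of the federated loop with a SAM-flavoured local update: each client perturbs its weights toward the sharp point, computes the gradient there, then descends; the server averages weights (FedAvg). The model, objective, and rho are illustrative stand-ins, not the FL-EndoViT code.

```python
# Sketch: FedSAM-style local step + FedAvg aggregation.
import torch
import torch.nn as nn

def sam_step(model, loss_fn, lr=1e-3, rho=0.05):
    grads = torch.autograd.grad(loss_fn(model), list(model.parameters()))
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(rho * g / norm)               # step to the sharp neighbour
    grads2 = torch.autograd.grad(loss_fn(model), list(model.parameters()))
    with torch.no_grad():
        for p, g, g2 in zip(model.parameters(), grads, grads2):
            p.sub_(rho * g / norm)               # undo the perturbation
            p.sub_(lr * g2)                      # descend with the SAM gradient

def fedavg(server, clients):
    with torch.no_grad():
        for p_s, *p_cs in zip(server.parameters(),
                              *[c.parameters() for c in clients]):
            p_s.copy_(torch.stack(p_cs).mean(0))

server = nn.Linear(8, 8)
clients = [nn.Linear(8, 8) for _ in range(3)]
for c in clients:
    c.load_state_dict(server.state_dict())
    x = torch.randn(16, 8)
    sam_step(c, lambda m: ((m(x) - x) ** 2).mean())  # toy MAE-ish objective
fedavg(server, clients)
```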

Result: FedSAM is critical for successful pretraining, overcoming convergence failures of standard federated methods. FL-EndoViT performs comparably to centralized counterpart, with advantages in data-scarce, high-resolution segmentation and generalization to new surgical events. Full end-to-end fine-tuning is necessary for optimal performance.

Conclusion: Validates FL with adaptive optimization as viable paradigm for creating robust, privacy-preserving surgical foundation models. Provides scalable framework for collaborative Surgical Data Science and highlights optimizer’s critical role in handling data heterogeneity.

Abstract: Purpose: Data privacy regulations hinder the creation of generalizable foundation models (FMs) for surgery by preventing multi-institutional data aggregation. This study investigates federated learning (FL) as a privacy-preserving solution to collaboratively train robust surgical FMs. Methods: We introduce Federated EndoViT (FL-EndoViT), a federated framework that validates the Masked Autoencoder (MAE) pretraining strategy in a decentralized surgical setting. To ensure convergence under severe data heterogeneity, the architecture integrates adaptive Sharpness-Aware Minimization (FedSAM). Pretrained on the large-scale Endo700k dataset, FL-EndoViT is evaluated against a centralized baseline on different tasks including scene segmentation, action recognition, and phase recognition. Results: FedSAM is critical for successful pretraining, overcoming the convergence failures of standard federated methods. The resulting FL-EndoViT performs comparably to its centralized counterpart, with significant advantages in data-scarce, high-resolution segmentation and generalization to new surgical events. We also establish that full, end-to-end fine-tuning is necessary for optimal performance. Conclusion: This work validates FL with adaptive optimization as a viable paradigm for creating robust, privacy-preserving surgical FMs. Our findings provide a scalable framework for collaborative Surgical Data Science and underscore the optimizer’s critical role in handling data heterogeneity. Future work should explore video-based models to incorporate spatiotemporal dynamics.

[220] ReaMOT: A Benchmark and Framework for Reasoning-based Multi-Object Tracking

Sijia Chen, Yanqiu Yu, En Yu, Wenbing Tao

Main category: cs.CV

TL;DR: ReaMOT introduces a reasoning-based multi-object tracking task that requires logical reasoning to track objects based on implicit language instructions, with a new benchmark dataset and training-free framework combining LVLM reasoning with SAM2 temporal modeling.

DetailsMotivation: Existing referring multi-object tracking (RMOT) methods are designed for explicit instructions and fail to handle complex instructions requiring logical reasoning. The authors aim to address this limitation by creating a more challenging task that requires understanding implicit constraints through reasoning.

Method: Proposes ReaMOT (Reasoning-based Multi-Object Tracking) task and creates the ReaMOT Challenge benchmark with 1,156 instructions categorized into High-Level Reasoning and Low-Level Perception, covering 423,359 image-language pairs across 869 scenes. Also introduces ReaTrack, a training-free framework that synergizes the reasoning capabilities of Thinking-variant Large Vision-Language Models (LVLM) with the precise temporal modeling of SAM2.

Result: Extensive experiments on the ReaMOT Challenge benchmark demonstrate the effectiveness of the ReaTrack framework in handling complex reasoning-based tracking tasks.

Conclusion: ReaMOT addresses the limitation of existing RMOT methods by introducing a reasoning-based tracking task that requires logical reasoning for implicit instructions, with a comprehensive benchmark and effective framework that combines vision-language reasoning with temporal modeling.

Abstract: Referring Multi-Object Tracking (RMOT) aims to track targets specified by language instructions. However, existing RMOT paradigms are largely designed for explicit instructions and consequently fail to generalize to complex instructions that require logical reasoning. To overcome this, we propose Reasoning-based Multi-Object Tracking (ReaMOT), a novel task that requires models to identify and track targets that satisfy implicit constraints via logical reasoning. To advance this field, we construct the ReaMOT Challenge, a comprehensive benchmark comprising: (1) a large-scale dataset with 1,156 instructions categorized into High-Level Reasoning and Low-Level Perception, covering 423,359 image-language pairs across 869 diverse scenes; and (2) a tailored metric suite designed to jointly evaluate reasoning accuracy and tracking robustness. Furthermore, we propose ReaTrack, a training-free framework that synergizes the reasoning capabilities of Thinking-variant Large Vision-Language Models (LVLMs) with the precise temporal modeling of SAM2. Extensive experiments on the ReaMOT Challenge benchmark demonstrate the effectiveness of our ReaTrack framework.

[221] UFM: A Simple Path towards Unified Dense Correspondence with Flow

Yuchen Zhang, Nikhil Keetha, Chenwei Lyu, Bhuvan Jhamb, Yutian Chen, Yuheng Qiu, Jay Karhade, Shreyas Jha, Yaoyu Hu, Deva Ramanan, Sebastian Scherer, Wenshan Wang

Main category: cs.CV

TL;DR: UFM is a unified transformer model for both optical flow and wide-baseline matching that outperforms specialized approaches in both domains.

DetailsMotivation: Dense image correspondence has been tackled separately for wide-baseline scenarios and optical flow estimation, despite the common goal of matching content between images. The authors aim to develop a unified approach that can handle both tasks effectively.

Method: UFM uses a simple, generic transformer architecture that directly regresses (u,v) flow coordinates. It’s trained on unified data for co-visible pixels and avoids the typical coarse-to-fine cost volumes used in prior work, making it easier to train and more accurate for large flows.
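
To make the direct-regression idea concrete, below is a minimal sketch of a generic transformer that jointly attends over two images' patch tokens and regresses a dense (u, v) flow map without cost volumes. This illustrates the idea only and is not the authors' UFM architecture; all module names and sizes are hypothetical.

```python
# Sketch only: a tiny transformer that directly regresses dense (u, v) flow.
# NOT the UFM architecture; layer sizes and the per-patch head are assumptions.
import torch
import torch.nn as nn

class TinyFlowRegressor(nn.Module):
    def __init__(self, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.patch = patch
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, 2 * patch * patch)  # (u, v) per pixel in patch

    def forward(self, img_a, img_b):
        B, _, H, W = img_a.shape
        tok_a = self.embed(img_a).flatten(2).transpose(1, 2)  # (B, N, dim)
        tok_b = self.embed(img_b).flatten(2).transpose(1, 2)
        # Joint attention over both images; the first half decodes the flow.
        feats = self.encoder(torch.cat([tok_a, tok_b], dim=1))[:, : tok_a.shape[1]]
        flow = self.head(feats)                               # (B, N, 2*p*p)
        h, w = H // self.patch, W // self.patch
        flow = flow.view(B, h, w, 2, self.patch, self.patch)
        return flow.permute(0, 3, 1, 4, 2, 5).reshape(B, 2, H, W)

flow = TinyFlowRegressor()(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
print(flow.shape)  # torch.Size([1, 2, 64, 64]) -- dense per-pixel (u, v)
```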

Result: UFM achieves 28% higher accuracy than state-of-the-art flow methods (Unimatch), with 62% less error and 6.7x faster inference than dense wide-baseline matchers (RoMa). It’s the first unified model to outperform specialized approaches in both domains.

Conclusion: Unified training can outperform specialized approaches for dense correspondence tasks. UFM enables fast, general-purpose correspondence and opens new directions for multi-modal, long-range, and real-time correspondence applications.

Abstract: Dense image correspondence is central to many applications, such as visual odometry, 3D reconstruction, object association, and re-identification. Historically, dense correspondence has been tackled separately for wide-baseline scenarios and optical flow estimation, despite the common goal of matching content between two images. In this paper, we develop a Unified Flow & Matching model (UFM), which is trained on unified data for pixels that are co-visible in both source and target images. UFM uses a simple, generic transformer architecture that directly regresses the (u,v) flow. It is easier to train and more accurate for large flows compared to the typical coarse-to-fine cost volumes in prior work. UFM is 28% more accurate than state-of-the-art flow methods (Unimatch), while also having 62% less error and running 6.7x faster than dense wide-baseline matchers (RoMa). UFM is the first to demonstrate that unified training can outperform specialized approaches across both domains. This result enables fast, general-purpose correspondence and opens new directions for multi-modal, long-range, and real-time correspondence tasks.

[222] Common Objects Out of Context (COOCo): Investigating Multimodal Context and Semantic Scene Violations in Referential Communication

Filippo Merlo, Ece Takmaz, Wenkai Chen, Albert Gatt

Main category: cs.CV

TL;DR: VLMs adaptively use scene context for object reference generation, balancing local and contextual information based on scene-object semantic relatedness and noise levels.

DetailsMotivation: To understand how Vision-Language Models (VLMs) rely on scene context when generating references to objects, and under what conditions they adaptively use contextual information versus local object features.

Method: Introduced the Common Objects Out-of-Context (COOCo) dataset and conducted experiments on several VLMs under varying degrees of scene-object congruency and noise. Analyzed attention patterns as a function of target-scene semantic fit.

Result: VLMs leverage scene context adaptively based on semantic relatedness and noise levels. Successful object categorization is associated with increased mid-layer attention to the target. Attention shows non-monotonic dependency on semantic fit, dropping at moderate fit and increasing for both low and high fit.

Conclusion: VLMs dynamically balance local and contextual information for reference generation, with attention patterns revealing how models adapt to scene-object relationships.

Abstract: To what degree and under what conditions do VLMs rely on scene context when generating references to objects? To address this question, we introduce the Common Objects Out-of-Context (COOCo) dataset and conduct experiments on several VLMs under different degrees of scene-object congruency and noise. We find that models leverage scene context adaptively, depending on scene-object semantic relatedness and noise level. Based on these consistent trends across models, we turn to the question of how VLM attention patterns change as a function of target-scene semantic fit, and to what degree these patterns are predictive of categorisation accuracy. We find that successful object categorisation is associated with increased mid-layer attention to the target. We also find a non-monotonic dependency on semantic fit, with attention dropping at moderate fit and increasing for both low and high fit. These results suggest that VLMs dynamically balance local and contextual information for reference generation. Dataset and code are available at https://github.com/cs-nlp-uu/scenereg.

[223] TAMMs: Change Understanding and Forecasting in Satellite Image Time Series with Temporal-Aware Multimodal Models

Zhongbin Guo, Yuhao Wang, Ping Jian, Chengzhi Li, Xinyue Chen, Zhen Yang, Ertai E

Main category: cs.CV

TL;DR: TAMMs is a unified MLLM-diffusion framework that jointly performs Temporal Change Description and Future Satellite Image Forecasting by enhancing long-range temporal understanding through Temporal Adaptation Modules and Semantic-Fused Control Injection.

DetailsMotivation: Temporal Change Description (TCD) and Future Satellite Image Forecasting (FSIF) are critical but disjointed tasks in Satellite Image Time Series analysis, both fundamentally limited by modeling long-range temporal dynamics. The authors aim to improve performance on both tasks simultaneously by enhancing long-range temporal understanding capabilities.

Method: Introduces TAMMs, a unified framework with two key innovations: 1) Temporal Adaptation Modules (TAM) that enhance frozen MLLM’s ability to comprehend long-range dynamics, and 2) Semantic-Fused Control Injection (SFCI) mechanism that translates change understanding into fine-grained generative control. This creates a synergistic design where understanding from TCD directly informs and improves FSIF consistency.

Result: Extensive experiments demonstrate TAMMs significantly outperforms state-of-the-art specialist baselines on both Temporal Change Description and Future Satellite Image Forecasting tasks.

Conclusion: TAMMs successfully unifies TCD and FSIF tasks within a single MLLM-diffusion architecture, showing that enhancing long-range temporal understanding capabilities can simultaneously improve performance on both historically disjointed tasks in satellite image analysis.

Abstract: Temporal Change Description (TCD) and Future Satellite Image Forecasting (FSIF) are critical, yet historically disjointed tasks in Satellite Image Time Series (SITS) analysis. Both are fundamentally limited by the common challenge of modeling long-range temporal dynamics. To explore how to improve the performance of methods on both tasks simultaneously by enhancing long-range temporal understanding capabilities, we introduce TAMMs, the first unified framework designed to jointly perform TCD and FSIF within a single MLLM-diffusion architecture. TAMMs introduces two key innovations: Temporal Adaptation Modules (TAM) enhance the frozen MLLM’s ability to comprehend long-range dynamics, and a Semantic-Fused Control Injection (SFCI) mechanism translates this change understanding into fine-grained generative control. This synergistic design enables the understanding from the TCD task to directly inform and improve the consistency of the FSIF task. Extensive experiments demonstrate TAMMs significantly outperforms state-of-the-art specialist baselines on both tasks. Our dataset can be found at https://huggingface.co/datasets/IceInPot/TAMMs.

[224] CARINOX: Inference-time Scaling with Category-Aware Reward-based Initial Noise Optimization and Exploration

Seyed Amir Kasaei, Ali Aghayari, Arash Marioriyad, Niki Sepasian, Shayan Baghayi Nejad, MohammadAmin Fazli, Mahdieh Soleymani Baghshah, Mohammad Hossein Rohban

Main category: cs.CV

TL;DR: CARINOX is a unified framework combining noise optimization and exploration with principled reward selection to improve compositional alignment in text-to-image diffusion models without fine-tuning.

DetailsMotivation: Text-to-image diffusion models like Stable Diffusion often fail at compositional alignment for complex prompts describing object relationships, attributes, or spatial arrangements. Existing inference-time approaches have limitations: optimization can stall due to poor initialization, while exploration requires too many samples. Neither single reward metrics nor ad-hoc combinations reliably capture all aspects of compositionality.

Method: CARINOX combines noise optimization and exploration with a principled reward selection procedure grounded in correlation with human judgments. It uses category-aware reward-based initial noise optimization and exploration to address different compositional challenges.
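
A hedged sketch of how the two strategies combine: exploration samples several initial noises and keeps the best-scoring one, then optimization refines that noise by gradient ascent on the reward. The generator and reward below are toy stand-ins, not CARINOX's diffusion model or its category-aware reward selection.

```python
# Toy stand-ins: the "generator" and "reward" below are assumptions for
# illustration, not CARINOX's actual sampler or reward functions.
import torch

def generate(noise):                 # stand-in for a frozen diffusion sampler
    return torch.tanh(noise)

def reward(image):                   # stand-in for a category-aware reward
    return -((image - 0.5) ** 2).mean()

# Exploration: draw several initial noises and keep the best-scoring one.
candidates = [torch.randn(3, 64, 64) for _ in range(8)]
best = max(candidates, key=lambda z: reward(generate(z)).item())

# Optimization: refine the winning noise by ascending the reward.
z = best.clone().requires_grad_(True)
opt = torch.optim.Adam([z], lr=0.05)
for _ in range(20):
    opt.zero_grad()
    (-reward(generate(z))).backward()  # maximize reward = minimize its negative
    opt.step()
print(f"final reward: {reward(generate(z)).item():.4f}")
```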

Result: CARINOX raises average alignment scores by +16% on T2I-CompBench++ and +11% on the HRS benchmark, consistently outperforming state-of-the-art optimization and exploration-based methods across all major categories while preserving image quality and diversity.

Conclusion: The unified framework effectively addresses compositional alignment challenges in text-to-image diffusion models by combining optimization and exploration with principled reward selection, achieving significant improvements over existing methods.

Abstract: Text-to-image diffusion models, such as Stable Diffusion, can produce high-quality and diverse images but often fail to achieve compositional alignment, particularly when prompts describe complex object relationships, attributes, or spatial arrangements. Recent inference-time approaches address this by optimizing or exploring the initial noise under the guidance of reward functions that score text-image alignment without requiring model fine-tuning. While promising, each strategy has intrinsic limitations when used alone: optimization can stall due to poor initialization or unfavorable search trajectories, whereas exploration may require a prohibitively large number of samples to locate a satisfactory output. Our analysis further shows that neither single reward metrics nor ad-hoc combinations reliably capture all aspects of compositionality, leading to weak or inconsistent guidance. To overcome these challenges, we present Category-Aware Reward-based Initial Noise Optimization and Exploration (CARINOX), a unified framework that combines noise optimization and exploration with a principled reward selection procedure grounded in correlation with human judgments. Evaluations on two complementary benchmarks covering diverse compositional challenges show that CARINOX raises average alignment scores by +16% on T2I-CompBench++ and +11% on the HRS benchmark, consistently outperforming state-of-the-art optimization and exploration-based methods across all major categories, while preserving image quality and diversity. The project page is available at https://amirkasaei.com/carinox/.

[225] Scalable Dynamic Origin-Destination Demand Estimation Enhanced by High-Resolution Satellite Imagery Data

Jiachao Liu, Pablo Guarda, Koichiro Niinuma, Sean Qian

Main category: cs.CV

TL;DR: A framework for dynamic origin-destination demand estimation using satellite imagery and traffic data, with computer vision for vehicle detection and map matching.

DetailsMotivation: Traditional traffic data from local sensors is sparse and limited. Satellite imagery provides consistent, city-wide road and traffic information for both parking and moving vehicles, overcoming data availability limitations for better demand estimation.

Method: Design a computer vision pipeline for class-specific vehicle detection and map matching from satellite imagery to generate link-level traffic density observations. Formulate a computational graph-based DODE framework that calibrates dynamic network states by jointly matching observed traffic counts/speeds from local sensors with satellite-derived density measurements.
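
The joint-calibration idea can be sketched as a single differentiable objective: demand parameters are fit so that simulated link quantities match both sensor counts and satellite-derived densities. The linear "simulator" below is a toy stand-in for the paper's computational-graph network model.

```python
# Toy joint calibration: fit OD demand to sensor counts AND satellite densities.
# The linear maps are stand-ins for a mesoscopic network simulator.
import torch

num_od, num_links = 6, 10
A_counts = torch.rand(num_links, num_od)    # assumed OD -> link-count response
A_density = torch.rand(num_links, num_od)   # assumed OD -> link-density response
obs_counts = torch.rand(num_links) * 100    # local-sensor observations
obs_density = torch.rand(num_links) * 50    # satellite-derived densities

demand = torch.ones(num_od, requires_grad=True)
opt = torch.optim.Adam([demand], lr=0.05)
for _ in range(200):
    loss = ((A_counts @ demand - obs_counts) ** 2).mean() \
         + 0.5 * ((A_density @ demand - obs_density) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print("calibrated OD demand:", demand.detach())
```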

Result: Supplementing traditional data with satellite-derived density significantly improves estimation performance, especially for links without local sensors. Real-world experiments show the framework’s potential for practical deployment on large-scale networks.

Conclusion: The integrated framework combining satellite imagery with conventional traffic data provides improved dynamic origin-destination demand estimation, particularly benefiting areas without local sensors and enabling large-scale network applications.

Abstract: This study presents a novel integrated framework for dynamic origin-destination demand estimation (DODE) in multi-class mesoscopic network models, incorporating high-resolution satellite imagery together with conventional traffic data from local sensors. Unlike sparse local detectors, satellite imagery offers consistent, city-wide road and traffic information of both parking and moving vehicles, overcoming data availability limitations. To extract information from imagery data, we design a computer vision pipeline for class-specific vehicle detection and map matching, generating link-level traffic density observations by vehicle class. Building upon this information, we formulate a computational graph-based DODE framework that calibrates dynamic network states by jointly matching observed traffic counts/speeds from local sensors with density measurements derived from satellite imagery. To assess the accuracy and robustness of the proposed framework, we conduct a series of numerical experiments using both synthetic and real-world data. The results demonstrate that supplementing traditional data with satellite-derived density significantly improves estimation performance, especially for links without local sensors. Real-world experiments also show the framework’s potential for practical deployment on large-scale networks. Sensitivity analysis further evaluates the impact of data quality related to satellite imagery data.

[226] Multi-Expert Learning Framework with the State Space Model for Optical and SAR Image Registration

Wei Wang, Dou Quan, Ning Huyan, Chonghua Lv, Shuang Wang, Yunan Li, Licheng Jiao

Main category: cs.CV

TL;DR: ME-SSM: A multi-expert learning framework with State Space Model for optical and SAR image registration, addressing cross-modal challenges through dynamic feature fusion and efficient global context capture.

DetailsMotivation: Address challenges in optical-SAR image registration: (i) nonlinear radiometric variations between modalities, (ii) limited textures hindering feature extraction, (iii) CNN's limited receptive field vs Transformer's high computational cost.

Method: Multi-expert learning framework with State Space Model (Mamba) using: (1) multi-expert feature extraction from various image transformations with learnable soft router for dynamic fusion, (2) Mamba’s multi-directional cross-scanning for efficient global context capture with linear complexity, (3) multi-level feature aggregation module for multi-scale fusion.
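
A minimal sketch of the learnable soft router in step (1): per-sample softmax weights fuse the outputs of several experts. The experts here are plain linear layers, not the paper's transformation-specific branches.

```python
# Soft-router fusion sketch: router weights softly combine expert outputs.
import torch
import torch.nn as nn

class SoftRouterFusion(nn.Module):
    def __init__(self, dim, num_experts):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.router = nn.Linear(dim, num_experts)

    def forward(self, x):                                     # x: (B, dim)
        weights = self.router(x).softmax(dim=-1)              # (B, E)
        feats = torch.stack([e(x) for e in self.experts], 1)  # (B, E, dim)
        return (weights.unsqueeze(-1) * feats).sum(dim=1)     # fused (B, dim)

out = SoftRouterFusion(dim=32, num_experts=3)(torch.randn(4, 32))
print(out.shape)  # torch.Size([4, 32])
```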

Result: Extensive experiments demonstrate effectiveness and advantages of ME-SSM for optical and SAR image registration, showing improved accuracy while avoiding high computational costs.

Conclusion: ME-SSM successfully addresses cross-modal registration challenges through multi-expert learning and efficient state space modeling, achieving better performance than existing methods.

Abstract: Optical and Synthetic Aperture Radar (SAR) image registration is crucial for multi-modal image fusion and applications. However, several challenges limit the performance of existing deep learning-based methods in cross-modal image registration: (i) significant nonlinear radiometric variations between optical and SAR images affect the shared feature learning and matching; (ii) limited textures in images hinder discriminative feature extraction; (iii) the local receptive field of Convolutional Neural Networks (CNNs) restricts the learning of contextual information, while the Transformer can capture long-range global features but with high computational complexity. To address these issues, this paper proposes a multi-expert learning framework with the State Space Model (ME-SSM) for optical and SAR image registration. Firstly, to improve the registration performance with limited textures, ME-SSM constructs a multi-expert learning framework to capture shared features from multi-modal images. Specifically, it extracts features from various transformations of the input image and employs a learnable soft router to dynamically fuse these features, thereby enriching feature representations and improving registration performance. Secondly, ME-SSM introduces a state space model, Mamba, for feature extraction, which employs a multi-directional cross-scanning strategy to efficiently capture global contextual relationships with linear complexity. ME-SSM can expand the receptive field, enhance image registration accuracy, and avoid incurring high computational costs. Additionally, ME-SSM uses a multi-level feature aggregation (MFA) module to enhance the multi-scale feature fusion and interaction. Extensive experiments have demonstrated the effectiveness and advantages of our proposed ME-SSM on optical and SAR image registration.

[227] Driving as a Diagnostic Tool: Scenario-based Cognitive Assessment in Older Drivers from Driving Video

Md Zahid Hasan, Guillermo Basulto-Elias, Jun Ha Chang, Shauna Hallmark, Matthew Rizzo, Anuj Sharma, Soumik Sarkar

Main category: cs.CV

TL;DR: Using large vision models to analyze naturalistic driving videos for early detection of cognitive decline in older drivers by identifying behavioral patterns that correlate with dementia and MCI.

DetailsMotivation: Current diagnostic methods for cognitive decline (Dementia, MCI) are time-consuming and costly, leading to underdiagnosis. Real-world driving behavior captured through in-vehicle sensors can provide "digital fingerprints" that correlate with functional decline, enabling early detection through non-invasive monitoring.

Method: Proposes a framework using large vision models to analyze naturalistic driving videos across different roadway scenarios. The method extracts behavioral patterns from driving videos that serve as observations of current cognitive status, treating the vehicle as a “diagnostic tool” to identify early warning signs of functional impairment.

Result: The approach enables identification of cognitive status and prediction of disease progression from driving behavior patterns. It contributes to proactive intervention strategies by detecting early warning signs of functional impairment through scalable, non-invasive monitoring systems.

Conclusion: This work enhances early detection of cognitive decline in aging populations and supports the development of scalable, non-invasive monitoring systems to mitigate the growing societal and economic burden of cognitive disorders through analysis of real-world driving behavior using large vision models.

Abstract: We introduce scenario-based cognitive status identification in older drivers from naturalistic driving videos, leveraging large vision models. In recent times, cognitive decline including Dementia and Mild Cognitive Impairment (MCI), is often underdiagnosed due to the time-consuming and costly nature of current diagnostic methods. By analyzing real-world driving behavior captured through in-vehicle sensors, this study aims to extract “digital fingerprints” that correlate with functional decline and clinical features of dementia. Moreover, modern large vision models can draw meaningful insights from everyday driving patterns across different roadway scenarios to early detect cognitive decline. We propose a framework that uses large vision models and naturalistic driving videos to analyze driver behavior, identify cognitive status and predict disease progression. We leverage the strong relationship between real-world driving behavior as an observation of the current cognitive status of the drivers where the vehicle can be utilized as a “diagnostic tool”. Our method identifies early warning signs of functional impairment, contributing to proactive intervention strategies. This work enhances early detection and supports the development of scalable, non-invasive monitoring systems to mitigate the growing societal and economic burden of cognitive decline in the aging population.

[228] Local Dense Logit Relations for Enhanced Knowledge Distillation

Liuchi Xu, Kang Liu, Jinshuai Liu, Lu Wang, Lisheng Xu, Jun Cheng

Main category: cs.CV

TL;DR: LDRLD is a novel logit distillation method that captures fine-grained inter-class relationships through recursive decoupling and recombining of logit information, enhanced by adaptive weighting strategies.

DetailsMotivation: Existing logit distillation methods lack thorough exploration of fine-grained relationships within logit knowledge, limiting their ability to provide detailed insights for student model learning.

Method: Proposes Local Dense Relational Logit Distillation (LDRLD) that recursively decouples and recombines logit information to capture inter-class relationships. Introduces Adaptive Decay Weight (ADW) strategy using Inverse Rank Weighting (IRW) and Exponential Rank Decay (ERD) to dynamically adjust weights for critical category pairs. Also distills remaining non-target knowledge after recursive decoupling.
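
To illustrate the ADW strategy, the toy computation below combines an inverse-rank term (IRW) with an exponential decay term (ERD) for each category pair. The functional forms and hyperparameters are assumptions, not the paper's exact formulas.

```python
# Assumed functional forms for IRW and ERD, for illustration only.
import numpy as np

logits = np.array([4.0, 2.5, 1.0, 0.5])      # teacher logits for one sample
ranks = np.argsort(np.argsort(-logits))      # rank 0 = largest logit

def pair_weight(i, j, alpha=1.0, beta=0.1):
    irw = 1.0 / (abs(int(ranks[i]) - int(ranks[j])) + alpha)  # inverse rank weighting
    erd = np.exp(-beta * (ranks[i] + ranks[j]))               # exponential rank decay
    return irw * erd

for i in range(len(logits)):
    for j in range(i + 1, len(logits)):
        print(f"pair ({i},{j}): weight = {pair_weight(i, j):.3f}")
```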

Result: Extensive experiments on CIFAR-100, ImageNet-1K, and Tiny-ImageNet datasets demonstrate favorable comparison with state-of-the-art logit-based distillation approaches.

Conclusion: The method improves student performance by transferring fine-grained knowledge and emphasizing critical relationships, offering a more detailed and clearer approach to logit distillation.

Abstract: State-of-the-art logit distillation methods exhibit versatility, simplicity, and efficiency. Despite the advances, existing studies have yet to delve thoroughly into fine-grained relationships within logit knowledge. In this paper, we propose Local Dense Relational Logit Distillation (LDRLD), a novel method that captures inter-class relationships through recursively decoupling and recombining logit information, thereby providing more detailed and clearer insights for student learning. To further optimize the performance, we introduce an Adaptive Decay Weight (ADW) strategy, which can dynamically adjust the weights for critical category pairs using Inverse Rank Weighting (IRW) and Exponential Rank Decay (ERD). Specifically, IRW assigns weights inversely proportional to the rank differences between pairs, while ERD adaptively controls weight decay based on total ranking scores of category pairs. Furthermore, after the recursive decoupling, we distill the remaining non-target knowledge to ensure knowledge completeness and enhance performance. Ultimately, our method improves the student’s performance by transferring fine-grained knowledge and emphasizing the most critical relationships. Extensive experiments on datasets such as CIFAR-100, ImageNet-1K, and Tiny-ImageNet demonstrate that our method compares favorably with state-of-the-art logit-based distillation approaches. The code will be made publicly available.

[229] SPARC: Separating Perception And Reasoning Circuits for Test-time Scaling of VLMs

Niccolo Avogaro, Nayanika Debnath, Li Mi, Thomas Frick, Junling Wang, Zexue He, Hang Hua, Konrad Schindler, Mattia Rigotti

Main category: cs.CV

TL;DR: SPARC is a modular VLM framework that decouples visual perception from reasoning through explicit visual search followed by region-conditioned reasoning, enabling test-time scaling and compute efficiency.

DetailsMotivation: Current VLMs have brittle test-time scaling where perception and reasoning are entangled in unstructured chains-of-thought, leading to error cascades and requiring expensive RL training. There's a need for modular architectures that separate these functions, much as the brain separates sensory from cognitive processing.

Method: Two-stage pipeline: 1) Visual search stage localizes question-relevant regions, 2) Reasoning stage conditions on those regions for final answer. Enables asymmetric compute allocation, selective optimization, and compressed contexts via multi-resolution processing.
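
Schematically, the pipeline is a search call followed by a reasoning call, so each stage can be scaled or optimized independently. Both stages below are placeholder callables standing in for an LVLM, with assumed interfaces.

```python
# Placeholder two-stage pipeline; both stages stand in for LVLM calls.
import numpy as np

def visual_search(image, question):
    """Stage 1 (assumed interface): return question-relevant regions."""
    return [(0, 0, 64, 64), (32, 32, 96, 96)]     # toy (x1, y1, x2, y2) boxes

def reason(image, question, regions, budget_tokens=256):
    """Stage 2 (assumed interface): answer conditioned only on the regions."""
    crops = [image[y1:y2, x1:x2] for (x1, y1, x2, y2) in regions]
    return f"answer from {len(crops)} region(s) under a {budget_tokens}-token budget"

image = np.zeros((128, 128, 3), dtype=np.uint8)
question = "what color is the sign?"
print(reason(image, question, visual_search(image, question)))
```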

Result: Outperforms monolithic baselines and visual-grounding approaches on challenging visual reasoning benchmarks. Improves Qwen3VL-4B accuracy by 6.7 points on V* VQA, surpasses “thinking with images” by 4.6 points on OOD task with 200× lower token budget.

Conclusion: Explicit separation of perception and reasoning circuits enables more robust test-time scaling, computational efficiency, and better performance in VLMs through modular design principles.

Abstract: Despite recent successes, test-time scaling - i.e., dynamically expanding the token budget during inference as needed - remains brittle for vision-language models (VLMs): unstructured chains-of-thought about images entangle perception and reasoning, leading to long, disorganized contexts where small perceptual mistakes may cascade into completely wrong answers. Moreover, expensive reinforcement learning with hand-crafted rewards is required to achieve good performance. Here, we introduce SPARC (Separating Perception And Reasoning Circuits), a modular framework that explicitly decouples visual perception from reasoning. Inspired by sequential sensory-to-cognitive processing in the brain, SPARC implements a two-stage pipeline where the model first performs explicit visual search to localize question-relevant regions, then conditions its reasoning on those regions to produce the final answer. This separation enables independent test-time scaling with asymmetric compute allocation (e.g., prioritizing perceptual processing under distribution shift), supports selective optimization (e.g., improving the perceptual stage alone when it is the bottleneck for end-to-end performance), and accommodates compressed contexts by running global search at lower image resolutions and allocating high-resolution processing only to selected regions, thereby reducing total visual token count and compute. Across challenging visual reasoning benchmarks, SPARC outperforms monolithic baselines and strong visual-grounding approaches. For instance, SPARC improves the accuracy of Qwen3VL-4B on the V* VQA benchmark by 6.7 percentage points, and it surpasses “thinking with images” by 4.6 points on a challenging OOD task despite requiring a 200× lower token budget.

[230] ERTACache: Error Rectification and Timesteps Adjustment for Efficient Diffusion

Xurui Peng, Chenqian Yan, Hong Liu, Rui Ma, Fangmin Chen, Xing Wang, Zhihua Wu, Songwei Liu, Mingbao Lin

Main category: cs.CV

TL;DR: ERTACache is a principled caching framework for diffusion models that accelerates inference by reusing intermediate features while minimizing quality degradation through error analysis and correction.

DetailsMotivation: Diffusion models have high computational overhead due to iterative inference. While feature caching offers acceleration potential, naive reuse causes quality degradation. The paper aims to develop a principled caching approach that maintains quality while achieving significant speedup.

Method: Proposes ERTACache framework that: 1) analyzes cumulative caching error into feature shift and step amplification components, 2) uses offline residual profiling to identify reusable steps, 3) dynamically adjusts integration intervals with trajectory-aware correction, and 4) approximates cache-induced errors via closed-form residual linearization model.
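
The caching mechanics can be sketched as a denoising loop that runs the backbone only on scheduled steps and otherwise reuses a rectified cached feature. The reuse schedule and correction coefficient below are placeholders for ERTACache's offline residual profiling and trajectory-aware correction.

```python
# Toy caching loop; the schedule and gamma are placeholder assumptions.
import torch

def expensive_block(x, t):           # stand-in for a diffusion backbone call
    return x * 0.95 + 0.01 * t

num_steps = 50
reuse = {t for t in range(num_steps) if t % 3 != 0}  # assumed: recompute every 3rd step
gamma = 0.9                                          # assumed correction coefficient

x, cache = torch.randn(4, 8), None
for t in reversed(range(num_steps)):
    if t in reuse and cache is not None:
        out = gamma * cache          # cheap path: rectified cached feature
    else:
        out = expensive_block(x, t)  # full forward pass
        cache = out
    x = x - 0.02 * out               # toy sampler update
print(f"final state norm: {x.norm().item():.3f}")
```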

Result: Achieves up to 2x inference speedup on image and video generation benchmarks while preserving or improving visual quality. On the Wan2.1 video diffusion model, it achieves 2x acceleration with minimal VBench degradation, maintaining baseline fidelity while improving efficiency.

Conclusion: ERTACache provides an effective framework for accelerating diffusion model inference through principled caching, balancing speed and quality trade-offs. The method enables aggressive cache reuse while maintaining generation quality.

Abstract: Diffusion models suffer from substantial computational overhead due to their inherently iterative inference process. While feature caching offers a promising acceleration strategy by reusing intermediate outputs across timesteps, naive reuse often incurs noticeable quality degradation. In this work, we formally analyze the cumulative error introduced by caching and decompose it into two principal components: feature shift error, caused by inaccuracies in cached outputs, and step amplification error, which arises from error propagation under fixed timestep schedules. To address these issues, we propose ERTACache, a principled caching framework that jointly rectifies both error types. Our method employs an offline residual profiling stage to identify reusable steps, dynamically adjusts integration intervals via a trajectory-aware correction coefficient, and analytically approximates cache-induced errors through a closed-form residual linearization model. Together, these components enable accurate and efficient sampling under aggressive cache reuse. Extensive experiments across standard image and video generation benchmarks show that ERTACache achieves up to 2x inference speedup while consistently preserving or even improving visual quality. Notably, on the state-of-the-art Wan2.1 video diffusion model, ERTACache delivers 2x acceleration with minimal VBench degradation, effectively maintaining baseline fidelity while significantly improving efficiency. The code is available at https://github.com/bytedance/ERTACache.

[231] LD-ViCE: Latent Diffusion Model for Video Counterfactual Explanations

Payal Varshney, Adriano Lucieri, Christoph Balada, Sheraz Ahmed, Andreas Dengel

Main category: cs.CV

TL;DR: LD-ViCE is a novel framework for generating counterfactual explanations for video-based AI models using latent diffusion models, improving temporal coherence and semantic fidelity while reducing computational costs.

DetailsMotivation: Video-based AI systems in safety-critical domains like autonomous driving and healthcare need interpretable decisions, but existing explanation techniques lack temporal coherence and actionable causal insights. Current counterfactual methods don't incorporate target model guidance, reducing semantic fidelity and practical utility.

Method: LD-ViCE operates in latent space using a state-of-the-art diffusion model to reduce computational costs, with an additional refinement step to produce realistic and interpretable counterfactuals. It generates explanations by modifying video content in semantically meaningful ways while maintaining temporal consistency.

Result: Experiments on three diverse video datasets (EchoNet-Dynamic, FERV39k, Something-Something V2) with multiple target models show LD-ViCE generalizes well and achieves state-of-the-art performance. On EchoNet-Dynamic, it achieves significantly higher regression accuracy than prior methods and exhibits high temporal consistency, with refinement further improving perceptual quality.

Conclusion: LD-ViCE advances the trustworthiness and interpretability of video-based AI systems through visually coherent counterfactual explanations, providing actionable insights into model behavior across various video understanding tasks.

Abstract: Video-based AI systems are increasingly adopted in safety-critical domains such as autonomous driving and healthcare. However, interpreting their decisions remains challenging due to the inherent spatiotemporal complexity of video data and the opacity of deep learning models. Existing explanation techniques often suffer from limited temporal coherence and a lack of actionable causal insights. Current counterfactual explanation methods typically do not incorporate guidance from the target model, reducing semantic fidelity and practical utility. We introduce Latent Diffusion for Video Counterfactual Explanations (LD-ViCE), a novel framework designed to explain the behavior of video-based AI models. Compared to previous approaches, LD-ViCE reduces the computational costs of generating explanations by operating in latent space using a state-of-the-art diffusion model, while producing realistic and interpretable counterfactuals through an additional refinement step. Experiments on three diverse video datasets - EchoNet-Dynamic (cardiac ultrasound), FERV39k (facial expression), and Something-Something V2 (action recognition) with multiple target models covering both classification and regression tasks, demonstrate that LD-ViCE generalizes well and achieves state-of-the-art performance. On the EchoNet-Dynamic dataset, LD-ViCE achieves significantly higher regression accuracy than prior methods and exhibits high temporal consistency, while the refinement stage further improves perceptual quality. Qualitative analyses confirm that LD-ViCE produces semantically meaningful and temporally coherent explanations, providing actionable insights into model behavior. LD-ViCE advances the trustworthiness and interpretability of video-based AI systems through visually coherent counterfactual explanations.

[232] VIMD: Monocular Visual-Inertial Motion and Depth Estimation

Saimouli Katragadda, Guoquan Huang

Main category: cs.CV

TL;DR: VIMD: Monocular visual-inertial motion and depth learning framework that estimates dense metric depth by leveraging MSCKF-based motion tracking and iteratively refining per-pixel scale using multi-view information.

DetailsMotivation: Need for accurate and efficient dense metric depth estimation for 3D visual perception in robotics and XR applications, with practical deployment in resource-constrained settings.

Method: Uses MSCKF-based monocular visual-inertial motion tracking to exploit multi-view information for iterative per-pixel scale refinement (instead of global affine fitting), making it modular and compatible with various depth estimation backbones.
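
A toy illustration of the per-pixel scale idea: instead of one global affine fit, a dense scale map is optimized so that scaled relative depth agrees with a few sparse metric anchors, under a smoothness prior. The data, loss, and weights are synthetic assumptions.

```python
# Synthetic per-pixel scale refinement; all data and weights are assumptions.
import torch

H, W = 32, 32
rel_depth = torch.rand(H, W) + 0.5               # relative (scale-free) depth
scale = torch.ones(H, W, requires_grad=True)     # per-pixel scale to refine

ys, xs = torch.randint(0, H, (15,)), torch.randint(0, W, (15,))
metric = rel_depth[ys, xs] * 2.0                 # 15 sparse metric anchors

opt = torch.optim.Adam([scale], lr=0.1)
for _ in range(100):
    pred = scale * rel_depth
    data = ((pred[ys, xs] - metric) ** 2).mean()  # fit the sparse points
    smooth = scale.diff(dim=0).abs().mean() + scale.diff(dim=1).abs().mean()
    loss = data + 0.01 * smooth
    opt.zero_grad(); loss.backward(); opt.step()
print(f"mean scale: {scale.detach().mean().item():.2f}")  # drifts toward ~2.0
```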

Result: Achieves exceptional accuracy and robustness on TartanAir and VOID datasets, with strong zero-shot generalization on AR Table dataset, even with extremely sparse points (10-20 metric depth points per image).

Conclusion: VIMD provides a practical solution for resource-constrained deployment with robust performance and strong generalization capabilities across various scenarios.

Abstract: Accurate and efficient dense metric depth estimation is crucial for 3D visual perception in robotics and XR. In this paper, we develop a monocular visual-inertial motion and depth (VIMD) learning framework to estimate dense metric depth by leveraging accurate and efficient MSCKF-based monocular visual-inertial motion tracking. At the core of the proposed VIMD is the exploitation of multi-view information to iteratively refine per-pixel scale, instead of globally fitting an invariant affine model as in prior work. The VIMD framework is highly modular, making it compatible with a variety of existing depth estimation backbones. We conduct extensive evaluations on the TartanAir and VOID datasets and demonstrate its zero-shot generalization capabilities on the AR Table dataset. Our results show that VIMD achieves exceptional accuracy and robustness, even with extremely sparse points, as few as 10-20 metric depth points per image. This makes the proposed VIMD a practical solution for deployment in resource-constrained settings, while its robust performance and strong generalization capabilities offer significant potential across a wide range of scenarios.

[233] MoWM: Mixture-of-World-Models for Embodied Planning via Latent-to-Pixel Feature Modulation

Yangcheng Yu, Xin Jin, Yu Shang, Xin Zhang, Haisheng Su, Wei Wu, Yong Li

Main category: cs.CV

TL;DR: MoWM: A mixture-of-world-model framework that fuses motion-aware latent features with pixel-space features for embodied action planning in robotics, achieving state-of-the-art performance on manipulation tasks.

DetailsMotivation: Current video generation world models rely on pixel-level reconstruction which introduces visual redundancies that hinder action decoding and generalization, while latent world models overlook fine-grained details critical for precise manipulation.

Method: Proposes MoWM framework that combines motion-aware latent world model features with pixel-space features to emphasize action-relevant visual details for action decoding in embodied planning tasks.

Result: Achieves state-of-the-art task success rates and superior generalization on CALVIN and real-world manipulation tasks, with comprehensive analysis of feature space strengths.

Conclusion: The mixture-of-world-model approach effectively balances motion awareness and fine-grained visual details for improved embodied action planning, offering valuable insights for future research.

Abstract: Embodied action planning is a core challenge in robotics, requiring models to generate precise actions from visual observations and language instructions. While video generation world models are promising, their reliance on pixel-level reconstruction often introduces visual redundancies that hinder action decoding and generalization. Latent world models offer a compact, motion-aware representation, but overlook the fine-grained details critical for precise manipulation. To overcome these limitations, we propose MoWM, a mixture-of-world-model framework that fuses representations from hybrid world models for embodied action planning. Our approach combines motion-aware latent world model features with pixel-space features, enabling MoWM to emphasize action-relevant visual details for action decoding. Extensive evaluations on the CALVIN and real-world manipulation tasks demonstrate that our method achieves state-of-the-art task success rates and superior generalization. We also provide a comprehensive analysis of the strengths of each feature space, offering valuable insights for future research in embodied planning. The code is available at: https://github.com/tsinghua-fib-lab/MoWM.

[234] MultiMat: Multimodal Program Synthesis for Procedural Materials using Large Multimodal Models

Jonas Belouadi, Tamy Boubekeur, Adrien Kaiser

Main category: cs.CV

TL;DR: MultiMat: A multimodal program synthesis framework using large multimodal models to generate procedural material graphs from both visual and textual representations, outperforming text-only approaches.

DetailsMotivation: Creating procedural material node graphs is challenging and requires professional training. While neural program synthesis exists, current approaches only use textual representations, failing to capture the visual-spatial nature that makes node graphs accessible to humans.

Method: MultiMat leverages large multimodal models to process both visual and textual graph representations. The framework is trained on a dataset of production-quality procedural materials and uses a constrained tree search inference algorithm to ensure static correctness while efficiently navigating the program space.

Result: The multimodal program synthesis method is more efficient in both unconditional and conditional graph synthesis with higher visual quality and fidelity than text-only baselines, establishing new state-of-the-art performance.

Conclusion: Multimodal approaches that combine visual and textual representations are superior for procedural material graph synthesis compared to text-only methods, better capturing the visual-spatial nature of node graphs.

Abstract: Material node graphs are programs that generate the 2D channels of procedural materials, including geometry such as roughness and displacement maps, and reflectance such as albedo and conductivity maps. They are essential in computer graphics for representing the appearance of virtual 3D objects parametrically and at arbitrary resolution. In particular, their directed acyclic graph structure and intermediate states enable a modular, interpretable workflow for interactive appearance modeling. However, creating such graphs remains challenging and typically requires professional training. While recent neural program synthesis approaches attempt to simplify this process, they solely represent graphs as textual programs, failing to capture the inherently visual-spatial nature of node graphs that makes them accessible to humans. To address this gap, we present MultiMat, a multimodal program synthesis framework that leverages large multimodal models to process both visual and textual graph representations for improved generation of procedural material graphs. We train our models on a new dataset of production-quality procedural materials and combine them with a constrained tree search inference algorithm that ensures static correctness while efficiently navigating the program space. Our experimental results show that our multimodal program synthesis method is more efficient in both unconditional and conditional graph synthesis with higher visual quality and fidelity than text-only baselines, establishing new state-of-the-art performance.

[235] MILR: Improving Multimodal Image Generation via Test-Time Latent Reasoning

Yapeng Mi, Yanpeng Zhao, Hengli Li, Chenxi Li, Huimin Wu, Xiaojian Ma, Song-Chun Zhu, Ying Nian Wu, Qing Li

Main category: cs.CV

TL;DR: MILR is a test-time method for multimodal image generation that performs joint reasoning over image and text in a unified latent space using policy gradient optimization guided by an image quality critic.

DetailsMotivation: Existing reasoning-based image generation methods are limited to single modalities (image or text only) or require high-quality reasoning data for fine-tuning. There's a need for methods that can perform joint cross-modal reasoning without extensive training data.

Method: MILR operates at test time by searching through vector representations of discrete image and text tokens in a unified latent space. It uses policy gradient optimization guided by an image quality critic, implemented within the MUG framework which natively supports language reasoning before image synthesis.
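
The test-time search can be sketched as REINFORCE over a distribution of discrete tokens scored by a critic. Everything below is a toy stand-in for MILR's unified latent space and image-quality critic.

```python
# Toy REINFORCE over discrete tokens; the critic is a stand-in reward.
import torch

vocab, seq_len = 32, 8
logits = torch.zeros(seq_len, vocab, requires_grad=True)  # "latent plan" to optimize
opt = torch.optim.Adam([logits], lr=0.1)

def critic(tokens):        # stand-in: rewards a particular token pattern
    return (tokens == 7).float().mean()

for _ in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    tokens = dist.sample()                                # sample a candidate
    loss = -dist.log_prob(tokens).sum() * critic(tokens)  # policy gradient
    opt.zero_grad(); loss.backward(); opt.step()

best = logits.detach().argmax(dim=-1)
print("greedy tokens:", best.tolist(), "| reward:", critic(best).item())
```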

Result: Achieves state-of-the-art results on GenEval, T2I-CompBench, and WISE benchmarks. On knowledge-intensive WISE, attains 0.63 overall score, improving over baseline by 80%. Shows strong performance in temporal and cultural reasoning tasks.

Conclusion: Joint reasoning in unified latent space is key to MILR’s strong performance. The method demonstrates effective cross-modal reasoning capabilities without requiring fine-tuning on reasoning data, operating entirely at test time.

Abstract: Reasoning-augmented machine learning systems have shown improved performance in various domains, including image generation. However, existing reasoning-based methods for image generation either restrict reasoning to a single modality (image or text) or rely on high-quality reasoning data for fine-tuning. To tackle these limitations, we propose MILR, a test-time method that jointly reasons over image and text in a unified latent vector space. Reasoning in MILR is performed by searching through vector representations of discrete image and text tokens. Practically, this is implemented via the policy gradient method, guided by an image quality critic. We instantiate MILR within the unified multimodal understanding and generation (MUG) framework that natively supports language reasoning before image synthesis and thus facilitates cross-modal reasoning. The intermediate model outputs, which are to be optimized, serve as the unified latent space, enabling MILR to operate entirely at test time. We evaluate MILR on GenEval, T2I-CompBench, and WISE, achieving state-of-the-art results on all benchmarks. Notably, on knowledge-intensive WISE, MILR attains an overall score of 0.63, improving over the baseline by 80%. Our further analysis indicates that joint reasoning in the unified latent space is the key to its strong performance. Moreover, our qualitative studies reveal MILR’s non-trivial ability in temporal and cultural reasoning, highlighting the efficacy of our reasoning method.

[236] UGround: Towards Unified Visual Grounding with Unrolled Transformers

Rui Qian, Xin Yin, Chuanhang Deng, Zhiyuan Peng, Jian Xiong, Wei Zhai, Dejing Dou

Main category: cs.CV

TL;DR: UGround introduces a unified visual grounding paradigm that dynamically selects intermediate transformer layers and uses “mask as prompt” in place of the fixed last hidden layer, addressing error propagation and the lack of explicit spatial cues in current methods.

DetailsMotivation: Current visual grounding methods rely on the fixed last hidden layer, which amplifies cumulative errors through layer-by-layer propagation without correction, and use <SEG> tokens as prompts that implicitly project textual embeddings into visual space without explicit spatial cues such as coordinates.

Method: UGround uses Policy-Prompted Masking with two components: 1) Stochastic Skip Connection (SSC), a reinforcement learning policy that lets <SEG> tokens dynamically slide across transformer layers to form skip connections to the vision model, and 2) Mask as Prompt (MasP), which uses the similarity map between <SEG> tokens and image tokens as a soft logit mask to prompt SAM for mask generation with explicit spatial cues.
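
A brief sketch of the MasP step: the similarity map between a segmentation token and the image patch tokens becomes a soft spatial prompt for a promptable segmenter. The shapes, the scaling, and the downstream segmenter interface are assumptions.

```python
# "Mask as prompt" sketch; shapes and scaling are illustrative assumptions.
import torch
import torch.nn.functional as F

h, w, dim = 16, 16, 64
seg_token = torch.randn(dim)              # hidden state of a segmentation token
image_tokens = torch.randn(h * w, dim)    # patch tokens from the vision model

sim = image_tokens @ seg_token / dim ** 0.5     # (h*w,) similarity logits
soft_mask = sim.view(1, 1, h, w).sigmoid()      # soft logit mask over patches
prompt = F.interpolate(soft_mask, size=(256, 256), mode="bilinear")
print(prompt.shape)  # this map would prompt a segmenter such as SAM
```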

Result: UGround unifies visual grounding within a single framework from an attribute perspective, spanning traditional refer expression segmentation to reasoning segmentation, single-target to multi-target, and positive query to false premise scenarios.

Conclusion: UGround provides a more flexible and effective approach to visual grounding by dynamically selecting intermediate layers and using explicit spatial cues, addressing limitations of current fixed-layer approaches.

Abstract: We present UGround, a Unified visual Grounding paradigm that dynamically selects intermediate layers across unrolled transformers as “mask as prompt”, diverging from the prevailing pipeline that leverages the fixed last hidden layer as “<SEG> as prompt”. UGround addresses two primary challenges posed by the prevailing paradigm: (1) its reliance on the fixed last hidden layer, which sequentially amplifies cumulative errors arising from layer-by-layer propagation without intermediate correction, and (2) its use of <SEG> as a prompt, which implicitly projects textual embeddings into visual space without explicit spatial cues (e.g., coordinates). Central to UGround is Policy-Prompted Masking, which comprises two key components: Stochastic Skip Connection (SSC) and Mask as Prompt (MasP). SSC is a reinforcement learning policy that, via stochastic sampling, allows each <SEG> token to slide across unrolled transformer layers, enabling dynamic selection of the layer at which it connects to the vision model (e.g., SAM) in a skip-connection fashion. Given the selected hidden layer, MasP uses the similarity map derived from the <SEG> token and image tokens as a soft logit mask to prompt SAM for mask generation, offering explicit spatial cues through its activation regions. To validate the effectiveness of UGround, we, for the first time, unify visual grounding within a single framework from an attribute perspective, spanning traditional referring expression segmentation to the newly proposed reasoning segmentation, single-target to multi-target, and positive query to false premise (empty target). All code and models are publicly available at https://github.com/rui-qian/UGround.

[237] Machine Learning Detection of Road Surface Conditions: A Generalizable Model using Traffic Cameras and Weather Data

Carly Sutter, Kara J. Sulia, Nick P. Bassill, Christopher D. Wirz, Christopher D. Thorncroft, Jay C. Rothenberger, Vanessa Przybylo, Mariana G. Cains, Jacob Radford, David Aaron Evans

Main category: cs.CV

TL;DR: Machine learning models (CNNs and random forests) trained on NYSDOT camera images and weather data to automatically classify road surface conditions for winter weather operations.

DetailsMotivation: To support transportation agencies in making critical operational decisions during hazardous weather events by automatically classifying road conditions across the state, improving spatial and temporal awareness for better resource allocation and traveler safety.

Method: Developed convolutional neural networks and random forests trained on ~22,000 hand-labeled roadside camera images from NYSDOT, combined with weather data, to predict six road surface conditions: severe snow, snow, wet, dry, poor visibility, or obstructed.
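
As a hedged sketch of the random-forest branch, the snippet below fuses image-derived features with weather features and classifies into the paper's six surface conditions. The data is synthetic and the feature layout is an assumption.

```python
# Synthetic data; feature dimensions and the image embedding are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
img_feats = rng.normal(size=(n, 32))   # e.g., a CNN embedding of the camera image
weather = rng.normal(size=(n, 4))      # e.g., temperature, precipitation, wind, visibility
X = np.hstack([img_feats, weather])
classes = ["severe snow", "snow", "wet", "dry", "poor visibility", "obstructed"]
y = rng.integers(0, len(classes), size=n)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print("predicted condition:", classes[int(clf.predict(X[:1])[0])])
```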

Result: The weather-related road surface condition model achieves 81.5% accuracy on completely unseen cameras, demonstrating good generalizability for operational deployment.

Conclusion: The model has potential to improve decision-making for operations, roadway maintenance, and traveler safety during winter weather events through automated road condition classification.

Abstract: Transportation agencies make critical operational decisions during hazardous weather events, including assessment of road conditions and resource allocation. In this study, machine learning models are developed to provide additional support for the New York State Department of Transportation (NYSDOT) by automatically classifying current road conditions across the state. Convolutional neural networks and random forests are trained on NYSDOT roadside camera images and weather data to predict road surface conditions. This task draws critically on a robust hand-labeled dataset of ~22,000 camera images containing six road surface conditions: severe snow, snow, wet, dry, poor visibility, or obstructed. Model generalizability is prioritized to meet the operational needs of the NYSDOT decision makers, including integration of operational datasets and use of representative and realistic images. The weather-related road surface condition model in this study achieves an accuracy of 81.5% on completely unseen cameras. With operational deployment, this model has the potential to improve spatial and temporal awareness of road surface conditions, which can strengthen decision-making for operations, roadway maintenance, and traveler safety, particularly during winter weather events.

[238] SNAP: Towards Segmenting Anything in Any Point Cloud

Aniket Gupta, Hanhui Wang, Charles Saunders, Aruni RoyChowdhury, Hanumant Singh, Huaizu Jiang

Main category: cs.CV

TL;DR: SNAP is a unified 3D point cloud segmentation model supporting both point-based and text-based prompts across diverse domains (indoor, outdoor, aerial), achieving state-of-the-art zero-shot performance through domain-adaptive normalization and CLIP-based text matching.

DetailsMotivation: Current interactive 3D segmentation approaches are limited to single domains and single interaction types, with training on multiple datasets causing negative transfer. There's a need for a unified model that works across domains with multiple prompt types.

Method: Trains on 7 datasets spanning indoor, outdoor, and aerial environments using domain-adaptive normalization to prevent negative transfer. For text prompts, automatically generates mask proposals and matches them against CLIP embeddings of textual queries for panoptic and open-vocabulary segmentation.
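
The text-prompted matching step can be sketched as cosine similarity between mask-proposal embeddings and text embeddings; the encoders below are random stand-ins for CLIP.

```python
# Random stand-ins for CLIP embeddings of mask proposals and text queries.
import torch
import torch.nn.functional as F

num_proposals, num_queries, dim = 10, 3, 64
proposal_emb = F.normalize(torch.randn(num_proposals, dim), dim=-1)
query_emb = F.normalize(torch.randn(num_queries, dim), dim=-1)

sim = proposal_emb @ query_emb.T     # cosine similarities, shape (P, Q)
best = sim.argmax(dim=0)             # best proposal per text query
for q, p in enumerate(best.tolist()):
    print(f"query {q} -> proposal {p} (score {sim[p, q].item():.3f})")
```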

Result: Achieves state-of-the-art on 8 out of 9 zero-shot benchmarks for spatial-prompted segmentation and competitive results on all 5 text-prompted benchmarks, demonstrating a unified model can match or exceed specialized domain-specific approaches.

Conclusion: SNAP provides a practical tool for scalable 3D annotation through a unified model that supports diverse interaction types across multiple domains while preventing negative transfer.

Abstract: Interactive 3D point cloud segmentation enables efficient annotation of complex 3D scenes through user-guided prompts. However, current approaches are typically restricted in scope to a single domain (indoor or outdoor), and to a single form of user interaction (either spatial clicks or textual prompts). Moreover, training on multiple datasets often leads to negative transfer, resulting in domain-specific tools that lack generalizability. To address these limitations, we present SNAP (Segment aNything in Any Point cloud), a unified model for interactive 3D segmentation that supports both point-based and text-based prompts across diverse domains. Our approach achieves cross-domain generalizability by training on 7 datasets spanning indoor, outdoor, and aerial environments, while employing domain-adaptive normalization to prevent negative transfer. For text-prompted segmentation, we automatically generate mask proposals without human intervention and match them against CLIP embeddings of textual queries, enabling both panoptic and open-vocabulary segmentation. Extensive experiments demonstrate that SNAP consistently delivers high-quality segmentation results. We achieve state-of-the-art performance on 8 out of 9 zero-shot benchmarks for spatial-prompted segmentation and demonstrate competitive results on all 5 text-prompted benchmarks. These results show that a unified model can match or exceed specialized domain-specific approaches, providing a practical tool for scalable 3D annotation. The project page is at https://neu-vi.github.io/SNAP/.

[239] On the Provable Importance of Gradients for Language-Assisted Image Clustering

Bo Peng, Jie Lu, Guangquan Zhang, Zhen Fang

Main category: cs.CV

TL;DR: GradNorm: A gradient-based framework for Language-assisted Image Clustering that uses gradient magnitudes to filter positive nouns from unlabeled text data, with theoretical guarantees and state-of-the-art performance.

DetailsMotivation: Existing methods for Language-assisted Image Clustering rely on CLIP feature spaces without theoretical foundation. The core challenge is filtering positive nouns (semantically close to target images) from unlabeled corpus data.

Method: Proposes GradNorm framework that measures noun positiveness based on gradient magnitudes back-propagated from cross-entropy between predicted target distribution and softmax output. Provides theoretical error bounds and proves existing strategies are special cases.
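
A minimal sketch of the scoring idea: back-propagate the cross-entropy between a predicted target distribution (assumed uniform here) and the softmax output, and use the gradient magnitude as the score. The linear similarity head is a toy stand-in, and the per-noun bookkeeping of the actual method is omitted.

```python
# Toy gradient-magnitude score; the uniform target and linear head are assumptions.
import torch
import torch.nn.functional as F

num_nouns, dim = 5, 16
image_feat = torch.randn(dim)
head = torch.nn.Linear(dim, num_nouns)    # stand-in image-noun similarity head

logits = head(image_feat)
target = torch.full((num_nouns,), 1.0 / num_nouns)   # assumed uniform target
loss = F.cross_entropy(logits.unsqueeze(0), target.unsqueeze(0))
grads = torch.autograd.grad(loss, head.parameters())
score = torch.cat([g.flatten() for g in grads]).norm()
print(f"gradient-magnitude score: {score.item():.4f}")
```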

Result: Extensive experiments show GradNorm achieves state-of-the-art clustering performance on various benchmarks, demonstrating strong empirical performance.

Conclusion: GradNorm provides a theoretically grounded, gradient-based solution for filtering positive nouns in Language-assisted Image Clustering, outperforming existing methods and offering theoretical guarantees.

Abstract: This paper investigates the recently emerged problem of Language-assisted Image Clustering (LaIC), where textual semantics are leveraged to improve the discriminability of visual representations to facilitate image clustering. Due to the unavailability of true class names, one of the core challenges of LaIC lies in how to filter positive nouns, i.e., those semantically close to the images of interest, from unlabeled wild corpus data. Existing filtering strategies are predominantly based on the off-the-shelf feature space learned by CLIP; however, despite being intuitive, these strategies lack a rigorous theoretical foundation. To fill this gap, we propose a novel gradient-based framework, termed GradNorm, which is theoretically guaranteed and shows strong empirical performance. In particular, we measure the positiveness of each noun based on the magnitude of gradients back-propagated from the cross-entropy between the predicted target distribution and the softmax output. Theoretically, we provide a rigorous error bound to quantify the separability of positive nouns by GradNorm and prove that GradNorm naturally subsumes existing filtering strategies as special cases of itself. Empirically, extensive experiments show that GradNorm achieves state-of-the-art clustering performance on various benchmarks.

[240] SHIELD: Suppressing Hallucinations In LVLM Encoders via Bias and Vulnerability Defense

Yiyang Huang, Liang Shi, Yitian Zhang, Yi Xu, Yun Fu

Main category: cs.CV

TL;DR: SHIELD is a training-free framework that addresses object hallucination in Large Vision-Language Models by targeting visual encoder issues through re-weighting, noise-derived tokens, and adversarial attacks with contrastive decoding.

DetailsMotivation: Object hallucination in LVLMs (producing plausible but inaccurate object descriptions) remains a major challenge. Previous work focused on LLM components, but this paper identifies that visual encoders are the primary source of hallucinations due to statistical bias, inherent bias, and vulnerability issues.

Method: SHIELD uses three training-free strategies: 1) Re-weighting visual tokens to reduce statistical bias, 2) Introducing noise-derived tokens to counter inherent bias in visual encoders, and 3) Applying adversarial attacks with contrastive decoding to address model vulnerability.
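
The contrastive-decoding part of strategy 3 can be sketched as below: logits conditioned on an adversarially perturbed image are subtracted from the clean-image logits, so tokens supported only by language priors are damped. The HF-style `model(input_ids=..., pixel_values=...)` signature and the weight `alpha` are assumptions for illustration.

```python
import torch

@torch.no_grad()
def contrastive_step(model, input_ids, pixel_values, pixel_values_adv, alpha=1.0):
    """One decoding step: contrast clean-image logits against logits from an
    adversarially perturbed image, damping prior-driven tokens."""
    logits = model(input_ids=input_ids, pixel_values=pixel_values).logits[:, -1]
    logits_adv = model(input_ids=input_ids, pixel_values=pixel_values_adv).logits[:, -1]
    contrasted = (1 + alpha) * logits - alpha * logits_adv
    return contrasted.argmax(dim=-1)             # next-token ids
```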

Result: SHIELD effectively mitigates object hallucinations across diverse benchmarks and LVLM families. It also achieves strong performance on general LVLM benchmarks, demonstrating broad applicability beyond just hallucination reduction.

Conclusion: Visual encoders are a key source of hallucinations in LVLMs, and SHIELD provides an effective training-free solution that addresses this issue while maintaining general performance.

Abstract: Large Vision-Language Models (LVLMs) excel in diverse cross-modal tasks. However, object hallucination, where models produce plausible but inaccurate object descriptions, remains a significant challenge. In contrast to previous work focusing on LLM components, this paper is the first to trace LVLM hallucinations to visual encoders and identifies three key issues: statistical bias, inherent bias, and vulnerability. To address these challenges, we propose SHIELD, a training-free framework that mitigates hallucinations through three strategies: re-weighting visual tokens to reduce statistical bias, introducing noise-derived tokens to counter inherent bias, and applying adversarial attacks with contrastive decoding to address vulnerability. Experiments demonstrate that SHIELD effectively mitigates object hallucinations across diverse benchmarks and LVLM families. Moreover, SHIELD achieves strong performance on the general LVLM benchmark, highlighting its broad applicability. Code is available at https://github.com/hukcc/SHIELD.

[241] GenTrack2: An Improved Hybrid Approach for Multi-Object Tracking

Toan Van Nguyen, Rasmus G. K. Christiansen, Dirk Kraft, Leon Bodenhagen

Main category: cs.CV

TL;DR: A visual multi-object tracking method combining stochastic particle filters with deterministic association for consistent identity tracking under nonlinear dynamics and varying target counts.

DetailsMotivation: To address challenges in multi-object tracking including nonlinear dynamics, non-Gaussian noise, time-varying target numbers, and identity preservation during occlusions and interactions.

Method: Uses stochastic particle filter with PSO optimization for nonlinear dynamics, deterministic association with cost matrix for identity consistency, velocity regression for trend prediction, and smooth state updating for weak tracks.
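
A minimal sketch of the deterministic association stage, assuming a cost matrix built from spatial distance, detection confidence, and per-track penalties (the paper's cost also incorporates particle-detection consistency, and the weights here are illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_pos, det_pos, det_conf, track_penalty, w_spatial=1.0, w_conf=0.5):
    """Assign detections to tracks by minimizing a combined cost matrix.
    track_pos: (T, 2) predicted track centers; det_pos: (D, 2) detection
    centers; det_conf: (D,) confidences; track_penalty: (T,) penalties."""
    spatial = np.linalg.norm(track_pos[:, None] - det_pos[None, :], axis=-1)  # (T, D)
    cost = w_spatial * spatial + w_conf * (1.0 - det_conf)[None, :] \
           + track_penalty[:, None]
    rows, cols = linear_sum_assignment(cost)     # optimal one-to-one matching
    return list(zip(rows.tolist(), cols.tolist()))
```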

Result: Experimental results show superior performance compared to state-of-the-art trackers, with code available on GitHub.

Conclusion: The proposed hybrid stochastic-deterministic approach effectively handles complex tracking scenarios with identity preservation and works for both pre-recorded and live video streams.

Abstract: This paper proposes a visual multi-object tracking method that jointly employs stochastic and deterministic mechanisms to ensure identifier consistency for unknown and time-varying target numbers under nonlinear dynamics. A stochastic particle filter addresses nonlinear dynamics and non-Gaussian noise, with support from particle swarm optimization (PSO) to guide particles toward state distribution modes and mitigate divergence through proposed fitness measures incorporating motion consistency, appearance similarity, and social-interaction cues with neighboring targets. Deterministic association further enforces identifier consistency via a proposed cost matrix incorporating spatial consistency between particles and current detections, detection confidences, and track penalties. Subsequently, a novel scheme is proposed for the smooth updating of target states while preserving their identities, particularly for weak tracks during interactions with other targets and prolonged occlusions. Moreover, velocity regression over past states provides trend-seed velocities, enhancing particle sampling and state updates. The proposed tracker is designed to operate flexibly for both pre-recorded videos and camera live streams, where future frames are unavailable. Experimental results confirm superior performance compared to state-of-the-art trackers. Reference implementations of both the proposed method and the compared trackers are provided on GitHub: https://github.com/SDU-VelKoTek/GenTrack2

[242] MediRound: Multi-Round Entity-Level Reasoning Segmentation in Medical Images

Qinyue Tong, Ziqian Lu, Jun Liu, Rui Zuo, Zheming Lu

Main category: cs.CV

TL;DR: MEMR-Seg introduces multi-round entity-level reasoning for medical image segmentation, addressing limitations of single-round dialogue approaches with a new dataset and model featuring error correction mechanisms.

DetailsMotivation: Existing medical image segmentation methods lack interactivity and multi-round reasoning capabilities. While text-prompt approaches enable user-driven segmentation, they're limited to single-round dialogues and cannot perform complex multi-round reasoning about medical entities.

Method: Proposes MEMR-Seg task for multi-round entity-level medical reasoning segmentation, constructs MR-MedSeg dataset (177K multi-round medical segmentation dialogues), develops MediRound baseline model with lightweight Judgment & Correction Mechanism to mitigate error propagation in chain-like multi-round pipelines.
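
The chain-like multi-round pipeline with a Judgment & Correction step might look like the following sketch, where `model.segment`, `judge`, and `correct` are hypothetical callables standing in for the paper's components:

```python
def multi_round_segment(model, image, queries, judge, correct):
    """Run chained multi-round reasoning segmentation; a failed judgment
    triggers correction before the mask enters the dialogue history."""
    history, masks = [], []
    for query in queries:
        mask = model.segment(image, query, history)  # condition on earlier rounds
        if not judge(image, query, mask):            # lightweight plausibility check
            mask = correct(image, query, mask)       # repair before propagation
        history.append((query, mask))                # errors no longer cascade
        masks.append(mask)
    return masks
```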

Result: Experimental results show the method effectively addresses MEMR-Seg task and outperforms conventional medical referring segmentation methods.

Conclusion: The work successfully introduces multi-round reasoning capabilities to medical image segmentation, enabling more interactive and sophisticated entity-level reasoning through dialogue-based approaches.

Abstract: Despite the progress in medical image segmentation, most existing methods remain task-specific and lack interactivity. Although recent text-prompt-based segmentation approaches enhance user-driven and reasoning-based segmentation, they remain confined to single-round dialogues and fail to perform multi-round reasoning. In this work, we introduce Multi-Round Entity-Level Medical Reasoning Segmentation (MEMR-Seg), a new task that requires generating segmentation masks through multi-round queries with entity-level reasoning. To support this task, we construct MR-MedSeg, a large-scale dataset of 177K multi-round medical segmentation dialogues, featuring entity-based reasoning across rounds. Furthermore, we propose MediRound, an effective baseline model designed for multi-round medical reasoning segmentation. To mitigate the inherent error propagation in the chain-like pipeline of multi-round segmentation, we introduce a lightweight yet effective Judgment & Correction Mechanism during model inference. Experimental results demonstrate that our method effectively addresses the MEMR-Seg task and outperforms conventional medical referring segmentation methods.

[243] UniFit: Towards Universal Virtual Try-on with MLLM-Guided Semantic Alignment

Wei Zhang, Yeying Jin, Xin Li, Yan Zhang, Xiaofeng Cong, Cong Wang, Fengcai Qiao, Zhichao Lian

Main category: cs.CV

TL;DR: UniFit: A universal virtual try-on framework using Multimodal Large Language Models to bridge semantic gaps between text instructions and reference images, enabling flexible handling of diverse try-on tasks.

DetailsMotivation: Current virtual try-on methods struggle with semantic gaps between text instructions and reference images, and face data scarcity in complex scenarios. There's a need for a universal framework that can handle diverse try-on tasks flexibly.

Method: Proposes UniFit with MLLM-Guided Semantic Alignment Module (MGSA) that integrates multimodal inputs using an MLLM and learnable queries with semantic alignment loss. Uses two-stage progressive training with self-synthesis pipeline to learn from limited data.
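
A loose PyTorch sketch of the MGSA idea: learnable queries cross-attend over MLLM hidden states to produce guidance tokens, and a cosine alignment loss pulls the pooled guidance toward a reference embedding. Dimensions, head count, and the specific alignment target are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MGSASketch(nn.Module):
    """Learnable queries attend over MLLM hidden states to produce guidance
    tokens for the try-on generator."""
    def __init__(self, dim: int = 1024, n_queries: int = 16, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, mllm_states: torch.Tensor) -> torch.Tensor:  # (B, L, dim)
        q = self.queries.expand(mllm_states.size(0), -1, -1)
        guidance, _ = self.attn(q, mllm_states, mllm_states)
        return guidance                                            # (B, n_queries, dim)

def alignment_loss(guidance: torch.Tensor, ref_embed: torch.Tensor) -> torch.Tensor:
    """Cosine loss pulling pooled guidance toward a reference embedding."""
    pooled = F.normalize(guidance.mean(dim=1), dim=-1)
    return 1.0 - (pooled * F.normalize(ref_embed, dim=-1)).sum(-1).mean()
```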

Result: UniFit supports wide range of VTON tasks including multi-garment and model-to-model try-on, achieving state-of-the-art performance in extensive experiments.

Conclusion: UniFit successfully addresses semantic gap and data scarcity challenges in virtual try-on through MLLM integration and progressive training, creating a universal framework for diverse try-on tasks.

Abstract: Image-based virtual try-on (VTON) aims to synthesize photorealistic images of a person wearing specified garments. Despite significant progress, building a universal VTON framework that can flexibly handle diverse and complex tasks remains a major challenge. Recent methods explore multi-task VTON frameworks guided by textual instructions, yet they still face two key limitations: (1) semantic gap between text instructions and reference images, and (2) data scarcity in complex scenarios. To address these challenges, we propose UniFit, a universal VTON framework driven by a Multimodal Large Language Model (MLLM). Specifically, we introduce an MLLM-Guided Semantic Alignment Module (MGSA), which integrates multimodal inputs using an MLLM and a set of learnable queries. By imposing a semantic alignment loss, MGSA captures cross-modal semantic relationships and provides coherent and explicit semantic guidance for the generative process, thereby reducing the semantic gap. Moreover, by devising a two-stage progressive training strategy with a self-synthesis pipeline, UniFit is able to learn complex tasks from limited data. Extensive experiments show that UniFit not only supports a wide range of VTON tasks, including multi-garment and model-to-model try-on, but also achieves state-of-the-art performance. The source code and pretrained models are available at https://github.com/zwplus/UniFit.

[244] VLA-Pruner: Temporal-Aware Dual-Level Visual Token Pruning for Efficient Vision-Language-Action Inference

Ziyan Liu, Yeqiu Chen, Hongyi Cai, Tao Lin, Shuo Yang, Zheng Liu, Bo Zhao

Main category: cs.CV

TL;DR: VLA-Pruner: A dual-system token pruning method for Vision-Language-Action models that preserves both semantic and action-critical visual information for efficient real-time deployment.

DetailsMotivation: Current token pruning methods for Vision-Language Models focus only on semantic salience, overlooking the dual-system nature of VLA models that require both high-level semantic understanding and low-level action execution, leading to degraded performance when critical action information is discarded.

Method: Proposes VLA-Pruner with dual-level importance criterion: vision-language prefill attention for semantic relevance and action decode attention (estimated via temporal smoothing) for action-level importance. Uses adaptive dual-level token selection strategy to preserve compact, informative visual tokens for both semantic understanding and action execution within compute budgets.
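
The dual-level criterion reduces to a small amount of bookkeeping; a sketch, where `lam` balances the two attention sources and the EMA momentum `m` implements the temporal smoothing (both values are assumed):

```python
import torch

def select_tokens(prefill_attn, decode_attn, ema, budget, lam=0.5, m=0.9):
    """Keep the `budget` visual tokens scoring highest under the combined
    semantic (prefill) and temporally smoothed action (decode) criterion.
    All tensors are (B, n_tokens) attention-derived saliency scores."""
    ema = m * ema + (1 - m) * decode_attn           # temporal smoothing of action attention
    score = lam * prefill_attn + (1 - lam) * ema    # dual-level importance
    keep = score.topk(budget, dim=-1).indices
    return keep.sort(dim=-1).values, ema            # preserve original token order
```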

Result: Achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks, demonstrating effective acceleration while maintaining both semantic understanding and action generation capabilities.

Conclusion: VLA-Pruner successfully addresses the limitations of VLM-specific token pruning by aligning with VLA’s dual-system nature, enabling efficient real-time deployment of embodied AI systems without sacrificing performance.

Abstract: Vision-Language-Action (VLA) models have shown great promise for embodied AI, yet the heavy computational cost of processing continuous visual streams severely limits their real-time deployment. Token pruning (keeping salient visual tokens and dropping redundant ones) has emerged as an effective approach for accelerating Vision-Language Models (VLMs), offering a solution for efficient VLA. However, these VLM-specific token pruning methods select tokens based solely on semantic salience metrics (e.g., prefill attention), while overlooking the VLA’s intrinsic dual-system nature of high-level semantic understanding and low-level action execution. Consequently, these methods bias token retention toward semantic cues, discard critical information for action generation, and significantly degrade VLA performance. To bridge this gap, we propose VLA-Pruner, a versatile plug-and-play VLA-specific token pruning method that aligns with the dual-system nature of VLA models and exploits the temporal continuity in robot manipulation. Specifically, VLA-Pruner adopts a dual-level importance criterion for visual token retention: vision-language prefill attention for semantic-level relevance and action decode attention, estimated via temporal smoothing, for action-level importance. Based on this criterion, VLA-Pruner employs a novel dual-level token selection strategy that adaptively preserves a compact, informative set of visual tokens for both semantic understanding and action execution under a given compute budget. Experiments show that VLA-Pruner achieves state-of-the-art performance across multiple VLA architectures and diverse robotic tasks.

[245] Not All Pixels Are Equal: Pixel-wise Meta-Learning for Medical Segmentation with Noisy Labels

Chenyu Mu, Guihai Chen, Xun Yang, Erkun Yang, Cheng Deng

Main category: cs.CV

TL;DR: MetaDCSeg: A robust medical image segmentation framework that dynamically learns pixel-wise weights to handle noisy annotations and ambiguous boundaries using a Dynamic Center Distance mechanism.

DetailsMotivation: Medical image segmentation faces challenges from noisy annotations and ambiguous anatomical boundaries in real-world scenarios. Existing methods fail to address pixel-wise heterogeneity and local variations, particularly at boundaries.

Method: Proposes MetaDCSeg framework with Dynamic Center Distance (DCD) mechanism that learns optimal pixel-wise weights to suppress noisy labels while preserving reliable annotations. Uses weighted feature distances for foreground, background, and boundary centers to focus on hard-to-segment pixels near ambiguous boundaries.
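
A rough sketch of Dynamic-Center-Distance weighting under assumed conventions: pixels whose features sit far from the foreground/background centers but close to the boundary center receive larger weights. The sign pattern and sigmoid normalization are illustrative choices, not the paper's exact formula.

```python
import torch

def dcd_weights(feats, fg_center, bg_center, bd_center, alphas=(1.0, 1.0, 2.0)):
    """Pixel-wise weights from distances to foreground, background, and
    boundary centers. feats: (B, C, H, W); centers: (C,)."""
    def dist(center):
        return (feats - center[None, :, None, None]).norm(dim=1)   # (B, H, W)
    # Larger weight for pixels near the boundary center, i.e. the
    # hard-to-segment region around ambiguous boundaries.
    dcd = alphas[0] * dist(fg_center) + alphas[1] * dist(bg_center) \
          - alphas[2] * dist(bd_center)
    return torch.sigmoid(dcd)                                       # weights in (0, 1)
```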

Result: Extensive experiments across four benchmark datasets with varying noise levels show MetaDCSeg outperforms existing state-of-the-art methods.

Conclusion: MetaDCSeg effectively handles noisy labels and boundary ambiguities in medical image segmentation through dynamic pixel-wise weighting and explicit boundary uncertainty modeling.

Abstract: Medical image segmentation is crucial for clinical applications, but it is frequently disrupted by noisy annotations and ambiguous anatomical boundaries, limiting its application in real-world scenarios. Existing methods often directly adapt noisy label learning techniques designed for instance classification, overlooking the pixel-wise heterogeneity in medical segmentation with its spatially and anatomically varying difficulties. Consequently, global assumptions or simple confidence metrics fail to address these local variations, leaving boundary ambiguities unresolved. To address this issue, we propose MetaDCSeg, a robust framework that dynamically learns optimal pixel-wise weights to suppress the influence of noisy labels while preserving reliable annotations. By explicitly modeling boundary uncertainty through a Dynamic Center Distance (DCD) mechanism, our approach utilizes weighted feature distances for foreground, background, and boundary centers, directing the model’s attention toward hard-to-segment pixels near ambiguous boundaries. This strategy enables more precise handling of structural boundaries, which are often overlooked by existing methods, and significantly enhances segmentation performance. Extensive experiments across four benchmark datasets with varying noise levels demonstrate that MetaDCSeg outperforms existing state-of-the-art methods.

[246] Hunyuan-GameCraft-2: Instruction-following Interactive Game World Model

Junshu Tang, Jiacheng Liu, Jiaqi Li, Longhuang Wu, Haoyu Yang, Penghao Zhao, Siruis Gong, Xiang Yuan, Shuai Shao, Linfeng Zhang, Qinglin Lu

Main category: cs.CV

TL;DR: Hunyuan-GameCraft-2 enables natural language and input device control of generated game videos through instruction-driven interaction modeling

DetailsMotivation: Current generative world models have rigid action schemas and high annotation costs, limiting their ability to model diverse in-game interactions and player-driven dynamics

Method: Developed automated process to transform unstructured text-video pairs into causally aligned interactive datasets; built on 14B image-to-video MoE foundation model with text-driven interaction injection mechanism for fine-grained control over camera, character, and environment dynamics

Result: Model generates temporally coherent and causally grounded interactive game videos that faithfully respond to diverse free-form user instructions; evaluated on new interaction-focused benchmark InterBench

Conclusion: Introduces a new paradigm of instruction-driven interaction for generative game world modeling that enables flexible, semantically rich interaction through natural language and input devices

Abstract: Recent advances in generative world models have enabled remarkable progress in creating open-ended game environments, evolving from static scene synthesis toward dynamic, interactive simulation. However, current approaches remain limited by rigid action schemas and high annotation costs, restricting their ability to model diverse in-game interactions and player-driven dynamics. To address these challenges, we introduce Hunyuan-GameCraft-2, a new paradigm of instruction-driven interaction for generative game world modeling. Instead of relying on fixed keyboard inputs, our model allows users to control game video content through natural language prompts, keyboard, or mouse signals, enabling flexible and semantically rich interaction within generated worlds. We formally define the concept of interactive video data and develop an automated process to transform large-scale, unstructured text-video pairs into causally aligned interactive datasets. Built upon a 14B image-to-video Mixture-of-Experts (MoE) foundation model, our model incorporates a text-driven interaction injection mechanism for fine-grained control over camera motion, character behavior, and environment dynamics. We introduce an interaction-focused benchmark, InterBench, to evaluate interaction performance comprehensively. Extensive experiments demonstrate that our model generates temporally coherent and causally grounded interactive game videos that faithfully respond to diverse and free-form user instructions such as “open the door”, “draw a torch”, or “trigger an explosion”.

[247] Temporal Concept Dynamics in Diffusion Models via Prompt-Conditioned Interventions

Ada Gorgun, Fawaz Sammani, Nikos Deligiannis, Bernt Schiele, Jonas Fischer

Main category: cs.CV

TL;DR: PCI is a training-free framework that analyzes when concepts form during diffusion model generation by measuring concept insertion success across timesteps, revealing temporal dynamics and improving editing interventions.

DetailsMotivation: Current diffusion model evaluation focuses on final outputs, but understanding the dynamic generation process is crucial for controllability, reliability, and predictability. The paper investigates when specific concepts form and lock in during denoising trajectories.

Method: Proposes Prompt-Conditioned Intervention (PCI), a training-free, model-agnostic framework that analyzes Concept Insertion Success (CIS) - the probability that a concept inserted at a given timestep is preserved in the final image. This characterizes temporal dynamics of concept formation across diffusion models.
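
Measuring CIS amounts to a prompt-switch experiment repeated over seeds; a sketch, where `generate` (a diffusion pipeline that swaps the prompt at step `t_switch`) and `detects` (a concept classifier on the final image) are hypothetical callables:

```python
def concept_insertion_success(generate, detects, base_prompt, concept_prompt,
                              timesteps, n_seeds=32):
    """Estimate CIS(t): the fraction of runs in which a concept inserted at
    denoising step t is still present in the final image."""
    cis = {}
    for t in timesteps:
        hits = 0
        for seed in range(n_seeds):
            # Denoise under base_prompt, switch to concept_prompt at step t.
            image = generate(base_prompt, concept_prompt, t_switch=t, seed=seed)
            hits += int(detects(image))
        cis[t] = hits / n_seeds
    return cis
```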

Result: Reveals diverse temporal behaviors across diffusion models, showing certain phases are more favorable to specific concepts even within the same concept type. Provides actionable insights for text-driven image editing, yielding quantitatively stronger edits with better semantic accuracy and content preservation than baselines.

Conclusion: PCI offers a principled way to analyze concept dynamics in diffusion models without requiring model internals or training. The temporal insights enable more effective interventions for image editing and improve understanding of generation processes.

Abstract: Diffusion models are usually evaluated by their final outputs, gradually denoising random noise into meaningful images. Yet, generation unfolds along a trajectory, and analyzing this dynamic process is crucial for understanding how controllable, reliable, and predictable these models are in terms of their success/failure modes. In this work, we ask the question: when does noise turn into a specific concept (e.g., age) and lock in the denoising trajectory? We propose PCI (Prompt-Conditioned Intervention) to study this question. PCI is a training-free and model-agnostic framework for analyzing concept dynamics through diffusion time. The central idea is the analysis of Concept Insertion Success (CIS), defined as the probability that a concept inserted at a given timestep is preserved and reflected in the final image, offering a way to characterize the temporal dynamics of concept formation. Applied to several state-of-the-art text-to-image diffusion models and a broad taxonomy of concepts, PCI reveals diverse temporal behaviors across diffusion models, in which certain phases of the trajectory are more favorable to specific concepts even within the same concept type. These findings also provide actionable insights for text-driven image editing, highlighting when interventions are most effective without requiring access to model internals or training, and yielding quantitatively stronger edits that achieve a better balance of semantic accuracy and content preservation than strong baselines. Code is available at: https://adagorgun.github.io/PCI-Project/

[248] OpenMonoGS-SLAM: Monocular Gaussian Splatting SLAM with Open-set Semantics

Jisang Yoo, Gyeongjin Kang, Hyun-kyu Ko, Hyeonwoo Yu, Eunbyung Park

Main category: cs.CV

TL;DR: OpenMonoGS-SLAM: A monocular SLAM framework combining 3D Gaussian Splatting with open-set semantic understanding using visual foundation models, operating without depth sensors or semantic ground truth.

DetailsMotivation: To address limitations in current SLAM systems that rely on depth sensors or closed-set semantic models, which restrict scalability and adaptability in open-world environments. The goal is to create a more flexible system that can understand semantics in diverse, unstructured settings.

Method: Unifies 3D Gaussian Splatting with open-set semantic understanding using Visual Foundation Models (MASt3R for geometry, SAM and CLIP for semantics). Uses self-supervised learning without depth input or 3D semantic ground truth. Includes a memory mechanism for managing high-dimensional semantic features to construct Gaussian semantic feature maps.

Result: Achieves performance comparable to or surpassing existing baselines in both closed-set and open-set segmentation tasks, without relying on supplementary sensors like depth maps or semantic annotations.

Conclusion: Demonstrates successful integration of 3D Gaussian Splatting with open-set semantic understanding for monocular SLAM, enabling robust performance in diverse environments without specialized sensors or annotations.

Abstract: Simultaneous Localization and Mapping (SLAM) is a foundational component in robotics, AR/VR, and autonomous systems. With the rising focus on spatial AI in recent years, combining SLAM with semantic understanding has become increasingly important for enabling intelligent perception and interaction. Recent efforts have explored this integration, but they often rely on depth sensors or closed-set semantic models, limiting their scalability and adaptability in open-world environments. In this work, we present OpenMonoGS-SLAM, the first monocular SLAM framework that unifies 3D Gaussian Splatting (3DGS) with open-set semantic understanding. To achieve our goal, we leverage recent advances in Visual Foundation Models (VFMs), including MASt3R for visual geometry and SAM and CLIP for open-vocabulary semantics. These models provide robust generalization across diverse tasks, enabling accurate monocular camera tracking and mapping, as well as a rich understanding of semantics in open-world environments. Our method operates without any depth input or 3D semantic ground truth, relying solely on self-supervised learning objectives. Furthermore, we propose a memory mechanism specifically designed to manage high-dimensional semantic features, which effectively constructs Gaussian semantic feature maps, leading to strong overall performance. Experimental results demonstrate that our approach achieves performance comparable to or surpassing existing baselines in both closed-set and open-set segmentation tasks, all without relying on supplementary sensors such as depth maps or semantic annotations.

[249] Block-Recurrent Dynamics in Vision Transformers

Mozes Jacobs, Thomas Fel, Richard Hakim, Alessandra Brondetta, Demba Ba, T. Andy Keller

Main category: cs.CV

TL;DR: Vision Transformers exhibit block-recurrent depth structure where computation can be approximated with far fewer distinct blocks applied recurrently, enabling dynamical systems analysis of their behavior.

DetailsMotivation: To provide a mechanistic account of Vision Transformers' computational phenomenology by interpreting their depth as a well-characterized dynamical flow rather than just architectural structure.

Method: Proposed Block-Recurrent Hypothesis (BRH) and developed Recurrent Approximations to Phase-structured TransfORmers (Raptor) - training block-recurrent surrogates of pretrained ViTs using only k « L distinct blocks applied recurrently.
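
The core of a Raptor-style surrogate is small: k distinct blocks are replayed across L virtual depths according to a phase schedule. A minimal sketch, assuming contiguous, equally sized phases:

```python
import torch.nn as nn

class BlockRecurrentSurrogate(nn.Module):
    """Replay k distinct blocks over L virtual depths (k << L)."""
    def __init__(self, blocks: nn.ModuleList, depth: int):
        super().__init__()
        self.blocks, self.depth = blocks, depth

    def phase_of(self, layer: int) -> int:
        # Map each virtual depth to one of k contiguous phases.
        return min(layer * len(self.blocks) // self.depth, len(self.blocks) - 1)

    def forward(self, x):
        for layer in range(self.depth):
            x = self.blocks[self.phase_of(layer)](x)   # recurrent block reuse
        return x
```

With k=2 and depth=L, this reproduces the paper's two-block setting at the original runtime, since the number of block applications is unchanged.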

Result: Demonstrated that trained ViTs admit block-recurrent structure; Raptor recovered 96% of DINOv2 ImageNet-1k accuracy with only 2 blocks at equivalent runtime; discovered directional convergence, token-specific dynamics, and low-rank updates consistent with dynamical systems behavior.

Conclusion: A compact recurrent program emerges along ViT depth, revealing low-complexity normative solutions that enable studying these models through principled dynamical systems analysis.

Abstract: As Vision Transformers (ViTs) become standard vision backbones, a mechanistic account of their computational phenomenology is essential. Despite architectural cues that hint at dynamical structure, there is no settled framework that interprets Transformer depth as a well-characterized flow. In this work, we introduce the Block-Recurrent Hypothesis (BRH), arguing that trained ViTs admit a block-recurrent depth structure such that the computation of the original $L$ blocks can be accurately rewritten using only $k \ll L$ distinct blocks applied recurrently. Across diverse ViTs, between-layer representational similarity matrices suggest a few contiguous phases. To determine whether these phases reflect genuinely reusable computation, we train block-recurrent surrogates of pretrained ViTs: Recurrent Approximations to Phase-structured TransfORmers (Raptor). At small scale, we demonstrate that stochastic depth and training promote recurrent structure and subsequently correlate with our ability to accurately fit Raptor. We then provide an empirical existence proof for BRH by training a Raptor model to recover $96\%$ of DINOv2 ImageNet-1k linear probe accuracy in only 2 blocks at equivalent runtime. Finally, we leverage our hypothesis to develop a program of Dynamical Interpretability. We find i) directional convergence into class-dependent angular basins with self-correcting trajectories under small perturbations, ii) token-specific dynamics, where the cls token executes sharp late reorientations while patch tokens exhibit strong late-stage coherence toward their mean direction, and iii) a collapse to low-rank updates in late depth, consistent with convergence to low-dimensional attractors. Altogether, we find a compact recurrent program emerges along ViT depth, pointing to a low-complexity normative solution that enables these models to be studied through principled dynamical systems analysis.

[250] Residual Decoding: Mitigating Hallucinations in Large Vision-Language Models via History-Aware Residual Guidance

Xinrong Chen, Xu Chu, Yingmin Qiu, Hengyuan Zhang, Jing Xiong, Shiyu Tang, Shuai Liu, Shaokang Yang, Cheng Yang, Hayden Kwok-Hay So, Ngai Wong

Main category: cs.CV

TL;DR: ResDec is a training-free method that reduces hallucinations in Large Vision-Language Models by using historical token information to correct language biases during decoding.

DetailsMotivation: Large Vision-Language Models suffer from language priors and hallucinations - generating content that is grammatically correct but not grounded in visual input. This reduces their reliability for multimodal understanding tasks.

Method: Residual Decoding (ResDec) uses historical information during decoding to correct biases. It leverages the model’s internal implicit reasoning mechanism and token logits evolution to suppress hallucinations without requiring additional training.
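
The summary does not spell out how the token-logits history is used, so the following is only a speculative sketch of one plausible history-aware correction: keep a running average of past-step logits and amplify the current residual against it.

```python
import torch

def residual_decode_step(logits, history, beta=0.5, momentum=0.9):
    """Amplify the residual of the current logits over their running
    history, damping tokens that persist regardless of visual evidence.
    The EMA form and hyperparameters are assumptions, not the paper's."""
    if history is None:
        history = logits.detach().clone()
    corrected = logits + beta * (logits - history)           # history-aware residual
    history = momentum * history + (1 - momentum) * logits.detach()
    return corrected, history
```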

Result: ResDec effectively suppresses hallucinations induced by language priors, significantly improves visual grounding, reduces object hallucinations, and performs well on comprehensive LVLM benchmarks.

Conclusion: ResDec is a promising training-free approach that enhances the reliability of LVLMs by reducing hallucinations while maintaining strong performance across multimodal benchmarks.

Abstract: Large Vision-Language Models (LVLMs) can reason effectively from image-text inputs and perform well in various multimodal tasks. Despite this success, they are affected by language priors and often produce hallucinations. Hallucinations denote generated content that is grammatically and syntactically coherent, yet bears no match or direct relevance to the actual visual input. To address this problem, we propose Residual Decoding (ResDec), a novel training-free method that uses historical information to aid decoding. The method relies on the internal implicit reasoning mechanism and the token-logits evolution of LVLMs to correct biases. Extensive experiments demonstrate that ResDec effectively suppresses hallucinations induced by language priors, significantly improves visual grounding, and reduces object hallucinations. In addition to mitigating hallucinations, ResDec also performs exceptionally well on comprehensive LVLM benchmarks, highlighting its broad applicability.

[251] Q-DiT4SR: Exploration of Detail-Preserving Diffusion Transformer Quantization for Real-World Image Super-Resolution

Xun Zhang, Kaicheng Yang, Hongliang Lu, Haotong Qin, Yong Guo, Yulun Zhang

Main category: cs.CV

TL;DR: Q-DiT4SR: A post-training quantization framework specifically designed for Diffusion Transformer-based real-world image super-resolution models, achieving state-of-the-art performance with significant model compression.

DetailsMotivation: Diffusion Transformers (DiTs) show promise for Real-World Image Super-Resolution but suffer from heavy inference burden. Existing quantization methods are either designed for U-Net architectures or text-to-image tasks, and directly applying them to DiT-based super-resolution causes severe texture degradation.

Method: Proposes H-SVD (hierarchical SVD) that integrates global low-rank branch with local block-wise rank-1 branch under matched parameter budget. Also introduces Variance-aware Spatio-Temporal Mixed Precision: VaSMP for cross-layer weight bit-width allocation using rate-distortion theory, and VaTMP for intra-layer activation precision scheduling across diffusion timesteps via dynamic programming.
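
The H-SVD decomposition itself is easy to sketch: a global truncated SVD plus one rank-1 SVD per block of the residual. The split of the parameter budget between the two branches is an assumption here:

```python
import torch

def h_svd(W: torch.Tensor, rank: int, block: int) -> torch.Tensor:
    """Approximate W by a global rank-`rank` SVD plus one rank-1 term per
    `block`x`block` tile of the residual."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    approx = U[:, :rank] @ torch.diag(S[:rank]) @ Vh[:rank]     # global branch
    resid = W - approx
    for i in range(0, W.shape[0], block):                       # local branch
        for j in range(0, W.shape[1], block):
            u, s, vh = torch.linalg.svd(resid[i:i+block, j:j+block],
                                        full_matrices=False)
            approx[i:i+block, j:j+block] += s[0] * torch.outer(u[:, 0], vh[0])
    return approx
```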

Result: Achieves state-of-the-art performance on multiple real-world datasets under both W4A6 and W4A4 settings. W4A4 quantization reduces model size by 5.8× and computational operations by over 60×.

Conclusion: Q-DiT4SR is the first PTQ framework specifically tailored for DiT-based Real-ISR, effectively addressing the inference burden while maintaining high-quality texture generation through specialized quantization techniques.

Abstract: Recently, Diffusion Transformers (DiTs) have emerged in Real-World Image Super-Resolution (Real-ISR) to generate high-quality textures, yet their heavy inference burden hinders real-world deployment. While Post-Training Quantization (PTQ) is a promising solution for acceleration, existing methods in super-resolution mostly focus on U-Net architectures, whereas generic DiT quantization is typically designed for text-to-image tasks. Directly applying these methods to DiT-based super-resolution models leads to severe degradation of local textures. Therefore, we propose Q-DiT4SR, the first PTQ framework specifically tailored for DiT-based Real-ISR. We propose H-SVD, a hierarchical SVD that integrates a global low-rank branch with a local block-wise rank-1 branch under a matched parameter budget. We further propose Variance-aware Spatio-Temporal Mixed Precision: VaSMP allocates cross-layer weight bit-widths in a data-free manner based on rate-distortion theory, while VaTMP schedules intra-layer activation precision across diffusion timesteps via dynamic programming (DP) with minimal calibration. Experiments on multiple real-world datasets demonstrate that our Q-DiT4SR achieves SOTA performance under both W4A6 and W4A4 settings. Notably, the W4A4 quantization configuration reduces model size by 5.8$\times$ and computational operations by over 60$\times$. Our code and models will be available at https://github.com/xunzhang1128/Q-DiT4SR.

[252] Unified Personalized Reward Model for Vision Generation

Yibin Wang, Yuhang Zang, Feng Han, Jiazi Bu, Yujie Zhou, Cheng Jin, Jiaqi Wang

Main category: cs.CV

TL;DR: UnifiedReward-Flex: A personalized multimodal reward model for vision generation that uses adaptive reasoning to align with context-dependent human preferences, improving image/video synthesis.

DetailsMotivation: Current multimodal reward models for visual generation follow one-size-fits-all approaches with fixed evaluation rubrics, making them insensitive to content-specific visual cues and systematically misaligned with subjective, context-dependent human preferences.

Method: Two-stage training: (1) Distill structured reasoning traces from advanced VLMs for supervised fine-tuning to enable flexible, context-adaptive reasoning; (2) Apply direct preference optimization on curated preference pairs to strengthen reasoning fidelity and discriminative alignment. The model interprets semantic intent, grounds visual evidence, and dynamically constructs hierarchical assessments with fine-grained criteria.
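
Stage (2) is standard direct preference optimization; for reference, the DPO objective on a batch of preference pairs:

```python
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO on preference pairs: inputs are summed token log-probs of the
    chosen (w) and rejected (l) responses under the policy and the frozen
    reference model."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()
```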

Result: Extensive results demonstrate superiority when integrated into GRPO framework for image and video synthesis, showing improved alignment with human preferences compared to existing approaches.

Conclusion: UnifiedReward-Flex addresses limitations of current reward models by introducing personalized, context-adaptive reasoning for visual generation, leading to better alignment with subjective human preferences.

Abstract: Recent advancements in multimodal reward models (RMs) have significantly propelled the development of visual generation. Existing frameworks typically adopt Bradley-Terry-style preference modeling or leverage generative VLMs as judges, and subsequently optimize visual generation models via reinforcement learning. However, current RMs suffer from inherent limitations: they often follow a one-size-fits-all paradigm that assumes a monolithic preference distribution or relies on fixed evaluation rubrics. As a result, they are insensitive to content-specific visual cues, leading to systematic misalignment with subjective and context-dependent human preferences. To this end, inspired by human assessment, we propose UnifiedReward-Flex, a unified personalized reward model for vision generation that couples reward modeling with flexible and context-adaptive reasoning. Specifically, given a prompt and the generated visual content, it first interprets the semantic intent and grounds it in visual evidence, then dynamically constructs a hierarchical assessment by instantiating fine-grained criteria under both predefined and self-generated high-level dimensions. Our training pipeline follows a two-stage process: (1) we first distill structured, high-quality reasoning traces from advanced closed-source VLMs to bootstrap SFT, equipping the model with flexible and context-adaptive reasoning behaviors; (2) we then perform direct preference optimization (DPO) on carefully curated preference pairs to further strengthen reasoning fidelity and discriminative alignment. To validate its effectiveness, we integrate UnifiedReward-Flex into the GRPO framework for image and video synthesis, and extensive results demonstrate its superiority.

[253] AdaptMMBench: Benchmarking Adaptive Multimodal Reasoning for Mode Selection and Reasoning Process

Xintong Zhang, Xiaowen Zhang, Jongrong Wu, Zhi Gao, Shilin Yan, Zhenxin Diao, Kunpeng Gao, Xuanyan Chen, Yuwei Wu, Yunde Jia, Qing Li

Main category: cs.CV

TL;DR: AdaptMMBench: A comprehensive benchmark for evaluating adaptive multimodal reasoning in VLMs across five domains with dynamic difficulty assessment and multi-dimensional process evaluation.

DetailsMotivation: Existing evaluations for adaptive multimodal reasoning rely on static difficulty labels and simplistic metrics that fail to capture dynamic difficulty relative to model capacities, obscuring distinctions between adaptive mode selection and general performance while neglecting fine-grained process analyses.

Method: Proposes AdaptMMBench benchmark across five domains (real-world, OCR, GUI, knowledge, math) with Matthews Correlation Coefficient metric to evaluate selection rationality of reasoning modes. Dynamically identifies task difficulties based on models’ capability boundaries and facilitates multi-dimensional process evaluation across key step coverage, tool effectiveness, and computational efficiency.
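
For reference, MCC on binary mode-selection decisions (did the model choose tool-augmented reasoning vs. did the dynamically determined difficulty call for it; the binary framing is an assumption):

```python
import numpy as np

def mode_selection_mcc(chose_tool: np.ndarray, needed_tool: np.ndarray) -> float:
    """Matthews Correlation Coefficient between the mode the model chose
    and the mode the task difficulty called for (boolean arrays)."""
    tp = float(np.sum(chose_tool & needed_tool))
    tn = float(np.sum(~chose_tool & ~needed_tool))
    fp = float(np.sum(chose_tool & ~needed_tool))
    fn = float(np.sum(~chose_tool & needed_tool))
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom > 0 else 0.0
```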

Result: Evaluation reveals that adaptive mode selection scales with model capacity but decouples from final accuracy, while key step coverage aligns with performance. Tool effectiveness remains highly inconsistent across model architectures.

Conclusion: AdaptMMBench provides a comprehensive framework for evaluating adaptive multimodal reasoning, offering insights into how VLMs dynamically modulate between tool-augmented visual reasoning and text reasoning, with implications for improving both effectiveness and efficiency.

Abstract: Adaptive multimodal reasoning has emerged as a promising frontier in Vision-Language Models (VLMs), aiming to dynamically modulate between tool-augmented visual reasoning and text reasoning to enhance both effectiveness and efficiency. However, existing evaluations rely on static difficulty labels and simplistic metrics, which fail to capture the dynamic nature of difficulty relative to varying model capacities. Consequently, they obscure the distinction between adaptive mode selection and general performance while neglecting fine-grained process analyses. In this paper, we propose AdaptMMBench, a comprehensive benchmark for adaptive multimodal reasoning across five domains: real-world, OCR, GUI, knowledge, and math, encompassing both direct perception and complex reasoning tasks. AdaptMMBench utilizes a Matthews Correlation Coefficient (MCC) metric to evaluate the selection rationality of different reasoning modes, isolating this meta-cognition ability by dynamically identifying task difficulties based on models’ capability boundaries. Moreover, AdaptMMBench facilitates multi-dimensional process evaluation across key step coverage, tool effectiveness, and computational efficiency. Our evaluation reveals that while adaptive mode selection scales with model capacity, it notably decouples from final accuracy. Conversely, key step coverage aligns with performance, though tool effectiveness remains highly inconsistent across model architectures.

[254] ConsisDrive: Identity-Preserving Driving World Models for Video Generation by Instance Mask

Zhuoran Yang, Yanyong Zhang

Main category: cs.CV

TL;DR: ConsisDrive is an identity-preserving driving world model that addresses identity drift in generated driving videos by enforcing instance-level temporal consistency through masked attention and loss mechanisms.

DetailsMotivation: Current world models for autonomous driving suffer from identity drift - where objects change appearance or category across frames due to lack of instance-level temporal constraints, compromising the quality and reliability of generated driving data.

Method: Two key components: (1) Instance-Masked Attention that applies identity and trajectory masks to ensure visual tokens interact only with corresponding instance features across spatial/temporal dimensions; (2) Instance-Masked Loss that adaptively emphasizes foreground regions with probabilistic instance masking to reduce background noise while maintaining scene fidelity.
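
Instance-Masked Attention boils down to an additive mask over attention logits; a sketch, assuming per-token instance ids with 0 reserved for unrestricted background:

```python
import torch

def instance_attention_mask(ids_q: torch.Tensor, ids_k: torch.Tensor) -> torch.Tensor:
    """Additive attention mask restricting each token to its own instance.
    ids_q: (B, Lq), ids_k: (B, Lk) per-token instance ids across space/time."""
    same_instance = ids_q[:, :, None] == ids_k[:, None, :]   # (B, Lq, Lk)
    unrestricted = (ids_q == 0)[:, :, None]                  # background queries
    allowed = same_instance | unrestricted
    mask = torch.zeros_like(allowed, dtype=torch.float)
    mask.masked_fill_(~allowed, float("-inf"))               # block cross-instance attention
    return mask                                              # add to logits pre-softmax
```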

Result: Achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset.

Conclusion: ConsisDrive effectively addresses identity drift in driving world models through instance-level temporal consistency mechanisms, enabling higher quality generated driving data for autonomous driving applications.

Abstract: Autonomous driving relies on robust models trained on large-scale, high-quality multi-view driving videos. Although world models provide a cost-effective solution for generating realistic driving data, they often suffer from identity drift, where the same object changes its appearance or category across frames due to the absence of instance-level temporal constraints. We introduce ConsisDrive, an identity-preserving driving world model designed to enforce temporal consistency at the instance level. Our framework incorporates two key components: (1) Instance-Masked Attention, which applies instance identity masks and trajectory masks within attention blocks to ensure that visual tokens interact only with their corresponding instance features across spatial and temporal dimensions, thereby preserving object identity consistency; and (2) Instance-Masked Loss, which adaptively emphasizes foreground regions with probabilistic instance masking, reducing background noise while maintaining overall scene fidelity. By integrating these mechanisms, ConsisDrive achieves state-of-the-art driving video generation quality and demonstrates significant improvements in downstream autonomous driving tasks on the nuScenes dataset. Our project page is https://shanpoyang654.github.io/ConsisDrive/page.html.

[255] RAWDet-7: A Multi-Scenario Benchmark for Object Detection and Description on Quantized RAW Images

Mishal Fatima, Shashank Agnihotri, Kanchana Vaishnavi Gandikota, Michael Moeller, Margret Keuper

Main category: cs.CV

TL;DR: RAWDet-7 dataset provides ~25k training and 7.6k test RAW images with object detection annotations and descriptions to study machine perception from unprocessed sensor data.

DetailsMotivation: Most vision models use RGB images processed for human perception, discarding sensor-level information. RAW images preserve richer scene data for better machine reasoning, but lack large-scale datasets for research.

Method: Introduce RAWDet-7 dataset with diverse RAW images from multiple cameras, annotated for object detection following MS-COCO/LVIS conventions, plus object descriptions from corresponding sRGB images. Includes evaluation under simulated 4/6/8-bit quantization.
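
The simulated bit-depth settings can be reproduced with uniform re-quantization of the RAW values; a sketch, assuming a 14-bit source and a uniform quantizer (the dataset's exact scheme may differ):

```python
import numpy as np

def quantize_raw(raw: np.ndarray, src_bits: int = 14, dst_bits: int = 4) -> np.ndarray:
    """Uniformly re-quantize a RAW frame from src_bits to dst_bits."""
    levels = 2 ** dst_bits - 1
    x = raw.astype(np.float64) / (2 ** src_bits - 1)   # normalize to [0, 1]
    q = np.round(x * levels) / levels                  # keep only 2**dst_bits levels
    return (q * (2 ** src_bits - 1)).astype(raw.dtype)
```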

Result: Dataset enables study of object detection and description quality in RAW image processing, particularly under low-bit quantization constraints that reflect realistic sensor limitations.

Conclusion: RAWDet-7 provides a benchmark for advancing machine vision using unprocessed sensor data, facilitating research on information preservation in RAW image processing and low-bit quantization.

Abstract: Most vision models are trained on RGB images processed through ISP pipelines optimized for human perception, which can discard sensor-level information useful for machine reasoning. RAW images preserve unprocessed scene data, enabling models to leverage richer cues for both object detection and object description, capturing fine-grained details, spatial relationships, and contextual information often lost in processed images. To support research in this domain, we introduce RAWDet-7, a large-scale dataset of ~25k training and 7.6k test RAW images collected across diverse cameras, lighting conditions, and environments, densely annotated for seven object categories following MS-COCO and LVIS conventions. In addition, we provide object-level descriptions derived from the corresponding high-resolution sRGB images, facilitating the study of object-level information preservation under RAW image processing and low-bit quantization. The dataset allows evaluation under simulated 4-bit, 6-bit, and 8-bit quantization, reflecting realistic sensor constraints, and provides a benchmark for studying detection performance, description quality & detail, and generalization in low-bit RAW image processing. Dataset & code upon acceptance.

[256] Entropy-Aware Structural Alignment for Zero-Shot Handwritten Chinese Character Recognition

Qiuming Luo, Tao Zeng, Feng Li, Heming Liu, Rui Mao, Chang Kong

Main category: cs.CV

TL;DR: An entropy-aware structural alignment network for zero-shot handwritten Chinese character recognition that addresses hierarchical topology and uneven information density through information-theoretic modeling.

DetailsMotivation: Existing zero-shot HCCR approaches treat characters as flat radical sequences, neglecting hierarchical topology and uneven information density of different components, leading to suboptimal visual-semantic alignment.

Method: Proposes three key components: 1) Information Entropy Prior for dynamic positional embedding modulation via multiplicative interaction, 2) Dual-View Radical Tree for multi-granularity structural feature extraction with adaptive gating, and 3) Top-K Semantic Feature Fusion mechanism using semantic neighbor centroids to rectify visual ambiguities.
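
The first component, multiplicative modulation of positional embeddings by an entropy prior, might be sketched as follows (the sigmoid gating form is an assumption):

```python
import torch

def entropy_modulated_pe(pos_embed: torch.Tensor,       # (L, d) positional embeddings
                         radical_entropy: torch.Tensor,  # (L,) entropy prior per radical
                         w: float = 1.0) -> torch.Tensor:
    """Scale positional embeddings by an entropy-derived saliency gate so
    informative radicals dominate over ubiquitous components."""
    gate = torch.sigmoid(w * radical_entropy)
    return pos_embed * gate[:, None]                    # multiplicative interaction
```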

Result: Achieves state-of-the-art 55.04% accuracy on ICDAR 2013 dataset (m=1500), significantly outperforming CLIP-based baselines. Shows exceptional data efficiency with 92.41% accuracy using only one support sample per class.

Conclusion: The proposed entropy-aware structural alignment network effectively bridges the visual-semantic gap in zero-shot HCCR by modeling hierarchical topology and information density, demonstrating superior performance and data efficiency.

Abstract: Zero-shot Handwritten Chinese Character Recognition (HCCR) aims to recognize unseen characters by leveraging radical-based semantic compositions. However, existing approaches often treat characters as flat radical sequences, neglecting the hierarchical topology and the uneven information density of different components. To address these limitations, we propose an Entropy-Aware Structural Alignment Network that bridges the visual-semantic gap through information-theoretic modeling. First, we introduce an Information Entropy Prior to dynamically modulate positional embeddings via multiplicative interaction, acting as a saliency detector that prioritizes discriminative roots over ubiquitous components. Second, we construct a Dual-View Radical Tree to extract multi-granularity structural features, which are integrated via an adaptive Sigmoid-based gating network to encode both global layout and local spatial roles. Finally, a Top-K Semantic Feature Fusion mechanism is devised to augment the decoding process by utilizing the centroid of semantic neighbors, effectively rectifying visual ambiguities through feature-level consensus. Extensive experiments demonstrate that our method establishes new state-of-the-art performance, achieving an accuracy of 55.04% on the ICDAR 2013 dataset ($m=1500$), significantly outperforming existing CLIP-based baselines in the challenging zero-shot setting. Furthermore, the framework exhibits exceptional data efficiency, demonstrating rapid adaptability with minimal support samples, achieving 92.41% accuracy with only one support sample per class.

[257] Thinking with Geometry: Active Geometry Integration for Spatial Reasoning

Haoyuan Li, Qihang Cao, Tao Tang, Kun Xiang, Zihan Guo, Jianhua Han, Hang Xu, Xiaodan Liang

Main category: cs.CV

TL;DR: GeoThinker is a framework that enables MLLMs to actively retrieve geometric evidence based on reasoning demands rather than passively fusing global geometry streams, achieving state-of-the-art spatial intelligence performance.

DetailsMotivation: Current MLLMs for spatial reasoning passively fuse geometric priors from 3D encoders as global streams, leading to semantic-geometry misalignment and redundant signals. The authors aim to shift from passive fusion to active perception.

Method: GeoThinker uses Spatial-Grounded Fusion at selected VLM layers where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, calibrated by Importance Gating that biases attention toward task-relevant structures.
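
A loose sketch of Spatial-Grounded Fusion: semantic tokens query each frame's geometry separately (frame-strict), and a learned gate scales how much geometric evidence is injected. The scalar per-frame gate and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SpatialGroundedFusion(nn.Module):
    """Semantic tokens query geometry one frame at a time; an importance
    gate scales each frame's injected evidence."""
    def __init__(self, dim: int = 1024, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, sem: torch.Tensor, geo: torch.Tensor) -> torch.Tensor:
        # sem: (B, Ls, d) semantic tokens; geo: (B, F, Lg, d) per-frame geometry.
        for f in range(geo.size(1)):                         # frame-strict attention
            evidence, _ = self.attn(sem, geo[:, f], geo[:, f])
            g = self.gate(evidence.mean(dim=1, keepdim=True))  # (B, 1, 1) gate
            sem = sem + g * evidence                         # gated residual injection
        return sem
```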

Result: Achieves state-of-the-art spatial intelligence with peak score of 72.6 on VSI-Bench, demonstrates robust generalization and improved spatial perception across complex downstream scenarios including embodied referring and autonomous driving.

Conclusion: Active integration of spatial structures is essential for next-generation spatial intelligence, moving beyond passive fusion to selective geometric evidence retrieval based on reasoning demands.

Abstract: Recent progress in spatial reasoning with Multimodal Large Language Models (MLLMs) increasingly leverages geometric priors from 3D encoders. However, most existing integration strategies remain passive: geometry is exposed as a global stream and fused in an indiscriminate manner, which often induces semantic-geometry misalignment and redundant signals. We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. GeoThinker achieves this through Spatial-Grounded Fusion applied at carefully selected VLM layers, where semantic visual priors selectively query and integrate task-relevant geometry via frame-strict cross-attention, further calibrated by Importance Gating that biases per-frame attention toward task-relevant structures. Comprehensive evaluation results show that GeoThinker sets a new state-of-the-art in spatial intelligence, achieving a peak score of 72.6 on the VSI-Bench. Furthermore, GeoThinker demonstrates robust generalization and significantly improved spatial perception across complex downstream scenarios, including embodied referring and autonomous driving. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence. Code can be found at https://github.com/Li-Hao-yuan/GeoThinker.

[258] Cross-Modal Redundancy and the Geometry of Vision-Language Embeddings

Grégoire Dhimoïla, Thomas Fel, Victor Boutin, Agustin Picard

Main category: cs.CV

TL;DR: The paper proposes an Aligned Sparse Autoencoder framework based on the Iso-Energy Assumption to analyze the geometry of vision-language model embedding spaces, revealing that sparse bimodal atoms carry cross-modal alignment while unimodal atoms explain modality gaps.

DetailsMotivation: Vision-language models have achieved remarkable success in aligning images and text, but the geometry of their shared embedding space remains poorly understood. The authors aim to develop tools to probe this geometry and understand how cross-modal alignment is structured.

Method: The authors propose the Iso-Energy Assumption, which states that truly shared concepts should exhibit the same average energy across modalities. They operationalize this with an Aligned Sparse Autoencoder (SAE) that encourages energy consistency during training while preserving reconstruction quality.
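
The Iso-Energy term is simple to write down; a sketch in which per-atom average energy is matched across paired image/text batches (the squared-difference penalty, added to the usual SAE reconstruction loss, is an assumption):

```python
import torch

def iso_energy_loss(z_img: torch.Tensor, z_txt: torch.Tensor) -> torch.Tensor:
    """Match per-atom average energy across modalities.
    z_img, z_txt: (B, n_atoms) SAE codes of paired images and captions."""
    e_img = z_img.pow(2).mean(dim=0)      # average energy per atom (images)
    e_txt = z_txt.pow(2).mean(dim=0)      # average energy per atom (text)
    return (e_img - e_txt).pow(2).mean()  # added to the SAE reconstruction loss
```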

Result: The framework reveals that: (1) sparse bimodal atoms carry the entire cross-modal alignment signal; (2) unimodal atoms act as modality-specific biases and fully explain the modality gap; (3) removing unimodal atoms collapses the gap without harming performance; (4) restricting vector arithmetic to the bimodal subspace yields in-distribution edits and improved retrieval.

Conclusion: The right inductive bias can both preserve model fidelity and render the latent geometry interpretable and actionable. The findings provide insights into the structure of vision-language model embeddings and offer practical ways to manipulate them for better performance.

Abstract: Vision-language models (VLMs) align images and text with remarkable success, yet the geometry of their shared embedding space remains poorly understood. To probe this geometry, we begin from the Iso-Energy Assumption, which exploits cross-modal redundancy: a concept that is truly shared should exhibit the same average energy across modalities. We operationalize this assumption with an Aligned Sparse Autoencoder (SAE) that encourages energy consistency during training while preserving reconstruction. We find that this inductive bias changes the SAE solution without harming reconstruction, giving us a representation that serves as a tool for geometric analysis. Sanity checks on controlled data with known ground truth confirm that alignment improves when Iso-Energy holds and remains neutral when it does not. Applied to foundational VLMs, our framework reveals a clear structure with practical consequences: (i) sparse bimodal atoms carry the entire cross-modal alignment signal; (ii) unimodal atoms act as modality-specific biases and fully explain the modality gap; (iii) removing unimodal atoms collapses the gap without harming performance; (iv) restricting vector arithmetic to the bimodal subspace yields in-distribution edits and improved retrieval. These findings suggest that the right inductive bias can both preserve model fidelity and render the latent geometry interpretable and actionable.

[259] SoulX-FlashHead: Oracle-guided Generation of Infinite Real-time Streaming Talking Heads

Tan Yu, Qian Qiao, Le Shen, Ke Zhou, Jincheng Hu, Dian Sheng, Bo Hu, Haoming Qin, Jun Gao, Changhai Zhou, Shunshun Yin, Siyuan Liu

Main category: cs.CV

TL;DR: SoulX-FlashHead: A 1.3B-parameter framework for real-time, high-fidelity audio-driven portrait video generation with streaming capabilities

DetailsMotivation: Addressing the challenge of balancing high-fidelity visual quality with low-latency streaming in audio-driven portrait generation, where existing models either have prohibitive computational costs or compromise on facial representations and temporal stability.

Method: Proposes a unified 1.3B-parameter framework with Streaming-Aware Spatiotemporal Pre-training using Temporal Audio Context Cache for robust feature extraction from short audio fragments, and Oracle-Guided Bidirectional Distillation to mitigate error accumulation in long-sequence autoregressive generation.

Result: Achieves state-of-the-art performance on HDTF and VFHQ benchmarks, with Lite variant reaching 96 FPS on a single NVIDIA RTX 4090, enabling ultra-fast interaction without sacrificing visual coherence.

Conclusion: SoulX-FlashHead successfully addresses the trade-off between quality and latency in audio-driven portrait generation, providing a practical solution for real-time streaming applications with high visual fidelity.

Abstract: Achieving a balance between high-fidelity visual quality and low-latency streaming remains a formidable challenge in audio-driven portrait generation. Existing large-scale models often suffer from prohibitive computational costs, while lightweight alternatives typically compromise on holistic facial representations and temporal stability. In this paper, we propose SoulX-FlashHead, a unified 1.3B-parameter framework designed for real-time, infinite-length, and high-fidelity streaming video generation. To address the instability of audio features in streaming scenarios, we introduce Streaming-Aware Spatiotemporal Pre-training equipped with a Temporal Audio Context Cache mechanism, which ensures robust feature extraction from short audio fragments. Furthermore, to mitigate the error accumulation and identity drift inherent in long-sequence autoregressive generation, we propose Oracle-Guided Bidirectional Distillation, leveraging ground-truth motion priors to provide precise physical guidance. We also present VividHead, a large-scale, high-quality dataset containing 782 hours of strictly aligned footage to support robust training. Extensive experiments demonstrate that SoulX-FlashHead achieves state-of-the-art performance on HDTF and VFHQ benchmarks. Notably, our Lite variant achieves an inference speed of 96 FPS on a single NVIDIA RTX 4090, facilitating ultra-fast interaction without sacrificing visual coherence.

[260] Fine-R1: Make Multi-modal LLMs Excel in Fine-Grained Visual Recognition by Chain-of-Thought Reasoning

Hulingxiao He, Zijun Geng, Yuxin Peng

Main category: cs.CV

TL;DR: Fine-R1 is a multimodal LLM specialized for fine-grained visual recognition using an R1-style training framework with chain-of-thought fine-tuning and triplet-augmented policy optimization, achieving strong performance with minimal training data.

DetailsMotivation: Current MLLMs perform well on coarse-grained visual tasks but struggle with fine-grained visual recognition, requiring large annotated datasets and suffering from poor generalization to unseen sub-categories. There's a performance gap compared to contrastive CLIP models designed for discriminative tasks.

Method: Two-stage R1-style framework: (1) Chain-of-Thought Supervised Fine-tuning with constructed FGVR CoT dataset containing rationales for visual analysis, candidate sub-categories, comparison, and prediction; (2) Triplet Augmented Policy Optimization with intra-class augmentation (mixing trajectories within same category) and inter-class augmentation (maximizing response distinction across sub-categories).

Result: With only 4-shot training, Fine-R1 outperforms existing general MLLMs, reasoning MLLMs, and contrastive CLIP models in identifying both seen and unseen sub-categories, showing strong generalization capabilities.

Conclusion: Fine-R1 demonstrates that specialized MLLMs can excel at fine-grained visual recognition with minimal training data, offering promise for knowledge-intensive domains where expert annotations are scarce, bridging the gap between general-purpose MLLMs and specialized discriminative models.

Abstract: Any entity in the visual world can be hierarchically grouped based on shared characteristics and mapped to fine-grained sub-categories. While Multi-modal Large Language Models (MLLMs) achieve strong performance on coarse-grained visual tasks, they often struggle with Fine-Grained Visual Recognition (FGVR). Adapting general-purpose MLLMs to FGVR typically requires large amounts of annotated data, which is costly to obtain, leaving a substantial performance gap compared to contrastive CLIP models dedicated to discriminative tasks. Moreover, MLLMs tend to overfit to seen sub-categories and generalize poorly to unseen ones. To address these challenges, we propose Fine-R1, an MLLM tailored for FGVR through an R1-style training framework: (1) Chain-of-Thought Supervised Fine-tuning, where we construct a high-quality FGVR CoT dataset with rationales of “visual analysis, candidate sub-categories, comparison, and prediction”, transitioning the model into a strong open-world classifier; and (2) Triplet Augmented Policy Optimization, where Intra-class Augmentation mixes trajectories from anchor and positive images within the same category to improve robustness to intra-class variance, while Inter-class Augmentation maximizes the response distinction conditioned on images across sub-categories to enhance discriminative ability. With only 4-shot training, Fine-R1 outperforms existing general MLLMs, reasoning MLLMs, and even contrastive CLIP models in identifying both seen and unseen sub-categories, showing promise for knowledge-intensive domains where gathering expert annotations for all sub-categories is arduous. Code is available at https://github.com/PKU-ICST-MIPL/FineR1_ICLR2026.

[261] WristMIR: Coarse-to-Fine Region-Aware Retrieval of Pediatric Wrist Radiographs with Radiology Report-Driven Learning

Mert Sonmezer, Serge Vasylechko, Duygu Atasoy, Seyda Ertekin, Sila Kurugol

Main category: cs.CV

TL;DR: WristMIR: A region-aware pediatric wrist radiograph retrieval framework using dense radiology reports and bone-specific localization for fine-grained image representations without manual annotations.

DetailsMotivation: Retrieving wrist radiographs with analogous fracture patterns is challenging due to subtle, localized cues obscured by overlapping anatomy or variable imaging views, compounded by scarcity of large annotated datasets for medical image retrieval.

Method: Uses MedGemma-based structured report mining to generate global and region-level captions, processes wrist images with bone-specific crops, jointly trains global and local contrastive encoders, and performs two-stage retrieval: coarse global matching followed by region-conditioned reranking.

Result: Improves retrieval performance over baselines (image-to-text Recall@5 from 0.82% to 9.35%), yields stronger fracture classification (AUROC 0.949, AUPRC 0.953), and improves retrieval-based fracture diagnosis (mean F1 from 0.568 to 0.753). Radiologists rate retrieved cases as more clinically relevant.

Conclusion: Anatomically guided retrieval can enhance diagnostic reasoning and support clinical decision-making in pediatric musculoskeletal imaging, demonstrating the potential of region-aware approaches in medical image analysis.

Abstract: Retrieving wrist radiographs with analogous fracture patterns is challenging because clinically important cues are subtle, highly localized and often obscured by overlapping anatomy or variable imaging views. Progress is further limited by the scarcity of large, well-annotated datasets for case-based medical image retrieval. We introduce WristMIR, a region-aware pediatric wrist radiograph retrieval framework that leverages dense radiology reports and bone-specific localization to learn fine-grained, clinically meaningful image representations without any manual image-level annotations. Using MedGemma-based structured report mining to generate both global and region-level captions, together with pre-processed wrist images and bone-specific crops of the distal radius, distal ulna, and ulnar styloid, WristMIR jointly trains global and local contrastive encoders and performs a two-stage retrieval process: (1) coarse global matching to identify candidate exams, followed by (2) region-conditioned reranking aligned to a predefined anatomical bone region. WristMIR improves retrieval performance over strong vision-language baselines, raising image-to-text Recall@5 from 0.82% to 9.35%. Its embeddings also yield stronger fracture classification (AUROC 0.949, AUPRC 0.953). In region-aware evaluation, the two-stage design markedly improves retrieval-based fracture diagnosis, increasing mean $F_1$ from 0.568 to 0.753, and radiologists rate its retrieved cases as more clinically relevant, with mean scores rising from 3.36 to 4.35. These findings highlight the potential of anatomically guided retrieval to enhance diagnostic reasoning and support clinical decision-making in pediatric musculoskeletal imaging. The source code is publicly available at https://github.com/quin-med-harvard-edu/WristMIR.
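The coarse-to-fine retrieval itself is a generic pattern. Below is a hedged numpy sketch of what the two-stage process might look like with precomputed embeddings: cosine matching on global vectors, then reranking candidates by a region embedding (e.g., from a distal-radius crop encoder). The embedding dimensions and candidate counts are assumptions.

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def two_stage_retrieve(query_global, query_region, db_global, db_region,
                       k_coarse=50, k_final=5):
    """Stage 1: coarse global matching; Stage 2: region-conditioned rerank."""
    qg, qr = l2norm(query_global), l2norm(query_region)
    dbg, dbr = l2norm(db_global), l2norm(db_region)
    # Stage 1: cosine similarity on global embeddings -> candidate exams.
    coarse = dbg @ qg
    cand = np.argsort(-coarse)[:k_coarse]
    # Stage 2: rerank candidates by similarity of the bone-region embedding.
    fine = dbr[cand] @ qr
    return cand[np.argsort(-fine)[:k_final]]

rng = np.random.default_rng(0)
db_g, db_r = rng.normal(size=(1000, 256)), rng.normal(size=(1000, 128))
q_g, q_r = rng.normal(size=256), rng.normal(size=128)
print(two_stage_retrieve(q_g, q_r, db_g, db_r))
```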

[262] Efficient-SAM2: Accelerating SAM2 with Object-Aware Visual Encoding and Memory Retrieval

Jing Zhang, Zhikai Li, Xuewen Liu, Qingyi Gu

Main category: cs.CV

TL;DR: Efficient-SAM2 accelerates SAM2 for video object segmentation by exploiting sparse attention patterns to eliminate redundant computations in both image encoder and memory bank, achieving 1.68x speedup with minimal accuracy loss.

DetailsMotivation: SAM2 shows excellent video object segmentation performance but has heavy computational burden that hinders real-time applications. Existing efficiency improvements focus on retraining lightweight backbones, with little exploration of post-training acceleration through computational redundancy elimination.

Method: Proposes Efficient-SAM2 with two key components: 1) Object-aware Sparse Window Routing (SWR) for image encoder - uses consistency and saliency cues from previous-frame decoder to route background regions to lightweight shortcut branch; 2) Object-aware Sparse Memory Retrieval (SMR) for memory attention - allows only salient memory tokens to participate in computation, reusing saliency patterns from first recollection.

Result: Achieves 1.68x speedup on SAM2.1-L model with only 1.0% accuracy drop on SA-V test set. Uses negligible additional parameters and minimal training overhead.

Conclusion: Efficient-SAM2 successfully accelerates SAM2 by exploiting its sparse perception patterns to eliminate redundant computations, making it more suitable for real-time video processing applications while maintaining strong segmentation performance.

Abstract: Segment Anything Model 2 (SAM2) shows excellent performance in video object segmentation tasks; however, the heavy computational burden hinders its application in real-time video processing. Although there have been efforts to improve the efficiency of SAM2, most of them focus on retraining a lightweight backbone, with little exploration into post-training acceleration. In this paper, we observe that SAM2 exhibits sparse perception pattern as biological vision, which provides opportunities for eliminating redundant computation and acceleration: i) In mask decoder, the attention primarily focuses on the foreground objects, whereas the image encoder in the earlier stage exhibits a broad attention span, which results in unnecessary computation to background regions. ii) In memory bank, only a small subset of tokens in each frame contribute significantly to memory attention, and the salient regions exhibit temporal consistency, making full-token computation redundant. With these insights, we propose Efficient-SAM2, which promotes SAM2 to adaptively focus on object regions while eliminating task-irrelevant computations, thereby significantly improving inference efficiency. Specifically, for image encoder, we propose object-aware Sparse Window Routing (SWR), a window-level computation allocation mechanism that leverages the consistency and saliency cues from the previous-frame decoder to route background regions into a lightweight shortcut branch. Moreover, for memory attention, we propose object-aware Sparse Memory Retrieval (SMR), which allows only the salient memory tokens in each frame to participate in computation, with the saliency pattern reused from their first recollection. With negligible additional parameters and minimal training overhead, Efficient-SAM2 delivers 1.68x speedup on SAM2.1-L model with only 1.0% accuracy drop on SA-V test set.
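To make the memory-side idea concrete, here is a minimal sketch of saliency-based memory-token selection in the spirit of SMR; the saliency source (e.g., decoder attention from each token's first recollection) and the keep ratio are assumptions, not SAM2 internals.

```python
import torch

def sparse_memory_tokens(memory, saliency, keep_ratio=0.1):
    """memory: (T, N, D) per-frame memory tokens; saliency: (T, N) scores.
    Only the top-k salient tokens per frame join memory attention."""
    T, N, D = memory.shape
    k = max(1, int(N * keep_ratio))
    idx = saliency.topk(k, dim=1).indices                    # (T, k)
    kept = torch.gather(memory, 1, idx.unsqueeze(-1).expand(T, k, D))
    return kept.reshape(T * k, D)                            # compact bank

mem = torch.randn(7, 4096, 256)      # 7 frames, 4096 tokens each
sal = torch.rand(7, 4096)
bank = sparse_memory_tokens(mem, sal, keep_ratio=0.05)
print(bank.shape)                    # ~95% of memory-attention keys removed
```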

[263] E-VAds: An E-commerce Short Videos Understanding Benchmark for MLLMs

Xianjie Liu, Yiman Hu, Liang Wu, Ping Hu, Yixiong Zou, Jian Xu, Bo Zheng

Main category: cs.CV

TL;DR: E-VAds benchmark for e-commerce short video understanding with multi-modal dense signals, featuring RL-based reasoning model E-VAds-R1 achieving 109.2% performance gain in commercial intent reasoning.

DetailsMotivation: Current video understanding models struggle with e-commerce short videos due to their goal-driven format, dense multi-modal signals, and lack of benchmarks focusing on commercial intent reasoning. Existing benchmarks focus on general-purpose tasks and neglect the specialized reasoning needed for e-commerce content.

Method: 1) Proposed multi-modal information density assessment framework to quantify complexity; 2) Created E-VAds benchmark with 3,961 Taobao videos and 19,785 open-ended Q&A pairs via multi-agent system; 3) Organized questions into Perception and Cognition/Reasoning dimensions with five tasks; 4) Developed E-VAds-R1 RL-based reasoning model with multi-grained reward design (MG-GRPO) for smooth early exploration and non-linear expert-level precision incentives.

Result: E-commerce content exhibits substantially higher density across visual, audio, and textual modalities compared to mainstream datasets. E-VAds-R1 achieves 109.2% performance gain in commercial intent reasoning with only a few hundred training samples.

Conclusion: E-VAds establishes a challenging frontier for video understanding in e-commerce domain, addressing the gap in commercial intent reasoning. The RL-based approach with multi-grained rewards effectively handles the dense multi-modal signals and complex reasoning requirements.

Abstract: E-commerce short videos represent a high-revenue segment of the online video industry characterized by a goal-driven format and dense multi-modal signals. Current models often struggle with these videos because existing benchmarks focus primarily on general-purpose tasks and neglect reasoning about commercial intent. In this work, we first propose a multi-modal information density assessment framework to quantify the complexity of this domain. Our evaluation reveals that e-commerce content exhibits substantially higher density across visual, audio, and textual modalities compared to mainstream datasets, establishing a more challenging frontier for video understanding. To address this gap, we introduce the E-commerce Video Ads Benchmark (E-VAds), which is the first benchmark specifically designed for e-commerce short video understanding. We curated 3,961 high-quality videos from Taobao covering a wide range of product categories and used a multi-agent system to generate 19,785 open-ended Q&A pairs. These questions are organized into two primary dimensions, Perception and Cognition-and-Reasoning, which together comprise five distinct tasks. Finally, we develop E-VAds-R1, an RL-based reasoning model featuring a multi-grained reward design called MG-GRPO. This strategy provides smooth guidance for early exploration while creating a non-linear incentive for expert-level precision. Experimental results demonstrate that E-VAds-R1 achieves a 109.2% performance gain in commercial intent reasoning with only a few hundred training samples.

[264] Understanding Image2Video Domain Shift in Food Segmentation: An Instance-level Analysis on Apples

Keonvin Park, Aditya Pal, Jin Hong Mok

Main category: cs.CV

TL;DR: Food segmentation models trained on images fail in video settings due to temporal inconsistency, causing identity fragmentation and counting errors despite high frame-wise accuracy.

DetailsMotivation: To understand why food segmentation models trained on static images perform poorly in video applications like food monitoring and counting, where temporal consistency is crucial.

Method: Analyze failure through instance segmentation and tracking perspective using apples as case study. Train models on image-level food segmentation data, evaluate on videos using instance segmentation with tracking-by-matching framework for object-level temporal analysis.

Result: High frame-wise segmentation accuracy doesn’t translate to stable instance identities over time. Temporal appearance variations (illumination changes, specular reflections, texture ambiguity) cause mask flickering and identity fragmentation, leading to significant counting errors. Conventional image-based metrics overestimate video performance.

Conclusion: Root cause is image-centric training objectives ignoring temporal coherence, not model capacity. Highlights critical evaluation gap in food segmentation research and motivates temporally-aware learning and evaluation protocols for video-based food analysis.

Abstract: Food segmentation models trained on static images have achieved strong performance on benchmark datasets; however, their reliability in video settings remains poorly understood. In real-world applications such as food monitoring and instance counting, segmentation outputs must be temporally consistent, yet image-trained models often break down when deployed on videos. In this work, we analyze this failure through an instance segmentation and tracking perspective, focusing on apples as a representative food category. Models are trained solely on image-level food segmentation data and evaluated on video sequences using an instance segmentation with tracking-by-matching framework, enabling object-level temporal analysis. Our results reveal that high frame-wise segmentation accuracy does not translate to stable instance identities over time. Temporal appearance variations, particularly illumination changes, specular reflections, and texture ambiguity, lead to mask flickering and identity fragmentation, resulting in significant errors in apple counting. These failures are largely overlooked by conventional image-based metrics, which substantially overestimate real-world video performance. Beyond diagnosing the problem, we examine practical remedies that do not require full video supervision, including post-hoc temporal regularization and self-supervised temporal consistency objectives. Our findings suggest that the root cause of failure lies in image-centric training objectives that ignore temporal coherence, rather than model capacity. This study highlights a critical evaluation gap in food segmentation research and motivates temporally-aware learning and evaluation protocols for video-based food analysis.
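For reference, a generic tracking-by-matching step of the kind this analysis relies on looks like the sketch below (IoU cost plus Hungarian assignment); the IoU threshold and box format are assumptions, not the paper's exact protocol.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """a, b: boxes as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def match_frames(prev_boxes, curr_boxes, iou_thresh=0.3):
    """Assign current detections to previous tracks; unmatched detections
    start new identities, which is where mask flicker fragments IDs."""
    cost = np.array([[1 - iou(p, c) for c in curr_boxes] for p in prev_boxes])
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols)
               if 1 - cost[r, c] >= iou_thresh]
    matched = {c for _, c in matches}
    new_ids = [c for c in range(len(curr_boxes)) if c not in matched]
    return matches, new_ids

prev = [(10, 10, 50, 50), (60, 60, 90, 90)]
curr = [(12, 11, 52, 49), (200, 200, 230, 230)]  # one stable, one "new" apple
print(match_frames(prev, curr))
```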

[265] Automatic regularization parameter choice for tomography using a double model approach

Chuyang Wu, Samuli Siltanen

Main category: cs.CV

TL;DR: Novel automatic regularization parameter selection method for X-ray tomography using two computational grids and feedback control

DetailsMotivation: X-ray tomography reconstruction is ill-posed with limited data, requiring regularization. However, regularization effectiveness depends on parameter selection, which is challenging to determine automatically.

Method: Uses two distinct computational discretizations of the same problem. A feedback control algorithm dynamically adjusts regularization strength, driving iterative reconstruction toward smallest parameter yielding sufficient similarity between reconstructions on the two grids.

Result: Demonstrated effectiveness using real tomographic data, showing the method can automatically select appropriate regularization parameters for improved reconstruction quality.

Conclusion: Proposed approach provides automatic parameter selection for X-ray tomography regularization, addressing a key challenge in ill-posed inverse problems with limited data.

Abstract: Image reconstruction in X-ray tomography is an ill-posed inverse problem, particularly with limited available data. Regularization is thus essential, but its effectiveness hinges on the choice of a regularization parameter that balances data fidelity against a priori information. We present a novel method for automatic parameter selection based on the use of two distinct computational discretizations of the same problem. A feedback control algorithm dynamically adjusts the regularization strength, driving an iterative reconstruction toward the smallest parameter that yields sufficient similarity between reconstructions on the two grids. The effectiveness of the proposed approach is demonstrated using real tomographic data.
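A conceptual sketch of the double-model parameter search follows. The `reconstruct` function is a toy stand-in for any regularized solver on a given discretization, and the multiplicative update rule, tolerance, and normalized-cross-correlation similarity are illustrative assumptions rather than the paper's control law.

```python
import numpy as np

def reconstruct(sinogram, grid_size, lam):
    """Toy stand-in for a regularized solver on one discretization: shared
    structure plus grid-dependent noise that stronger regularization damps."""
    rng = np.random.default_rng(grid_size)
    truth = np.outer(np.hanning(64), np.hanning(64))
    return truth + rng.normal(size=(64, 64)) / (1.0 + 10.0 * lam)

def auto_lambda(sinogram, lam=1e-3, tol=0.95, grow=1.5, max_iter=50):
    """Feedback loop: raise lam until the two-grid reconstructions agree,
    returning the smallest lam that reaches the similarity tolerance."""
    for _ in range(max_iter):
        rec_a = reconstruct(sinogram, grid_size=128, lam=lam)
        rec_b = reconstruct(sinogram, grid_size=256, lam=lam)
        sim = np.corrcoef(rec_a.ravel(), rec_b.ravel())[0, 1]
        if sim >= tol:
            return lam
        lam *= grow   # not similar enough yet: regularize more strongly
    return lam

print(auto_lambda(sinogram=None))
```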

[266] ALIVE: Animate Your World with Lifelike Audio-Video Generation

Ying Guo, Qijun Gan, Yifu Zhang, Jinlai Liu, Yifei Hu, Pan Xie, Dongjun Qian, Yu Zhang, Ruiqi Li, Yuqi Zhang, Ruibiao Lu, Xiaofeng Mei, Bo Han, Xiang Yin, Bingyue Peng, Zehuan Yuan

Main category: cs.CV

TL;DR: ALIVE adapts pretrained T2V models for Sora-style audio-video generation and animation, achieving state-of-the-art performance through architectural innovations and high-quality data.

DetailsMotivation: Video generation is evolving toward unified audio-video generation, but existing Text-to-Video (T2V) models lack audio generation and animation capabilities. The paper aims to adapt pretrained T2V models to achieve Sora-style audio-video generation and animation.

Method: Augments MMDiT architecture with joint audio-video branch featuring TA-CrossAttn for temporally-aligned cross-modal fusion and UniTemp-RoPE for precise audio-visual alignment. Uses comprehensive data pipeline for high-quality finetuning data collection and introduces new benchmark for evaluation.

Result: After pretraining on million-level high-quality data, ALIVE consistently outperforms open-source models and matches or surpasses state-of-the-art commercial solutions in audio-video generation.

Conclusion: ALIVE successfully adapts T2V models for unified audio-video generation and animation, providing detailed recipes and benchmarks to help community develop such models more efficiently.

Abstract: Video generation is rapidly evolving towards unified audio-video generation. In this paper, we present ALIVE, a generation model that adapts a pretrained Text-to-Video (T2V) model to Sora-style audio-video generation and animation. In particular, the model unlocks the Text-to-Video&Audio (T2VA) and Reference-to-Video&Audio (animation) capabilities compared to the T2V foundation models. To support the audio-visual synchronization and reference animation, we augment the popular MMDiT architecture with a joint audio-video branch which includes TA-CrossAttn for temporally-aligned cross-modal fusion and UniTemp-RoPE for precise audio-visual alignment. Meanwhile, a comprehensive data pipeline consisting of audio-video captioning, quality control, etc., is carefully designed to collect high-quality finetuning data. Additionally, we introduce a new benchmark to perform a comprehensive model test and comparison. After continued pretraining and finetuning on million-level high-quality data, ALIVE demonstrates outstanding performance, consistently outperforming open-source models and matching or surpassing state-of-the-art commercial solutions. With detailed recipes and benchmarks, we hope ALIVE helps the community develop audio-video generation models more efficiently. Official page: https://github.com/FoundationVision/Alive.

[267] From Correspondence to Actions: Human-Like Multi-Image Spatial Reasoning in Multi-modal Large Language Models

Masanari Oi, Koki Maeda, Ryuto Koike, Daisuke Oba, Nakamasa Inoue, Naoaki Okazaki

Main category: cs.CV

TL;DR: HATCH is a training framework for multimodal LLMs that improves multi-image spatial reasoning through explicit cross-view correspondence and stepwise viewpoint transformation supervision.

DetailsMotivation: Current MLLMs struggle with multi-image spatial reasoning that requires integrating information from multiple viewpoints. While humans use cross-view correspondence and stepwise viewpoint transformation, existing approaches incorporate these mechanisms only partially and implicitly without explicit supervision for both.

Method: Proposes HATCH framework with two complementary objectives: (1) Patch-Level Spatial Alignment - encourages patch representations to align across views for spatially corresponding regions, and (2) Action-then-Answer Reasoning - requires models to generate explicit viewpoint transition actions before predicting final answers.

Result: Experiments on three benchmarks show HATCH consistently outperforms comparable-size baselines by clear margins and achieves competitive results against much larger models, while preserving single-image reasoning capabilities.

Conclusion: Explicit supervision for both cross-view correspondence and stepwise viewpoint transformation significantly improves multi-image spatial reasoning in MLLMs, enabling better performance with smaller model sizes.

Abstract: While multimodal large language models (MLLMs) have made substantial progress in single-image spatial reasoning, multi-image spatial reasoning, which requires integration of information from multiple viewpoints, remains challenging. Cognitive studies suggest that humans address such tasks through two mechanisms: cross-view correspondence, which identifies regions across different views that correspond to the same physical locations, and stepwise viewpoint transformation, which composes relative viewpoint changes sequentially. However, existing studies incorporate these mechanisms only partially and often implicitly, without explicit supervision for both. We propose Human-Aware Training for Cross-view correspondence and viewpoint cHange (HATCH), a training framework with two complementary objectives: (1) Patch-Level Spatial Alignment, which encourages patch representations to align across views for spatially corresponding regions, and (2) Action-then-Answer Reasoning, which requires the model to generate explicit viewpoint transition actions before predicting the final answer. Experiments on three benchmarks demonstrate that HATCH consistently outperforms baselines of comparable size by a clear margin and achieves competitive results against much larger models, while preserving single-image reasoning capabilities.
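One plausible reading of the first objective is an InfoNCE-style loss pulling together patch embeddings of corresponding regions across views, sketched below under the assumption that correspondence indices come from known cross-view geometry; the temperature and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def patch_alignment_loss(patches_a, patches_b, corr, temperature=0.07):
    """patches_a: (N, D) patches from view A; patches_b: (M, D) from view B;
    corr: (N,) index of each A-patch's corresponding B-patch (-1 = none)."""
    valid = corr >= 0
    a = F.normalize(patches_a[valid], dim=-1)
    b = F.normalize(patches_b, dim=-1)
    logits = a @ b.t() / temperature      # (n_valid, M) similarities
    # Each valid A-patch should score highest against its true B-patch.
    return F.cross_entropy(logits, corr[valid])

pa, pb = torch.randn(196, 768), torch.randn(196, 768)
corr = torch.randint(-1, 196, (196,))     # toy correspondence map
print(patch_alignment_loss(pa, pb, corr))
```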

[268] Shifting the Breaking Point of Flow Matching for Multi-Instance Editing

Carmine Zaccagnino, Fabio Quattrini, Enis Simsar, Marta Tintoré Gazulla, Rita Cucchiara, Alessio Tonioni, Silvia Cascianelli

Main category: cs.CV

TL;DR: Proposes Instance-Disentangled Attention for flow matching models to enable multi-instance image editing without semantic interference between concurrent edits.

DetailsMotivation: Existing flow-based image editors struggle with multi-instance scenarios where multiple parts of an image need independent editing without semantic interference, due to limitations of globally conditioned velocity fields and joint attention mechanisms.

Method: Introduces Instance-Disentangled Attention that partitions joint attention operations, enforcing binding between instance-specific textual instructions and spatial regions during velocity field estimation in flow matching models.

Result: Experimental results show the approach promotes edit disentanglement and locality while preserving global output coherence, enabling single-pass, instance-level editing on both natural images and text-dense infographics.

Conclusion: The proposed Instance-Disentangled Attention mechanism effectively addresses the multi-instance editing limitation in flow matching models, enabling independent editing of multiple image regions without semantic interference.

Abstract: Flow matching models have recently emerged as an efficient alternative to diffusion, especially for text-guided image generation and editing, offering faster inference through continuous-time dynamics. However, existing flow-based editors predominantly support global or single-instruction edits and struggle with multi-instance scenarios, where multiple parts of a reference input must be edited independently without semantic interference. We identify this limitation as a consequence of globally conditioned velocity fields and joint attention mechanisms, which entangle concurrent edits. To address this issue, we introduce Instance-Disentangled Attention, a mechanism that partitions joint attention operations, enforcing binding between instance-specific textual instructions and spatial regions during velocity field estimation. We evaluate our approach on both natural image editing and a newly introduced benchmark of text-dense infographics with region-level editing instructions. Experimental results demonstrate that our approach promotes edit disentanglement and locality while preserving global output coherence, enabling single-pass, instance-level editing.
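To illustrate the partitioning idea, here is a hedged sketch of a block attention mask that binds each instruction's text tokens to its region's image tokens while keeping background context globally visible; the token layout and the background rule are assumptions about how such a mask could be built, not the paper's code.

```python
import torch

def instance_disentangled_mask(img_instance_ids, txt_instance_ids):
    """img_instance_ids: (Ni,) instance id per image token (-1 = background);
    txt_instance_ids: (Nt,) instance id per text token.
    Returns a (Ni+Nt, Ni+Nt) boolean mask where True allows attention."""
    ids = torch.cat([img_instance_ids, txt_instance_ids])
    same = ids.unsqueeze(0) == ids.unsqueeze(1)         # same-instance tokens
    background = (ids.unsqueeze(0) == -1) | (ids.unsqueeze(1) == -1)
    # Background stays globally visible; instance tokens attend only within
    # their own instruction-region pair, preventing cross-edit leakage.
    return same | background

img_ids = torch.tensor([0, 0, 1, -1, -1])   # 5 image tokens, 2 edited regions
txt_ids = torch.tensor([0, 0, 1, 1])        # 2 instance-specific instructions
print(instance_disentangled_mask(img_ids, txt_ids).int())
```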

[269] Grow with the Flow: 4D Reconstruction of Growing Plants with Gaussian Flow Fields

Weihan Luo, Lily Goli, Sherwin Bahmani, Felix Taubner, Andrea Tagliasacchi, David B. Lindell

Main category: cs.CV

TL;DR: A 3D Gaussian flow field representation for modeling plant growth dynamics with time-varying Gaussian parameters, using reverse growth initialization for accurate reconstruction.

DetailsMotivation: Existing motion modeling techniques are ill-suited for plant growth because plants generate new geometry over time, unlike typical dynamic scenes. Deformation fields can't introduce new geometry, and 4D Gaussian splatting has limitations in tracking Gaussian primitives over time.

Method: Introduces a 3D Gaussian flow field representation that models plant growth as time-varying derivatives over Gaussian parameters (position, scale, orientation, color, opacity). Uses reverse growth initialization by reconstructing the mature plant first and learning reverse developmental history.

Result: Achieves superior image quality and geometric accuracy compared to prior methods on multi-view timelapse datasets of plant growth.

Conclusion: Provides a new approach for appearance modeling of growing 3D structures, specifically addressing the unique challenges of plant growth dynamics.

Abstract: Modeling the time-varying 3D appearance of plants during their growth poses unique challenges: unlike many dynamic scenes, plants generate new geometry over time as they expand, branch, and differentiate. Recent motion modeling techniques are ill-suited to this problem setting. For example, deformation fields cannot introduce new geometry, and 4D Gaussian splatting constrains motion to a linear trajectory in space and time and cannot track the same set of Gaussians over time. Here, we introduce a 3D Gaussian flow field representation that models plant growth as a time-varying derivative over Gaussian parameters – position, scale, orientation, color, and opacity – enabling nonlinear and continuous-time growth dynamics. To initialize a sufficient set of Gaussian primitives, we reconstruct the mature plant and learn a process of reverse growth, effectively simulating the plant’s developmental history in reverse. Our approach achieves superior image quality and geometric accuracy compared to prior methods on multi-view timelapse datasets of plant growth, providing a new approach for appearance modeling of growing 3D structures.
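A toy sketch of the core mechanic follows: a network predicts time derivatives of Gaussian parameters, which are integrated with explicit Euler steps. The MLP, the 14-dimensional parameter packing, and the step count are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GaussianFlowField(nn.Module):
    """Maps (Gaussian params, time) -> d(params)/dt for nonlinear growth."""
    def __init__(self, param_dim=14):            # xyz + scale + quat + rgb + opacity
        super().__init__()
        self.net = nn.Sequential(nn.Linear(param_dim + 1, 128), nn.SiLU(),
                                 nn.Linear(128, param_dim))

    def forward(self, params, t):
        t_col = torch.full((params.shape[0], 1), float(t))
        return self.net(torch.cat([params, t_col], dim=-1))

def grow(field, params_t0, t0=0.0, t1=1.0, steps=100):
    """Continuous-time growth via explicit Euler integration."""
    dt = (t1 - t0) / steps
    p, t = params_t0, t0
    for _ in range(steps):
        p = p + dt * field(p, t)
        t += dt
    return p

field = GaussianFlowField()
gaussians = torch.randn(1000, 14)   # e.g., initialized from the mature plant
final = grow(field, gaussians)      # then played forward (or in reverse)
print(final.shape)
```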

cs.AI

[270] A Small-Scale System for Autoregressive Program Synthesis Enabling Controlled Experimentation

Russ Webb, Jason Ramapuram

Main category: cs.AI

TL;DR: Cadmus: A system for studying program synthesis with small, affordable models that outperform GPT-5 on integer arithmetic tasks while providing full transparency into training data relationships.

DetailsMotivation: Large language models for program synthesis have issues with distribution understanding, fine-tuning effects, tokenization impacts, and high computational costs. Researchers need affordable, transparent systems for studying program completion, out-of-distribution representations, inductive reasoning, and instruction following.

Method: Developed Cadmus system with: 1) integer virtual machine, 2) dataset of diverse true programs, 3) autoregressive transformer model trained for under $200. Provides fine-grained control over training distribution and model inspection capabilities.

Result: Cadmus models achieve 100% accuracy on integer arithmetic program completion tasks, outperforming GPT-5’s 95% accuracy. System enables transparent investigation of dataset-task relationships, unlike GPT-5 which brings unknown priors into reasoning.

Conclusion: Small, affordable models like Cadmus enable controlled program synthesis research with full transparency, addressing limitations of large LLMs where unknown priors and opaque training relationships prevent certain investigations.

Abstract: What research can be pursued with small models trained to complete true programs? Typically, researchers study program synthesis via large language models (LLMs) which introduce issues such as knowing what is in or out of distribution, understanding fine-tuning effects, understanding the effects of tokenization, and higher demand on compute and storage to carry out experiments. We present a system called Cadmus which includes an integer virtual machine (VM), a dataset composed of true programs of diverse tasks, and an autoregressive transformer model that is trained for under $200 of compute cost. The system can be used to study program completion, out-of-distribution representations, inductive reasoning, and instruction following in a setting where researchers have effective and affordable fine-grained control of the training distribution and the ability to inspect and instrument models. Smaller models working on complex reasoning tasks enable instrumentation and investigations that may be prohibitively expensive on larger models. To demonstrate that these tasks are complex enough to be of interest, we show that these Cadmus models outperform GPT-5 (by achieving 100% accuracy while GPT-5 has 95% accuracy) even on a simple task of completing correct, integer arithmetic programs in our domain-specific language (DSL) while providing transparency into the dataset’s relationship to the problem. We also show that GPT-5 brings unknown priors into its reasoning process when solving the same tasks, demonstrating a confounding factor that prevents the use of large-scale LLMs for some investigations where the training set relationship to the task needs to be fully understood.
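Since the summary does not spell out the DSL, here is a purely hypothetical miniature integer VM in the same spirit, just to show how "true programs" make completion targets exactly checkable by re-execution; the opcodes and encoding are invented.

```python
def run(program, registers=None, max_steps=1000):
    """program: list of (op, dst, a, b) tuples over integer registers r0..r7.
    For 'set', `a` is an immediate; for the rest, `a` and `b` index registers."""
    r = registers or [0] * 8
    ops = {
        "set": lambda a, b: a,
        "add": lambda a, b: r[a] + r[b],
        "sub": lambda a, b: r[a] - r[b],
        "mul": lambda a, b: r[a] * r[b],
    }
    for step, (op, dst, a, b) in enumerate(program):
        if step >= max_steps:
            break
        r[dst] = ops[op](a, b)
    return r

# A "true program": its final state is its own ground-truth label, so any
# model completion can be verified exactly by re-executing on the VM.
prog = [("set", 0, 6, 0), ("set", 1, 7, 0), ("mul", 2, 0, 1)]
print(run(prog))   # r2 == 42
```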

[271] Uncertainty-Aware Multimodal Emotion Recognition through Dirichlet Parameterization

Rémi Grzeczkowicz, Eric Soriano, Ali Janati, Miyu Zhang, Gerard Comas-Quiles, Victor Carballo Araruna, Aneesh Jonelagadda

Main category: cs.AI

TL;DR: Lightweight multimodal emotion recognition framework for edge devices using speech, text, and facial imagery with uncertainty-aware fusion based on Dempster-Shafer theory

DetailsMotivation: To create a privacy-preserving, efficient multimodal emotion recognition system deployable on edge devices that can handle uncertainty across modalities and ambiguous/missing inputs

Method: Uses dedicated backbones for each modality (Emotion2Vec for speech, ResNet for facial expressions, DistilRoBERTa for text) with model-agnostic fusion mechanism based on Dempster-Shafer theory and Dirichlet evidence operating on model logits

Result: Achieves competitive accuracy on five benchmark datasets (eNTERFACE05, MEAD, MELD, RAVDESS, CREMA-D) while remaining computationally efficient and robust to ambiguous/missing inputs

Conclusion: The framework emphasizes modularity, scalability, and real-world feasibility for uncertainty-aware multimodal systems in healthcare, human-computer interaction, and emotion-informed applications

Abstract: In this work, we present a lightweight and privacy-preserving Multimodal Emotion Recognition (MER) framework designed for deployment on edge devices. To demonstrate the framework’s versatility, our implementation uses three modalities: speech, text, and facial imagery. However, the system is fully modular and can be extended to support other modalities or tasks. Each modality is processed through a dedicated backbone optimized for inference efficiency: Emotion2Vec for speech, a ResNet-based model for facial expressions, and DistilRoBERTa for text. To reconcile uncertainty across modalities, we introduce a model- and task-agnostic fusion mechanism grounded in Dempster-Shafer theory and Dirichlet evidence. Operating directly on model logits, this approach captures predictive uncertainty without requiring additional training or joint distribution estimation, making it broadly applicable beyond emotion recognition. Validation on five benchmark datasets (eNTERFACE05, MEAD, MELD, RAVDESS and CREMA-D) shows that our method achieves competitive accuracy while remaining computationally efficient and robust to ambiguous or missing inputs. Overall, the proposed framework emphasizes modularity, scalability, and real-world feasibility, paving the way toward uncertainty-aware multimodal systems for healthcare, human-computer interaction, and other emotion-informed applications.
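A minimal numpy sketch of Dirichlet-evidence fusion on classifier logits, following the reduced Dempster-Shafer combination commonly used in evidential deep learning; the softplus evidence mapping and the toy logits are assumptions rather than the paper's exact recipe.

```python
import numpy as np

def opinion_from_logits(logits):
    """Map logits to a subjective opinion (belief per class + uncertainty)."""
    evidence = np.log1p(np.exp(logits))   # softplus: non-negative evidence
    alpha = evidence + 1.0                # Dirichlet parameters
    S, K = alpha.sum(), len(logits)
    return evidence / S, K / S            # beliefs b, uncertainty u

def fuse(b1, u1, b2, u2):
    """Reduced Dempster's rule for two opinions over the same classes."""
    conflict = np.sum(np.outer(b1, b2)) - np.sum(b1 * b2)
    norm = 1.0 - conflict
    b = (b1 * b2 + b1 * u2 + b2 * u1) / norm
    u = (u1 * u2) / norm
    return b, u

# Fuse speech-, face-, and text-model logits for a 4-class emotion problem.
speech = np.array([2.0, 0.1, -1.0, 0.3])
face = np.array([1.5, 0.4, -0.5, 0.2])
text = np.array([-0.2, 0.1, 0.0, 0.1])    # ambiguous -> high uncertainty
b, u = opinion_from_logits(speech)
for logits in (face, text):
    b, u = fuse(b, u, *opinion_from_logits(logits))
print("fused beliefs:", b.round(3), "uncertainty:", round(u, 3))
```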

[272] PABU: Progress-Aware Belief Update for Efficient LLM Agents

Haitao Jiang, Lin Ge, Hengrui Cai, Rui Song

Main category: cs.AI

TL;DR: PABU: A progress-aware belief update framework for LLM agents that selectively retains relevant history to reduce redundant actions and improve efficiency.

DetailsMotivation: Current LLM agents condition actions on full action-observation histories, which introduces task-irrelevant information leading to redundant actions and higher inference costs. There's a need for more efficient state representation that focuses on task-relevant information.

Method: Proposes Progress-Aware Belief Update (PABU), a belief-state framework that explicitly models task progress and selectively retains past actions and observations. At each step, the agent predicts relative progress and decides whether new interactions should be stored, conditioning future decisions only on the retained subset.

Result: Across eight environments in AgentGym benchmark, PABU achieves 81.0% task completion rate, outperforming previous SoTA models with full-history belief by 23.9%. Reduces average interaction steps to 9.5 (26.9% reduction). Ablation studies confirm both progress prediction and selective retention are necessary.

Conclusion: PABU demonstrates that explicit progress modeling and selective history retention significantly improve LLM agent efficiency and performance by focusing on task-relevant information and reducing redundant actions.

Abstract: Large Language Model (LLM) agents commonly condition actions on full action-observation histories, which introduce task-irrelevant information that easily leads to redundant actions and higher inference cost. We propose Progress-Aware Belief Update (PABU), a belief-state framework that compactly represents an agent’s state by explicitly modeling task progress and selectively retaining past actions and observations. At each step, the agent predicts its relative progress since the previous round and decides whether the newly encountered interaction should be stored, conditioning future decisions only on the retained subset. Across eight environments in the AgentGym benchmark, and using identical training trajectories, PABU achieves an 81.0% task completion rate, outperforming previous state-of-the-art (SoTA) models with full-history belief by 23.9%. Additionally, PABU’s progress-oriented action selection improves efficiency, reducing the average number of interaction steps to 9.5, corresponding to a 26.9% reduction. Ablation studies show that both explicit progress prediction and selective retention are necessary for robust belief learning and performance gains.
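A schematic, runnable sketch of the selective-retention loop follows; the keep/drop judge and the interaction trace are toy stand-ins for the LLM's progress prediction, invented here only to show the control flow.

```python
def judge_keep(action, observation):
    """Stand-in for the LLM's progress prediction: retain an interaction only
    if it visibly changed the world (a crude proxy for positive progress)."""
    return "unchanged" not in observation

def pabu_episode(steps):
    belief = []                          # compact state: retained subset only
    for action, observation in steps:
        if judge_keep(action, observation):
            belief.append((action, observation))
        # The policy would condition on `belief`, not the full history.
    return belief

trace = [("look", "you see a locked chest"),
         ("push wall", "unchanged"),
         ("push wall", "unchanged"),     # redundant actions get dropped
         ("use key", "the chest opens")]
print(pabu_episode(trace))               # 2 retained of 4 interactions
```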

[273] CoMMa: Contribution-Aware Medical Multi-Agents From A Game-Theoretic Perspective

Yichen Wu, Yujin Oh, Sangjoon Park, Kailong Fan, Dania Daye, Hana Farzaneh, Xiang Li, Raul Uppot, Quanzheng Li

Main category: cs.AI

TL;DR: CoMMa is a decentralized multi-agent LLM framework for oncology decision support that uses game-theoretic objectives and deterministic embedding projections for contribution-aware credit assignment, improving interpretability and stability over traditional approaches.

DetailsMotivation: Oncology decision support requires reasoning over dynamic, heterogeneous patient data, and existing multi-agent frameworks often rely on stochastic narrative-based reasoning which lacks interpretability and stability. There's a need for more robust, mathematically grounded approaches that provide explicit evidence attribution.

Method: CoMMa uses a decentralized LLM-agent framework where specialists operate on partitioned evidence and coordinate through game-theoretic objectives. Instead of stochastic reasoning, it employs deterministic embedding projections to approximate contribution-aware credit assignment, estimating each agent’s marginal utility for explicit evidence attribution.

Result: CoMMa achieves higher accuracy and more stable performance than data-centralized and role-based multi-agent baselines on diverse oncology benchmarks, including a real-world multidisciplinary tumor board dataset.

Conclusion: The framework provides interpretable and mathematically grounded decision pathways with improved stability, demonstrating the value of contribution-aware credit assignment in medical multi-agent systems for oncology decision support.

Abstract: Recent multi-agent frameworks have broadened the ability to tackle oncology decision support tasks that require reasoning over dynamic, heterogeneous patient data. We propose Contribution-Aware Medical Multi-Agents (CoMMa), a decentralized LLM-agent framework in which specialists operate on partitioned evidence and coordinate through a game-theoretic objective for robust decision-making. In contrast to most agent architectures relying on stochastic narrative-based reasoning, CoMMa utilizes deterministic embedding projections to approximate contribution-aware credit assignment. This yields explicit evidence attribution by estimating each agent’s marginal utility, producing interpretable and mathematically grounded decision pathways with improved stability. Evaluated on diverse oncology benchmarks, including a real-world multidisciplinary tumor board dataset, CoMMa achieves higher accuracy and more stable performance than data-centralized and role-based multi-agent baselines.

[274] FlyAOC: Evaluating Agentic Ontology Curation of Drosophila Scientific Knowledge Bases

Xingjian Zhang, Sophia Moylan, Ziyang Xiong, Qiaozhu Mei, Yichen Luo, Jiaqi W. Ma

Main category: cs.AI

TL;DR: FlyBench evaluates AI agents on end-to-end scientific knowledge base curation from literature, requiring agents to search papers and produce structured gene annotations for Drosophila.

DetailsMotivation: Existing benchmarks focus on isolated NLP subtasks but don't capture the complete workflow of scientific knowledge base curation, which requires searching literature, reconciling evidence, and producing ontology-grounded annotations.

Method: Created FlyBench with 7,397 expert-curated annotations across 100 genes from FlyBase, requiring agents to search 16,898 full-text papers given only a gene symbol and produce structured annotations including Gene Ontology terms, expression patterns, and historical synonyms.

Result: Multi-agent architectures outperform simpler alternatives, but scaling backbone models yields diminishing returns. All baselines leave substantial room for improvement. Agents primarily use retrieval to confirm parametric knowledge rather than discover new information.

Conclusion: FlyBench provides a comprehensive benchmark for evaluating AI agents on scientific knowledge curation, revealing architectural insights and limitations to guide future development in retrieval-augmented scientific reasoning.

Abstract: Scientific knowledge bases accelerate discovery by curating findings from primary literature into structured, queryable formats for both human researchers and emerging AI systems. Maintaining these resources requires expert curators to search relevant papers, reconcile evidence across documents, and produce ontology-grounded annotations - a workflow that existing benchmarks, focused on isolated subtasks like named entity recognition or relation extraction, do not capture. We present FlyBench to evaluate AI agents on end-to-end agentic ontology curation from scientific literature. Given only a gene symbol, agents must search and read from a corpus of 16,898 full-text papers to produce structured annotations: Gene Ontology terms describing function, expression patterns, and historical synonyms linking decades of nomenclature. The benchmark includes 7,397 expert-curated annotations across 100 genes drawn from FlyBase, the Drosophila (fruit fly) knowledge base. We evaluate four baseline agent architectures: memorization, fixed pipeline, single-agent, and multi-agent. We find that architectural choices significantly impact performance, with multi-agent designs outperforming simpler alternatives, yet scaling backbone models yields diminishing returns. All baselines leave substantial room for improvement. Our analysis surfaces several findings to guide future development; for example, agents primarily use retrieval to confirm parametric knowledge rather than discover new information. We hope FlyBench will drive progress on retrieval-augmented scientific reasoning, a capability with broad applications across scientific domains.

[275] Human Control Is the Anchor, Not the Answer: Early Divergence of Oversight in Agentic AI Communities

Hanjing Shi, Dominic DiFranzo

Main category: cs.AI

TL;DR: Analysis of two Reddit communities shows early-stage divergence in AI oversight expectations based on socio-technical roles: deployment-focused community emphasizes action-risk control, while social-interaction community focuses on meaning-risk and accountability.

DetailsMotivation: To understand how AI oversight expectations differ based on socio-technical roles before norms stabilize, examining early-stage crystallization of expectations in different user communities.

Method: Comparative analysis of two Reddit communities (r/OpenClaw for deployment/operations and r/Moltbook for agent-centered social interaction) using topic modeling, oversight-theme abstraction, engagement-weighted salience, and divergence tests (JSD, cosine similarity, permutation tests).

Result: Communities are strongly separable (JSD=0.418, cosine=0.372, p=0.0005). Both use “human control” as an anchor term but with divergent operational meanings: r/OpenClaw emphasizes execution guardrails and recovery (action-risk), while r/Moltbook emphasizes identity, legitimacy, and accountability in public interaction (meaning-risk).

Conclusion: Oversight mechanisms should be role-specific rather than one-size-fits-all, with different control policies needed for deployment/operations vs. social interaction contexts.

Abstract: Oversight for agentic AI is often discussed as a single goal (“human control”), yet early adoption may produce role-specific expectations. We present a comparative analysis of two newly active Reddit communities in Jan–Feb 2026 that reflect different socio-technical roles: r/OpenClaw (deployment and operations) and r/Moltbook (agent-centered social interaction). We conceptualize this period as an early-stage crystallization phase, where oversight expectations form before norms reach equilibrium. Using topic modeling in a shared comparison space, a coarse-grained oversight-theme abstraction, engagement-weighted salience, and divergence tests, we show the communities are strongly separable (JSD = 0.418, cosine = 0.372, permutation $p=0.0005$). Across both communities, “human control” is an anchor term, but its operational meaning diverges: r/OpenClaw emphasizes execution guardrails and recovery (action-risk), while r/Moltbook emphasizes identity, legitimacy, and accountability in public interaction (meaning-risk). The resulting distinction offers a portable lens for designing and evaluating oversight mechanisms that match agent role, rather than applying one-size-fits-all control policies.
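The divergence machinery here is standard; a sketch of Jensen-Shannon divergence plus a label-permutation test over per-document topic vectors is shown below, with toy Dirichlet-sampled "documents" standing in for the real topic model outputs.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def jsd(p, q):
    return jensenshannon(p, q, base=2) ** 2     # squared distance = divergence

def permutation_test(docs_a, docs_b, n_perm=2000, seed=0):
    """Shuffle community labels over per-document topic vectors and compare
    the observed JSD of the pooled distributions against the null."""
    rng = np.random.default_rng(seed)
    observed = jsd(docs_a.mean(0), docs_b.mean(0))
    pooled = np.vstack([docs_a, docs_b])
    n_a = len(docs_a)
    null = []
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        null.append(jsd(pooled[idx[:n_a]].mean(0), pooled[idx[n_a:]].mean(0)))
    p_value = (np.sum(np.array(null) >= observed) + 1) / (n_perm + 1)
    return observed, p_value

rng = np.random.default_rng(1)
a = rng.dirichlet(np.ones(10) * 2.0, size=200)        # community-A-like docs
b = rng.dirichlet(np.linspace(0.5, 4, 10), size=200)  # community-B-like docs
print(permutation_test(a, b))
```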

[276] Measuring Dataset Diversity from a Geometric Perspective

Yang Ba, Mohammad Sadeq Abolhasani, Michelle V Mancenido, Rong Pan

Main category: cs.AI

TL;DR: A topological data analysis framework using persistence landscapes to measure dataset diversity by capturing geometric structure beyond traditional entropy-based metrics.

DetailsMotivation: Existing diversity metrics focus on statistical variation and entropy but largely ignore the geometric structure of datasets, which contains meaningful information about diversity that current methods miss.

Method: Proposes PLDiv, a diversity metric based on topological data analysis (TDA) and persistence landscapes (PLs) that extracts and quantifies geometric features from data to capture structural richness.

Result: Extensive experiments across diverse modalities show PLDiv is powerful, reliable, and interpretable, directly linking data diversity to underlying geometry and outperforming traditional entropy-based metrics.

Conclusion: The PLDiv framework provides a theoretically grounded tool for measuring geometric diversity, offering foundational value for dataset construction, augmentation, and evaluation across multiple domains.

Abstract: Diversity can be broadly defined as the presence of meaningful variation across elements, which can be viewed from multiple perspectives, including statistical variation and geometric structural richness in the dataset. Existing diversity metrics, such as feature-space dispersion and metric-space magnitude, primarily capture distributional variation or entropy, while largely neglecting the geometric structure of datasets. To address this gap, we introduce a framework based on topological data analysis (TDA) and persistence landscapes (PLs) to extract and quantify geometric features from data. This approach provides a theoretically grounded means of measuring diversity beyond entropy, capturing the rich geometric and structural properties of datasets. Through extensive experiments across diverse modalities, we demonstrate that our proposed PLs-based diversity metric (PLDiv) is powerful, reliable, and interpretable, directly linking data diversity to its underlying geometry and offering a foundational tool for dataset construction, augmentation, and evaluation.
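For readers unfamiliar with persistence landscapes: each (birth, death) pair in a persistence diagram becomes a tent function, and the k-th landscape is the k-th largest tent value at each point. The pure-numpy sketch below computes landscapes and uses their L2 norm as a crude geometric-richness score; treating that norm as the diversity metric is our illustrative reading, not necessarily PLDiv's exact definition.

```python
import numpy as np

def persistence_landscape(pairs, grid, k_max=3):
    """pairs: (n, 2) array of (birth, death); returns (k_max, len(grid))."""
    b, d = pairs[:, 0:1], pairs[:, 1:2]
    tents = np.maximum(0.0, np.minimum(grid - b, d - grid))  # (n, len(grid))
    tents.sort(axis=0)                                       # ascending per t
    return tents[::-1][:k_max]                               # k largest per t

def landscape_norm(pairs, grid, k_max=3):
    """L2 norm of the landscape: larger = richer geometric structure."""
    pl = persistence_landscape(pairs, grid, k_max)
    return float(np.sqrt(np.trapz(pl ** 2, grid, axis=1).sum()))

grid = np.linspace(0, 2, 200)
clustered = np.array([[0.0, 0.1], [0.0, 0.12], [0.05, 0.1]])  # short-lived
diverse = np.array([[0.0, 1.5], [0.2, 1.0], [0.4, 1.8]])      # persistent
print(landscape_norm(clustered, grid), landscape_norm(diverse, grid))
```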

[277] Auditing Multi-Agent LLM Reasoning Trees Outperforms Majority Vote and LLM-as-Judge

Wei Yang, Shixuan Li, Heng Ping, Peiyu Zhang, Paul Bogdan, Jesse Thomason

Main category: cs.AI

TL;DR: AgentAuditor replaces majority voting in multi-agent systems with reasoning tree analysis and conflict resolution at divergence points, plus ACPO training to avoid consensus errors.

DetailsMotivation: Current multi-agent LLM systems use majority voting which discards reasoning structure and is vulnerable to confabulation consensus where agents share correlated biases and converge on incorrect rationales.

Method: AgentAuditor uses path search over a Reasoning Tree representing agreements/divergences in agent traces, resolving conflicts at critical divergence points. ACPO trains adjudicators on majority-failure cases, rewarding evidence-based minority selections over popular errors.

Result: Across 5 popular multi-agent settings, AgentAuditor yields up to 5% absolute accuracy improvement over majority vote and up to 3% over LLM-as-Judge approaches.

Conclusion: AgentAuditor provides a more robust alternative to majority voting by leveraging reasoning structure and addressing consensus failures, applicable across various multi-agent system settings.

Abstract: Multi-agent systems (MAS) can substantially extend the reasoning capacity of large language models (LLMs), yet most frameworks still aggregate agent outputs with majority voting. This heuristic discards the evidential structure of reasoning traces and is brittle under the confabulation consensus, where agents share correlated biases and converge on the same incorrect rationale. We introduce AgentAuditor, which replaces voting with a path search over a Reasoning Tree that explicitly represents agreements and divergences among agent traces. AgentAuditor resolves conflicts by comparing reasoning branches at critical divergence points, turning global adjudication into efficient, localized verification. We further propose Anti-Consensus Preference Optimization (ACPO), which trains the adjudicator on majority-failure cases and rewards evidence-based minority selections over popular errors. AgentAuditor is agnostic to the MAS setting, and we find across 5 popular settings that it yields up to 5% absolute accuracy improvement over a majority vote, and up to 3% over using LLM-as-Judge.
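A simplified sketch of the tree-walking idea: build a prefix tree over agent traces, follow the shared prefix, and adjudicate locally at each divergence point instead of voting on final answers. The `adjudicate` callable stands in for the trained verifier (e.g., an ACPO-tuned model); the string-valued steps are a toy simplification.

```python
from collections import defaultdict

def build_tree(traces):
    """traces: list of step lists. Returns prefix -> set of next steps."""
    children = defaultdict(set)
    for trace in traces:
        for i in range(len(trace)):
            children[tuple(trace[:i])].add(trace[i])
    return children

def audit(traces, adjudicate):
    """Walk the shared prefix; at each divergence, pick a branch locally."""
    children = build_tree(traces)
    prefix = []
    while True:
        options = children[tuple(prefix)]
        if not options:
            return prefix
        if len(options) == 1:
            prefix.append(next(iter(options)))   # all agents agree here
        else:
            prefix.append(adjudicate(prefix, sorted(options)))

traces = [["parse", "x=3", "answer 9"],
          ["parse", "x=3", "answer 9"],
          ["parse", "x=4", "answer 16"]]         # the minority may be right
pick_minority = lambda prefix, opts: opts[-1]    # toy adjudicator
print(audit(traces, pick_minority))
```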

[278] Not-in-Perspective: Towards Shielding Google’s Perspective API Against Adversarial Negation Attacks

Michail S. Alexiou, J. Sukarno Mertoguno

Main category: cs.AI

TL;DR: A formal reasoning wrapper approach to improve toxicity detection systems against negation-based adversarial attacks in social media content moderation.

DetailsMotivation: The paper addresses the growing problem of cyberbullying and toxic comments on social media platforms, noting that existing machine/deep learning-based toxicity detection systems are vulnerable to adversarial attacks involving logical modifications like negation in phrases and sentences.

Method: Proposes formal reasoning-based methodologies that wrap around existing ML toxicity detection systems, acting as both pre-processing and post-processing steps to handle negation attacks. The hybrid approach combines formal reasoning with machine learning models.

Result: Experimental evaluation shows the formal reasoning wrapper significantly improves accuracy and efficacy of toxicity scoring against negation adversarial datasets, with hybrid methods outperforming purely statistical solutions.

Conclusion: Formal reasoning wrappers can effectively mitigate negation attack problems in toxicity detection systems, creating more robust hybrid approaches for online content moderation.

Abstract: The rise of cyberbullying on social media platforms involving toxic comments has escalated the need for effective ways to monitor and moderate online interactions. Existing automated toxicity detection systems are based on machine or deep learning algorithms. However, statistics-based solutions are generally prone to adversarial attacks that contain logic-based modifications such as negation in phrases and sentences. In that regard, we present a set of formal reasoning-based methodologies that wrap around existing machine learning toxicity detection systems. Acting as both pre-processing and post-processing steps, our formal reasoning wrapper helps alleviate the negation attack problem and significantly improves the accuracy and efficacy of toxicity scoring. We evaluate different variations of our wrapper on multiple machine learning models against a negation adversarial dataset. Experimental results highlight the improvement of hybrid (formal reasoning and machine-learning) methods over various purely statistical solutions.
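To illustrate the wrapper pattern (not the paper's actual rules), here is a deliberately simplistic sketch: a pre-processing step detects negation, and a post-processing step discounts the toxicity of the negated core phrase. The scorer is a stub where a real model or API call would go, and the negator list and discount factor are invented.

```python
import re

NEGATORS = r"\b(not|never|no one should|nobody should)\b"

def preprocess(text):
    """Flag negation and expose the sentence's core phrase for rescoring."""
    negated = re.search(NEGATORS, text.lower()) is not None
    stripped = re.sub(NEGATORS, "", text.lower()).strip()
    return stripped, negated

def wrapped_toxicity(text, score_fn):
    stripped, negated = preprocess(text)
    raw = score_fn(text)
    if negated:
        core = score_fn(stripped)
        # Post-processing: a negated toxic statement should not inherit the
        # full toxicity of its core phrase; discount toward the raw score.
        return min(raw, 0.5 * core)
    return raw

toy_scorer = lambda t: 0.9 if "stupid" in t else 0.1   # stub for the ML model
print(wrapped_toxicity("you are stupid", toy_scorer))                    # high
print(wrapped_toxicity("I would never say you are stupid", toy_scorer))  # discounted
```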

[279] Image Quality in the Era of Artificial Intelligence

Jana G. Delfino, Jason L. Granstedt, Frank W. Samuelson, Robert Ochs, Krishna Juluru

Main category: cs.AI

TL;DR: AI in radiology improves image quality and speed but introduces new failure modes and disconnects between perceived quality and actual information content, requiring awareness of limitations for safe use.

DetailsMotivation: While AI is rapidly being deployed in radiology for image reconstruction and enhancement, it introduces new failure modes and can create a disconnect between perceived image quality and actual information content, necessitating awareness of limitations for safe clinical use.

Method: This is a communication/position paper that discusses the limitations of AI-enabled image reconstruction and enhancement in radiology through conceptual analysis rather than presenting a specific technical method.

Result: The paper identifies key limitations of AI in radiological imaging, including new failure modes and the potential mismatch between perceived image quality and actual diagnostic information content.

Conclusion: Understanding the limitations of AI in image reconstruction and enhancement is critical for safe and effective clinical use, enabling users to benefit from the technology while minimizing associated risks.

Abstract: Artificial intelligence (AI) is being deployed within radiology at a rapid pace. AI has proven an excellent tool for reconstructing and enhancing images so that they appear sharper, smoother, and more detailed, can be acquired more quickly, and can be reviewed by clinicians more rapidly. However, incorporation of AI also introduces new failure modes and can exacerbate the disconnect between the perceived quality of an image and the information content of that image. Understanding the limitations of AI-enabled image reconstruction and enhancement is critical for safe and effective use of the technology. Hence, the purpose of this communication is to bring awareness to limitations that arise when AI is used to reconstruct or enhance a radiological image, with the goal of enabling users to reap the benefits of the technology while minimizing risks.

[280] Axiomatic Choice

Ben Abramowitz, Nicholas Mattei

Main category: cs.AI

TL;DR: A framework called Axiomatic Choice that formalizes normative concerns in decision-making using axioms, with applications to ethical constraints, transparency, and deception analysis.

DetailsMotivation: People care about both decision outcomes and decision-making processes, but formalizing the full range of normative concerns that drive decisions remains an open challenge. The paper aims to address this gap by providing a framework that captures diverse desiderata beyond typical social choice theory.

Method: Introduces Axiomatic Choice as a framework for making and evaluating decisions based on formal normative statements (axioms). The model includes properties, a taxonomy of axioms, and formal definitions for key concepts like transparency and deception in decision explanations.

Result: Defines the Decision-Evaluation Paradox, formalizes concepts of transparency and deception in decision justification, and reveals limitations of existing axiomatic decision-making methods. Provides a taxonomy of axioms that may be of general interest to the field.

Conclusion: Axiomatic Choice offers a comprehensive framework for incorporating diverse normative concerns into decision-making, addressing both ethical constraints and procedural aspects of how decisions are made and justified.

Abstract: People care about decision outcomes and how decisions get made, both when making decisions and reflecting on decisions. But formalizing the full range of normative concerns that drive decisions is an open challenge. We introduce Axiomatic Choice as a framework for making and evaluating decisions based on formal normative statements about decisions. These statements, or axioms, capture a wide array of desiderata, e.g., ethical constraints, beyond the typical treatment in Social Choice. Using our model of axioms and decisions we define key properties and introduce a taxonomy of axioms which may be of general interest. We then use these properties and our taxonomy to define the Decision-Evaluation Paradox, formalize the concepts of transparency and deception in explaining and justifying decisions, and reveal the limits of existing methods using axioms to make decisions.
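
As a concrete reading of "axioms as formal normative statements", here is a toy sketch in which each axiom is a predicate over a decision context and a candidate choice; the specific axioms are invented illustrations, not the paper's taxonomy.

```python
# Toy rendering of axioms as predicates over (context, choice). A candidate
# decision is checked against the full axiom set; the axioms below are
# invented examples, not the paper's formalism.
from typing import Callable

Context = dict            # e.g. {"options": [...], "vetoed": {...}}
Axiom = Callable[[Context, str], bool]

def respects_veto(ctx: Context, choice: str) -> bool:
    """Ethical-constraint style axiom: never pick a vetoed option."""
    return choice not in ctx.get("vetoed", set())

def availability(ctx: Context, choice: str) -> bool:
    """Procedural axiom: the chosen option must be on the table."""
    return choice in ctx["options"]

def evaluate(axioms: list[Axiom], ctx: Context, choice: str) -> dict[str, bool]:
    return {a.__name__: a(ctx, choice) for a in axioms}

ctx = {"options": ["a", "b", "c"], "vetoed": {"c"}}
print(evaluate([respects_veto, availability], ctx, "c"))
# {'respects_veto': False, 'availability': True} -> choice violates an axiom
```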

[281] P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads

Yun Luo, Futing Wang, Qianjia Cheng, Fangchen Yu, Haodi Lei, Jianhao Yan, Chenxi Li, Jiacheng Chen, Yufeng Zhao, Haiyuan Wan, Yuchen Zhang, Shenghe Zheng, Junchi Yao, Qingyang Zhang, Haonan He, Wenxuan Zeng, Li Sheng, Chengxing Xie, Yuxin Zuo, Yizhuo Li, Yulun Wu, Rui Huang, Dongzhan Zhou, Kai Chen, Yu Qiao, Lei Bai, Yu Cheng, Ning Ding, Bowen Zhou, Peng Ye, Ganqu Cui

Main category: cs.AI

TL;DR: P1-VL: A family of open-source vision-language models for advanced scientific reasoning, particularly physics, using curriculum reinforcement learning and agentic augmentation to bridge visual-logical gaps in multimodal understanding.

DetailsMotivation: The paper addresses the challenge of transitioning LLMs from symbolic manipulation to science-grade reasoning, with physics as a critical test case. Physics requires maintaining physical consistency with universal laws, which fundamentally requires multimodal perception to ground abstract logic in reality. At Olympiad levels, diagrams contain essential constraints (boundary conditions, spatial symmetries) absent from text, creating a visual-logical gap that needs bridging.

Method: Introduces P1-VL family of open-source vision-language models using two key techniques: 1) Curriculum Reinforcement Learning with progressive difficulty expansion to stabilize post-training, and 2) Agentic Augmentation enabling iterative self-verification at inference time.

Result: On the HiPhO benchmark (13 exams from 2024-2025), the flagship P1-VL-235B-A22B becomes the first open-source VLM to secure 12 gold medals, achieving state-of-the-art performance among open-source models. The agent-augmented system achieves the No.2 overall rank globally, trailing only Gemini-3-Pro. Demonstrates remarkable scientific reasoning capacity and generalizability across STEM benchmarks.

Conclusion: P1-VL represents a foundational step toward general-purpose physical intelligence, better aligning visual perceptions with abstract physical laws for machine scientific discovery. By open-sourcing the models, the work advances multimodal reasoning capabilities in scientific domains.

Abstract: The transition from symbolic manipulation to science-grade reasoning represents a pivotal frontier for Large Language Models (LLMs), with physics serving as the critical test anchor for binding abstract logic to physical reality. Physics demands that a model maintain physical consistency with the laws governing the universe, a task that fundamentally requires multimodal perception to ground abstract logic in reality. At the Olympiad level, diagrams are often constitutive rather than illustrative, containing essential constraints, such as boundary conditions and spatial symmetries, that are absent from the text. To bridge this visual-logical gap, we introduce P1-VL, a family of open-source vision-language models engineered for advanced scientific reasoning. Our method harmonizes Curriculum Reinforcement Learning, which employs progressive difficulty expansion to stabilize post-training, with Agentic Augmentation, enabling iterative self-verification at inference. Evaluated on HiPhO, a rigorous benchmark of 13 exams from 2024-2025, our flagship P1-VL-235B-A22B becomes the first open-source Vision-Language Model (VLM) to secure 12 gold medals and achieves the state-of-the-art performance in the open-source models. Our agent-augmented system achieves the No.2 overall rank globally, trailing only Gemini-3-Pro. Beyond physics, P1-VL demonstrates remarkable scientific reasoning capacity and generalizability, establishing significant leads over base models in STEM benchmarks. By open-sourcing P1-VL, we provide a foundational step toward general-purpose physical intelligence to better align visual perceptions with abstract physical laws for machine scientific discovery.
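
The curriculum component can be pictured as a difficulty-gated sampler; the following sketch assumes a hypothetical promotion rule and problem-pool schema, and is not P1-VL's actual training recipe.

```python
# Minimal sketch of curriculum RL with progressive difficulty expansion: the
# sampler widens the admissible difficulty band as the policy clears the
# current one. The 0.7 promotion threshold and pool schema are assumptions.
import random

def sample_batch(pool, max_difficulty, batch_size=32):
    eligible = [p for p in pool if p["difficulty"] <= max_difficulty]
    return random.sample(eligible, min(batch_size, len(eligible)))

def curriculum_rl(pool, rl_update, success_rate, rounds=100):
    max_difficulty = 1                          # begin with the easiest tier
    for _ in range(rounds):
        rl_update(sample_batch(pool, max_difficulty))   # one RL update
        if success_rate(max_difficulty) > 0.7:  # hypothetical promotion rule
            max_difficulty += 1                 # expand the curriculum
    return max_difficulty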

[282] Agentifying Agentic AI

Virginia Dignum, Frank Dignum

Main category: cs.AI

TL;DR: Agentic AI requires combining data-driven approaches with structured models of cognition, cooperation, and governance from AAMAS research to create transparent, accountable autonomous systems.

DetailsMotivation: Current agentic AI systems lack explicit models for cognition, cooperation, and governance needed for sustained autonomy, reasoning, and interaction capabilities.

Method: Proposes leveraging AAMAS community tools including BDI architectures, communication protocols, mechanism design, and institutional modeling to complement data-driven approaches.

Result: Outlines a path toward agentic systems that are capable, flexible, transparent, cooperative, and accountable by bridging formal theory with practical autonomy.

Conclusion: Agentic AI needs structured models from AAMAS research to achieve true agency, combining adaptive data-driven methods with formal reasoning and coordination frameworks.

Abstract: Agentic AI seeks to endow systems with sustained autonomy, reasoning, and interaction capabilities. To realize this vision, its assumptions about agency must be complemented by explicit models of cognition, cooperation, and governance. This paper argues that the conceptual tools developed within the Autonomous Agents and Multi-Agent Systems (AAMAS) community, such as BDI architectures, communication protocols, mechanism design, and institutional modelling, provide precisely such a foundation. By aligning adaptive, data-driven approaches with structured models of reasoning and coordination, we outline a path toward agentic systems that are not only capable and flexible, but also transparent, cooperative, and accountable. The result is a perspective on agency that bridges formal theory and practical autonomy.

[283] SpotAgent: Grounding Visual Geo-localization in Large Vision-Language Models through Agentic Reasoning

Furong Jia, Ling Dai, Wenjin Deng, Fan Zhang, Chen Hu, Daxin Jiang, Yu Liu

Main category: cs.AI

TL;DR: SpotAgent is an agentic reasoning framework for geo-localization that combines visual interpretation with tool-assisted verification using web search and maps to address sparse, ambiguous visual cues in real-world scenarios.

DetailsMotivation: Current LVLMs struggle with geo-localization in real-world scenarios where visual cues are sparse, long-tailed, and ambiguous, often producing confident but ungrounded predictions due to reliance on internal knowledge without verification.

Method: SpotAgent formalizes geo-localization as an agentic reasoning process using the ReAct paradigm to actively explore and verify visual cues with external tools. It employs a 3-stage post-training pipeline: 1) SFT for basic alignment, 2) Agentic Cold Start using high-quality trajectories synthesized via a Multi-Agent framework for tool-calling expertise, and 3) Reinforcement Learning refinement with Spatially-Aware Dynamic Filtering to prioritize learnable samples.

Result: Extensive experiments on standard benchmarks show SpotAgent achieves state-of-the-art performance, effectively mitigating hallucinations while delivering precise and verifiable geo-localization.

Conclusion: SpotAgent successfully addresses the limitations of LVLMs in geo-localization by integrating agentic reasoning with external tool verification, providing a framework that produces accurate, verifiable results even with sparse and ambiguous visual evidence.

Abstract: Large Vision-Language Models (LVLMs) have demonstrated strong reasoning capabilities in geo-localization, yet they often struggle in real-world scenarios where visual cues are sparse, long-tailed, and highly ambiguous. Previous approaches, bound by internal knowledge, often fail to provide verifiable results, yielding confident but ungrounded predictions when faced with confounded evidence. To address these challenges, we propose SpotAgent, a framework that formalizes geo-localization into an agentic reasoning process that leverages expert-level reasoning to synergize visual interpretation with tool-assisted verification. SpotAgent actively explores and verifies visual cues by leveraging external tools (e.g., web search, maps) through the ReAct paradigm. We introduce a 3-stage post-training pipeline starting with a Supervised Fine-Tuning (SFT) stage for basic alignment, followed by an Agentic Cold Start phase utilizing high-quality trajectories synthesized via a Multi-Agent framework, aiming to instill tool-calling expertise. Subsequently, the model’s reasoning capabilities are refined through Reinforcement Learning. We propose a Spatially-Aware Dynamic Filtering strategy to enhance the efficiency of the RL stage by prioritizing learnable samples based on spatial difficulty. Extensive experiments on standard benchmarks demonstrate that SpotAgent achieves state-of-the-art performance, effectively mitigating hallucinations while delivering precise and verifiable geo-localization.
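
For readers unfamiliar with the loop structure, here is a minimal ReAct-style geo-localization sketch; the `llm` and tool interfaces are placeholders, not SpotAgent's actual API.

```python
# Hedged sketch of a ReAct-style geo-localization loop: the model alternates
# thought / tool call / observation until it commits to a location.
def react_geolocate(image, llm, tools, max_steps=8):
    trajectory = [{"role": "user", "content": "Locate this image.", "image": image}]
    for _ in range(max_steps):
        step = llm(trajectory)  # returns {"thought": ..., "action": ..., "argument": ...}
        trajectory.append({"role": "assistant", "content": step})
        if step["action"] == "final_answer":
            return step["argument"]           # e.g. {"lat": ..., "lon": ...}
        # e.g. tools = {"web_search": ..., "map_lookup": ...} (hypothetical names)
        observation = tools[step["action"]](step["argument"])
        trajectory.append({"role": "tool", "content": observation})
    return None  # no grounded answer within the step budget
```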

[284] Bridging Efficiency and Transparency: Explainable CoT Compression in Multimodal Large Reasoning Models

Yizhi Wang, Linan Yue, Min-Ling Zhang

Main category: cs.AI

TL;DR: XMCC is an explainable multimodal chain-of-thought compressor that uses reinforcement learning to shorten reasoning trajectories while preserving key steps and answer correctness, with natural-language explanations for compression decisions.

DetailsMotivation: Long chains of thought in multimodal reasoning models are often excessively lengthy and contain redundant steps, hindering inference efficiency. Existing compression approaches compromise visual-textual reasoning integrity and lack explainability about what information is critical.

Method: XMCC formulates compression as a sequential decision-making process optimized via reinforcement learning. It can shorten reasoning trajectories while preserving key reasoning steps and answer correctness, and simultaneously generates natural-language explanations for its compression decisions.

Result: Extensive experiments on representative multimodal reasoning benchmarks demonstrate that XMCC reduces reasoning length while providing natural-language explanations for its compression decisions, validating its effectiveness.

Conclusion: XMCC addresses the challenges of multimodal CoT compression by providing an explainable approach that maintains reasoning integrity while improving efficiency.

Abstract: Long chains of thought (Long CoTs) are widely employed in multimodal reasoning models to tackle complex tasks by capturing detailed visual information. However, these Long CoTs are often excessively lengthy and contain redundant reasoning steps, which can hinder inference efficiency. Compressing these long CoTs is a natural solution, yet existing approaches face two major challenges: (1) they may compromise the integrity of visual-textual reasoning by removing essential alignment cues, and (2) the compression process lacks explainability, making it difficult to discern which information is critical. To address these problems, we propose XMCC, an eXplainable Multimodal CoT Compressor that formulates compression as a sequential decision-making process optimized via reinforcement learning. XMCC can effectively shorten reasoning trajectories while preserving key reasoning steps and answer correctness, and simultaneously generates natural-language explanations for its compression decisions. Extensive experiments on representative multimodal reasoning benchmarks demonstrate that XMCC not only reduces reasoning length but also explains its compression decisions, validating its effectiveness.
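
A compact rendering of "compression as a sequential decision-making process" follows; the policy and verifier interfaces and the reward shaping are assumptions, not the paper's exact objective.

```python
# Sketch: a policy walks the chain of thought, emitting keep/drop plus a
# reason per step, and is rewarded for brevity only when the compressed chain
# still verifies to the correct answer.
def compress_cot(steps, policy, verifier, gold_answer):
    kept, decisions = [], []
    for step in steps:
        action, reason = policy(kept, step)   # ("keep"|"drop", explanation)
        decisions.append((step, action, reason))
        if action == "keep":
            kept.append(step)
    if verifier(kept) == gold_answer:         # answer preserved?
        reward = 1.0 + 0.1 * (len(steps) - len(kept)) / max(len(steps), 1)
    else:
        reward = -1.0                         # compression broke the reasoning
    return kept, decisions, reward
```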

[285] Computing Conditional Shapley Values Using Tabular Foundation Models

Lars Henry Berge Olsen, Dennis Christensen

Main category: cs.AI

TL;DR: TabPFN tabular foundation models enable efficient Shapley value computation for explainable AI by leveraging in-context learning to approximate conditional expectations without retraining.

DetailsMotivation: Shapley values are computationally expensive to compute, especially with dependent features, requiring many conditional expectation approximations. Traditional methods using Monte Carlo integration or regression are slow, and deep learning approaches require retraining for each conditional expectation.

Method: Use TabPFN tabular foundation models that leverage in-context learning to approximate conditional expectations without retraining. Compute Shapley values with multiple TabPFN variants and compare with state-of-the-art methods on simulated and real datasets.

Result: TabPFN yields best performance in most cases; where it doesn’t, it’s only marginally worse than the best method but at a fraction of the runtime.

Conclusion: Tabular foundation models like TabPFN offer efficient Shapley value computation, with potential for further improvements specifically adapted for conditional Shapley value estimation.

Abstract: Shapley values have become a cornerstone of explainable AI, but they are computationally expensive to use, especially when features are dependent. Evaluating them requires approximating a large number of conditional expectations, either via Monte Carlo integration or regression. Until recently it has not been possible to fully exploit deep learning for the regression approach, because retraining for each conditional expectation takes too long. Tabular foundation models such as TabPFN overcome this computational hurdle by leveraging in-context learning, so each conditional expectation can be approximated without any re-training. In this paper, we compute Shapley values with multiple variants of TabPFN and compare their performance with state-of-the-art methods on both simulated and real datasets. In most cases, TabPFN yields the best performance; where it does not, it is only marginally worse than the best method, at a fraction of the runtime. We discuss further improvements and how tabular foundation models can be better adapted specifically for conditional Shapley value estimation.
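
The core trick, conditional expectations from an in-context regressor, can be sketched as below; it assumes a TabPFN-style model with the sklearn fit/predict interface (check the tabpfn package for exact class names) and uses a generic permutation sampler rather than the paper's full method.

```python
# Monte Carlo Shapley estimation where each conditional expectation comes
# from an in-context tabular regressor fitted without retraining overhead.
import numpy as np

def shapley_feature(model_predict, X_bg, x, j, make_regressor, n_perms=32):
    d = x.shape[0]
    rng = np.random.default_rng(0)
    y_bg = model_predict(X_bg)          # model outputs on background data

    def value(features):
        if not features:
            return float(y_bg.mean())   # empty coalition: unconditional mean
        # "In-context" fit: regress model output on the coalition's features,
        # then read off the conditional expectation at x's values.
        reg = make_regressor()          # e.g. tabpfn.TabPFNRegressor() (assumed)
        reg.fit(X_bg[:, features], y_bg)
        return float(reg.predict(x[features].reshape(1, -1))[0])

    total = 0.0
    for _ in range(n_perms):            # permutation sampling of coalitions
        perm = rng.permutation(d)
        pos = int(np.where(perm == j)[0][0])
        S = [int(k) for k in perm[:pos]]
        total += value(S + [j]) - value(S)
    return total / n_perms
```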

[286] Autoregressive Direct Preference Optimization

Masanari Oi, Mahiro Ukai, Masahiro Kaneko, Naoaki Okazaki, Nakamasa Inoue

Main category: cs.AI

TL;DR: ADPO reformulates DPO by explicitly introducing autoregressive modeling before applying the Bradley-Terry model, shifting summation outside the log-sigmoid function and distinguishing token length from feedback length.

DetailsMotivation: The authors identify a limitation in standard DPO where the autoregressive assumption is only considered after deriving the objective function, potentially limiting its effectiveness for aligning LLMs with human preferences.

Method: Revisit DPO’s theoretical foundations, explicitly introduce autoregressive modeling prior to applying the Bradley-Terry model, derive Autoregressive DPO (ADPO) with modified loss function that shifts summation outside log-sigmoid, and analyze two length measures: token length μ and feedback length μ’.

Result: ADPO provides a theoretically grounded reformulation of DPO that better incorporates autoregressive modeling, with a more elegant loss function formulation and clear distinction between token and feedback length measures.

Conclusion: ADPO offers an improved theoretical foundation for preference optimization in LLMs by properly integrating autoregressive assumptions and distinguishing between different length measures, potentially enhancing alignment with human preferences.

Abstract: Direct preference optimization (DPO) has emerged as a promising approach for aligning large language models (LLMs) with human preferences. However, the widespread reliance on the response-level Bradley-Terry (BT) model may limit its full potential, as the reference and learnable models are assumed to be autoregressive only after deriving the objective function. Motivated by this limitation, we revisit the theoretical foundations of DPO and propose a novel formulation that explicitly introduces the autoregressive assumption prior to applying the BT model. By reformulating and extending DPO, we derive a novel variant, termed Autoregressive DPO (ADPO), that explicitly integrates autoregressive modeling into the preference optimization framework. Without violating the theoretical foundations, the derived loss takes an elegant form: it shifts the summation operation in the DPO objective outside the log-sigmoid function. Furthermore, through theoretical analysis of ADPO, we show that there exist two length measures to be considered when designing DPO-based algorithms: the token length $\mu$ and the feedback length $\mu'$. To the best of our knowledge, we are the first to explicitly distinguish these two measures and analyze their implications for preference optimization in LLMs.
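
The structural difference the abstract describes can be illustrated directly; the per-token alignment of chosen and rejected sequences below is a simplifying assumption, so treat it as the shape of the two losses, not the paper's exact derivation.

```python
# DPO sums token log-ratios inside one log-sigmoid; ADPO (as described) moves
# the summation outside, giving one sigmoid term per token position.
import torch
import torch.nn.functional as F

def dpo_loss(lp_w, lp_l, ref_w, ref_l, beta=0.1):
    # lp_*: per-token log-probs of chosen (w) / rejected (l) responses.
    margin = (lp_w - ref_w).sum() - (lp_l - ref_l).sum()
    return -F.logsigmoid(beta * margin)

def adpo_loss(lp_w, lp_l, ref_w, ref_l, beta=0.1):
    # Summation outside the log-sigmoid (assumes aligned token positions).
    margins = (lp_w - ref_w) - (lp_l - ref_l)
    return -F.logsigmoid(beta * margins).sum()

t = torch.randn(5)
print(dpo_loss(t, t - 0.2, t * 0, t * 0), adpo_loss(t, t - 0.2, t * 0, t * 0))
```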

[287] Detecting radar targets swarms in range profiles with a partially complex-valued neural network

Martin Bauw

Main category: cs.AI

TL;DR: Paper proposes partially complex-valued neural networks for radar target detection in range profiles with multiple proximate targets and distorted echoes, comparing it to traditional pulse compression methods.

DetailsMotivation: Radar target detection faces challenges from clutter, waveform distortion, and especially target proximity where multiple targets can be perceived as one or influence each other's detection thresholds. Traditional methods struggle with these issues, particularly with varying target proximity and distorted echoes.

Method: Uses partially complex-valued neural networks as adaptive range profile processing. The neural network is a generative architecture that processes the entire received signal at once to generate a complete detection profile, unlike pulse compression which processes one pulse length at a time. Simulated datasets are generated for experiments comparing the neural network approach with common pulse compression.

Result: The paper presents experimental comparisons between the proposed neural network approach and traditional pulse compression methods, though specific quantitative results are not provided in the abstract.

Conclusion: Partially complex-valued neural networks offer a promising alternative to traditional pulse compression for radar target detection, particularly for handling multiple proximate targets and distorted echoes by processing entire signals holistically rather than piecewise.

Abstract: Correctly detecting radar targets is usually challenged by clutter and waveform distortion. An additional difficulty stems from the relative proximity of several targets, the latter being perceived as a single target in the worst case, or influencing each other’s detection thresholds. The negative impact of target proximity notably depends on the range resolution defined by the radar parameters and the adaptive threshold adopted. This paper addresses the matter of target detection in radar range profiles containing multiple targets with varying proximity and distorted echoes. Inspired by recent contributions in the radar and signal processing literature, this work proposes partially complex-valued neural networks as an adaptive range-profile processing stage. Simulated datasets are generated and experiments are conducted to compare a common pulse compression approach with a simple neural network partially defined by complex-valued parameters. Whereas pulse compression processes one pulse length at a time, the proposed neural network is a generative architecture that processes the entire received signal in one pass to generate a complete detection profile.
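
The complex-valued ingredient is straightforward to sketch: a complex convolution built from two real convolutions via (a+ib)(c+id) = (ac-bd) + i(ad+bc), followed by a real-valued head on magnitudes. The overall architecture below is a guess at the general shape, not the paper's network.

```python
# "Partially complex-valued" block: complex conv over raw I/Q range profiles,
# then a real-valued detection head on the magnitudes.
import torch
import torch.nn as nn

class ComplexConv1d(nn.Module):
    def __init__(self, cin, cout, k, padding=0):
        super().__init__()
        self.re = nn.Conv1d(cin, cout, k, padding=padding, bias=False)
        self.im = nn.Conv1d(cin, cout, k, padding=padding, bias=False)

    def forward(self, z):  # z: complex tensor (batch, cin, range_bins)
        a, b = z.real, z.imag
        return torch.complex(self.re(a) - self.im(b), self.re(b) + self.im(a))

class PartiallyComplexDetector(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        self.cconv = ComplexConv1d(1, hidden, 9, padding=4)  # complex stage
        self.head = nn.Conv1d(hidden, 1, 1)                  # real stage

    def forward(self, iq):
        return torch.sigmoid(self.head(self.cconv(iq).abs()))  # per-bin profile

x = torch.randn(2, 1, 256, dtype=torch.cfloat)
print(PartiallyComplexDetector()(x).shape)  # torch.Size([2, 1, 256])
```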

[288] FLINGO – Instilling ASP Expressiveness into Linear Integer Constraints

Jorge Fandinno, Pedro Cabalar, Philipp Wanko, Torsten Schaub

Main category: cs.AI

TL;DR: FLINGO language extends CASP with ASP-like expressiveness for numerical constraints, allowing default values, undefined attributes, non-deterministic assignments, and aggregated values within constraint specifications.

DetailsMotivation: Current CASP solvers represent numerical constraints in ways that lose important ASP features like default values, undefined attributes, non-deterministic assignments, and aggregated values. There's a need to bridge this expressiveness gap between ASP and CASP.

Method: Developed FLINGO language that incorporates ASP-like expressiveness inside numerical constraints, with a translation from FLINGO syntax to regular CASP programs following the CLINGCON input format.

Result: Created a tool that enables richer specification of numerical constraints in CASP while maintaining ASP features, demonstrated through several examples.

Conclusion: FLINGO successfully bridges the expressiveness gap between ASP and CASP, allowing users to leverage ASP features while working with numerical constraints in hybrid programming.

Abstract: Constraint Answer Set Programming (CASP) is a hybrid paradigm that enriches Answer Set Programming (ASP) with numerical constraint processing, something required in many real-world applications. The usual specification of constraints in most CASP solvers is closer to the numerical back-end expressiveness and semantics, rather than to standard specification in ASP. In the latter, numerical attributes are represented with predicates and this allows declaring default values, leaving the attribute undefined, making non-deterministic assignments with choice rules or using aggregated values. In CASP, most (if not all) of these features are lost once we switch to a constraint-based representation of those same attributes. In this paper, we present the FLINGO language (and tool) that incorporates the aforementioned expressiveness inside the numerical constraints and we illustrate its use with several examples. Based on previous work that established its semantic foundations, we also present a translation from the newly introduced FLINGO syntax to regular CASP programs following the CLINGCON input format.

[289] ClinAlign: Scaling Healthcare Alignment from Clinician Preference

Shiwei Lyu, Xidong Wang, Lei Liu, Hao Zhu, Chaohe Zhang, Jian Wang, Jinjie Gu, Benyou Wang, Yue Shen

Main category: cs.AI

TL;DR: A two-stage framework for aligning LLMs with clinician preferences using physician-verified rubrics distilled into reusable clinical principles, achieving state-of-the-art performance on medical benchmarks with efficient inference.

DetailsMotivation: LLMs show medical knowledge but lack alignment with fine-grained clinician preferences; existing methods use coarse objectives or unreliable automated judges not grounded in professional guidelines.

Method: Two-stage framework: 1) HealthRubrics dataset of 7,034 physician-verified preference examples where clinicians refine LLM-drafted rubrics; 2) Distill rubrics into HealthPrinciples - 119 reusable, clinically grounded principles organized by clinical dimensions for scalable supervision.

Result: A 30B parameter model activating only 3B parameters at inference achieves 33.4% on HealthBench-Hard, outperforming larger models like Deepseek-R1 and o3, establishing resource-efficient clinical alignment baseline.

Conclusion: The framework enables scalable supervision beyond manual annotation through reusable clinical principles, achieving strong performance with efficient inference for clinical LLM alignment.

Abstract: Although large language models (LLMs) demonstrate expert-level medical knowledge, aligning their open-ended outputs with fine-grained clinician preferences remains challenging. Existing methods often rely on coarse objectives or unreliable automated judges that are weakly grounded in professional guidelines. We propose a two-stage framework to address this gap. First, we introduce HealthRubrics, a dataset of 7,034 physician-verified preference examples in which clinicians refine LLM-drafted rubrics to meet rigorous medical standards. Second, we distill these rubrics into HealthPrinciples: 119 broadly reusable, clinically grounded principles organized by clinical dimensions, enabling scalable supervision beyond manual annotation. We use HealthPrinciples for (1) offline alignment by synthesizing rubrics for unlabeled queries and (2) an inference-time tool for guided self-revision. A 30B parameter model that activates only 3B parameters at inference trained with our framework achieves 33.4% on HealthBench-Hard, outperforming much larger models including Deepseek-R1 and o3, establishing a resource-efficient baseline for clinical alignment.
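
The inference-time tool reduces to retrieve-then-prepend; in this sketch the embedding function and principle texts are placeholders rather than the HealthPrinciples artifacts themselves.

```python
# Retrieve the top-k distilled principles by cosine similarity and prepend
# them to the prompt for guided self-revision.
import numpy as np

def top_k_principles(query_vec, principle_vecs, principles, k=3):
    sims = principle_vecs @ query_vec / (
        np.linalg.norm(principle_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return [principles[i] for i in np.argsort(-sims)[:k]]

def guided_prompt(query, embed, principles, principle_vecs):
    rules = top_k_principles(embed(query), principle_vecs, principles)
    bullets = "\n".join(f"- {r}" for r in rules)
    return (f"Follow these clinical principles when answering:\n{bullets}\n\n"
            f"Question: {query}")
```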

[290] GHS-TDA: A Synergistic Reasoning Framework Integrating Global Hypothesis Space with Topological Data Analysis

Jiaquan Zhang, Chaoning Zhang, Shuxu Chen, Xudong Wang, Zhenzhen Huang, Pengcheng Zheng, Shuai Yuan, Sheng Zheng, Qigan Sun, Jie Zou, Lik-Hang Lee, Yang Yang

Main category: cs.AI

TL;DR: GHS-TDA improves Chain-of-Thought reasoning by constructing a global hypothesis graph for error correction and using topological data analysis to extract stable reasoning structures, enhancing accuracy and robustness.

DetailsMotivation: Current CoT methods have two key limitations: 1) reasoning is sensitive to early errors that propagate without correction mechanisms, and 2) lack of structured analysis techniques leads to redundant reasoning and poor interpretability.

Method: GHS-TDA constructs a semantically enriched global hypothesis graph to aggregate and coordinate multiple reasoning paths, then applies topological data analysis using persistent homology to capture stable multi-scale structures and remove redundancy.

Result: The method achieves self-adaptive convergence, produces high-confidence interpretable reasoning paths, and consistently outperforms strong baselines in accuracy and robustness across multiple reasoning benchmarks.

Conclusion: By jointly leveraging reasoning diversity and topological stability, GHS-TDA addresses fundamental limitations of current CoT approaches and provides more reliable, interpretable reasoning.

Abstract: Chain-of-Thought (CoT) has been shown to significantly improve the reasoning accuracy of large language models (LLMs) on complex tasks. However, due to the autoregressive, step-by-step generation paradigm, existing CoT methods suffer from two fundamental limitations. First, the reasoning process is highly sensitive to early decisions: once an initial error is introduced, it tends to propagate and amplify through subsequent steps, while the lack of a global coordination and revision mechanism makes such errors difficult to correct, ultimately leading to distorted reasoning chains. Second, current CoT approaches lack structured analysis techniques for filtering redundant reasoning and extracting key reasoning features, resulting in unstable reasoning processes and limited interpretability. To address these issues, we propose GHS-TDA. GHS-TDA first constructs a semantically enriched global hypothesis graph to aggregate, align, and coordinate multiple candidate reasoning paths, thereby providing alternative global correction routes when local reasoning fails. It then applies topological data analysis based on persistent homology to capture stable multi-scale structures, remove redundancy and inconsistencies, and extract a more reliable reasoning skeleton. By jointly leveraging reasoning diversity and topological stability, GHS-TDA achieves self-adaptive convergence, produces high-confidence and interpretable reasoning paths, and consistently outperforms strong baselines in terms of both accuracy and robustness across multiple reasoning benchmarks.

[291] Symbolic Pattern Temporal Numeric Planning with Intermediate Conditions and Effects

Matteo Cardellini, Enrico Giunchiglia

Main category: cs.AI

TL;DR: Extends Symbolic Pattern Planning (SPP) to temporal planning with Intermediate Conditions and Effects (ICEs), where actions are durative and can overlap, with conditions/effects checked/applied during execution.

DetailsMotivation: To extend the SPP approach from numeric planning to temporal planning with ICEs, addressing the challenge of durative actions that can overlap and have intermediate conditions/effects during execution.

Method: Extends SPP to temporal planning with ICEs by encoding patterns (finite action sequences) in SMT formulas, handling durative actions with overlapping execution and intermediate conditions/effects that can be checked/applied at any time during action execution.

Result: Patty planner outperforms other planners in most temporal domains without ICEs, achieves comparable results with state-of-the-art search planners in domains with ICEs, and outperforms them in a novel real-world application domain.

Conclusion: The SPP approach can be successfully extended to temporal planning with ICEs, demonstrating strong performance across various temporal planning domains including those with intermediate conditions and effects.

Abstract: Recently, a Symbolic Pattern Planning (SPP) approach was proposed for numeric planning where a pattern (i.e., a finite sequence of actions) suggests a causal order between actions. The pattern is then encoded in an SMT formula whose models correspond to valid plans. If the suggestion by the pattern is inaccurate and no valid plan can be found, the pattern is extended until it contains the causal order of actions in a valid plan, making the approach complete. In this paper, we extend the SPP approach to the temporal planning with Intermediate Conditions and Effects (ICEs) fragment, where $(i)$ actions are durative (and thus can overlap over time) and have conditions/effects which can be checked/applied at any time during an action’s execution, and $(ii)$ one can specify plan conditions/effects that must be checked/applied at specific times during the plan execution. Experimental results show that our SPP planner Patty $(i)$ outperforms all other planners in the literature in the majority of temporal domains without ICEs, $(ii)$ obtains comparable results with the SoTA search planner for ICEs in literature domains with ICEs, and $(iii)$ outperforms the same planner in a novel domain based on a real-world application.
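
A tiny z3 sketch conveys the temporal flavor: two durative, overlapping actions with an intermediate condition checked partway through one action's execution. The actions and numbers are invented; the paper's SMT encoding of patterns is far richer.

```python
# Two durative actions with real-valued start times, plus an ICE-style
# constraint: at the midpoint of A, action B must already be executing.
from z3 import Real, Solver, sat

s = Solver()
start_a, start_b = Real("start_a"), Real("start_b")
dur_a, dur_b, horizon = 4.0, 3.0, 10.0
s.add(start_a >= 0, start_b >= 0,
      start_a + dur_a <= horizon, start_b + dur_b <= horizon)
mid_a = start_a + dur_a / 2
s.add(start_b <= mid_a, mid_a <= start_b + dur_b)  # intermediate condition
if s.check() == sat:
    print(s.model())  # one consistent schedule for the overlapping actions
```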

[292] Would a Large Language Model Pay Extra for a View? Inferring Willingness to Pay from Subjective Choices

Manon Reusens, Sofie Goethals, Toon Calders, David Martens

Main category: cs.AI

TL;DR: LLMs can estimate willingness-to-pay for travel choices but show systematic deviations from human benchmarks, particularly overestimating values with expensive options or business personas.

DetailsMotivation: As LLMs are increasingly deployed for subjective decision support (like travel assistance), there's a need to understand how they make choices when no objectively correct answer exists, and whether their decision-making aligns with human preferences.

Method: Presented LLMs with travel choice dilemmas, analyzed responses using multinomial logit models to derive implied willingness-to-pay (WTP) estimates, compared to human benchmarks from economics literature. Tested variations including information about users’ past choices and persona-based prompting.

Result: Larger LLMs can produce meaningful WTP values but show systematic attribute-level deviations. They tend to overestimate human WTP overall, especially with expensive options or business-oriented personas. Conditioning on prior preferences for cheaper options yields valuations closer to human benchmarks.

Conclusion: LLMs show both potential and limitations for subjective decision support. Careful model selection, prompt design, and user representation are crucial for practical deployment in applications like travel assistance.

Abstract: As Large Language Models (LLMs) are increasingly deployed in applications such as travel assistance and purchasing support, they are often required to make subjective choices on behalf of users in settings where no objectively correct answer exists. We study LLM decision-making in a travel-assistant context by presenting models with choice dilemmas and analyzing their responses using multinomial logit models to derive implied willingness to pay (WTP) estimates. These WTP values are subsequently compared to human benchmark values from the economics literature. In addition to a baseline setting, we examine how model behavior changes under more realistic conditions, including the provision of information about users’ past choices and persona-based prompting. Our results show that while meaningful WTP values can be derived for larger LLMs, they also display systematic deviations at the attribute level. Additionally, they tend to overestimate human WTP overall, particularly when expensive options or business-oriented personas are introduced. Conditioning models on prior preferences for cheaper options yields valuations that are closer to human benchmarks. Overall, our findings highlight both the potential and the limitations of using LLMs for subjective decision support and underscore the importance of careful model selection, prompt design, and user representation when deploying such systems in practice.
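
The mechanics of reading WTP off a multinomial logit are simple: with utility linear in price and attributes, the WTP for an attribute is -b_attr / b_price. The coefficients below are invented for illustration.

```python
# Deriving willingness to pay from fitted choice-model coefficients.
# With U = b_price * price + b_attr * attr + ..., WTP(attr) = -b_attr / b_price.
coefs = {"price": -0.012, "sea_view": 0.55, "breakfast": 0.31}  # invented

def willingness_to_pay(coefs, attribute):
    return -coefs[attribute] / coefs["price"]

for attr in ("sea_view", "breakfast"):
    print(f"WTP for {attr}: {willingness_to_pay(coefs, attr):.2f} EUR")
# e.g. 0.55 / 0.012 ~= 45.8 EUR for a sea view under these toy coefficients
```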

[293] Efficient Unsupervised Environment Design through Hierarchical Policy Representation Learning

Dexun Li, Sidney Tio, Pradeep Varakantham

Main category: cs.AI

TL;DR: Hierarchical MDP framework for environment design with teacher agent using student policy representations and generative model augmentation to reduce teacher-student interactions in resource-constrained scenarios.

DetailsMotivation: Unsupervised Environment Design (UED) methods assume infinite environment generation, which is impractical in resource-constrained scenarios with limited teacher-student interaction opportunities.

Method: Hierarchical MDP framework with teacher agent leveraging student policy representations from discovered evaluation environments. Incorporates generative model to augment teacher’s training dataset with synthetic data, reducing need for teacher-student interactions.

Result: Method outperforms baseline approaches while requiring fewer teacher-student interactions in a single episode across several domains.

Conclusion: The approach is applicable in settings where training opportunities are limited, offering efficient curriculum generation for agent development.

Abstract: Unsupervised Environment Design (UED) has emerged as a promising approach to developing general-purpose agents through automated curriculum generation. Popular UED methods focus on Open-Endedness, where teacher algorithms rely on stochastic processes for infinite generation of useful environments. This assumption becomes impractical in resource-constrained scenarios where teacher-student interaction opportunities are limited. To address this challenge, we introduce a hierarchical Markov Decision Process (MDP) framework for environment design. Our framework features a teacher agent that leverages student policy representations derived from discovered evaluation environments, enabling it to generate training environments based on the student’s capabilities. To improve efficiency, we incorporate a generative model that augments the teacher’s training dataset with synthetic data, reducing the need for teacher-student interactions. In experiments across several domains, we show that our method outperforms baseline approaches while requiring fewer teacher-student interactions in a single episode. The results suggest the applicability of our approach in settings where training opportunities are limited.

[294] Why Do AI Agents Systematically Fail at Cloud Root Cause Analysis?

Taeyoon Kim, Woohyeok Park, Hoyeong Yun, Kyungyong Lee

Main category: cs.AI

TL;DR: Analysis of LLM-based RCA agents reveals 12 failure types across reasoning, communication, and environment interaction, showing persistent pitfalls that require architectural improvements beyond prompt engineering.

DetailsMotivation: Existing LLM-based RCA agents for cloud systems have low detection accuracy, and current evaluations only assess final answers without understanding why reasoning fails. There's a need to systematically analyze process-level failures to improve agent reliability.

Method: Executed full OpenRCA benchmark across five LLM models, producing 1,675 agent runs. Classified observed failures into 12 pitfall types across three categories: intra-agent reasoning, inter-agent communication, and agent-environment interaction. Conducted controlled mitigation experiments.

Result: Most prevalent pitfalls (hallucinated data interpretation and incomplete exploration) persist across all models regardless of capability tier, indicating failures originate from shared agent architecture rather than model limitations. Prompt engineering alone cannot resolve dominant pitfalls, but enriching inter-agent communication protocol reduces communication-related failures by up to 15 percentage points.

Conclusion: The pitfall taxonomy and diagnostic methodology provide foundation for designing more reliable autonomous agents for cloud RCA. Architectural improvements are needed beyond just better models or prompts.

Abstract: Failures in large-scale cloud systems incur substantial financial losses, making automated Root Cause Analysis (RCA) essential for operational stability. Recent efforts leverage Large Language Model (LLM) agents to automate this task, yet existing systems exhibit low detection accuracy even with capable models, and current evaluation frameworks assess only final answer correctness without revealing why the agent’s reasoning failed. This paper presents a process level failure analysis of LLM-based RCA agents. We execute the full OpenRCA benchmark across five LLM models, producing 1,675 agent runs, and classify observed failures into 12 pitfall types across intra-agent reasoning, inter-agent communication, and agent-environment interaction. Our analysis reveals that the most prevalent pitfalls, notably hallucinated data interpretation and incomplete exploration, persist across all models regardless of capability tier, indicating that these failures originate from the shared agent architecture rather than from individual model limitations. Controlled mitigation experiments further show that prompt engineering alone cannot resolve the dominant pitfalls, whereas enriching the inter-agent communication protocol reduces communication-related failures by up to 15 percentage points. The pitfall taxonomy and diagnostic methodology developed in this work provide a foundation for designing more reliable autonomous agents for cloud RCA.

[295] Closing Reasoning Gaps in Clinical Agents with Differential Reasoning Learning

Jinsong Liu, Yuhang Jiang, Ramayya Krishnan, Rema Padman, Yiye Zhang, Jiang Bian

Main category: cs.AI

TL;DR: DRL framework improves clinical AI agents by analyzing reasoning discrepancies between reference rationales and agent’s chain-of-thought, using graph-based discrepancy analysis and retrieval-augmented instruction to patch logic gaps.

DetailsMotivation: Clinical decision support requires not just correct answers but clinically valid reasoning. Current AI agents may produce correct answers with flawed reasoning, which is unacceptable in healthcare where reasoning transparency and validity are crucial for trust and safety.

Method: Differential Reasoning Learning (DRL) extracts reasoning graphs as DAGs from reference rationales and agent’s chain-of-thought, performs clinically weighted graph edit distance analysis, uses LLM-as-judge to align nodes and diagnose discrepancies, stores diagnostics in Differential Reasoning Knowledge Base, and retrieves top-k instructions via RAG to augment prompts at inference.

Result: Evaluation on medical QA benchmarks and Return Visit Admissions prediction shows gains over baselines in both answer accuracy and reasoning fidelity. Ablation studies confirm benefits from reference rationales and top-k retrieval strategy. Clinician review provides assurance of approach validity.

Conclusion: DRL supports more reliable clinical decision-making in complex reasoning scenarios and offers practical deployment mechanism under limited token budgets by improving reasoning fidelity alongside answer accuracy.

Abstract: Clinical decision support requires not only correct answers but also clinically valid reasoning. We propose Differential Reasoning Learning (DRL), a framework that improves clinical agents by learning from reasoning discrepancies. From reference reasoning rationales (e.g., physician-authored clinical rationale, clinical guidelines, or outputs from more capable models) and the agent’s free-form chain-of-thought (CoT), DRL extracts reasoning graphs as directed acyclic graphs (DAGs) and performs a clinically weighted graph edit distance (GED)-based discrepancy analysis. An LLM-as-a-judge aligns semantically equivalent nodes and diagnoses discrepancies between graphs. These graph-level discrepancy diagnostics are converted into natural-language instructions and stored in a Differential Reasoning Knowledge Base (DR-KB). At inference, we retrieve top-$k$ instructions via Retrieval-Augmented Generation (RAG) to augment the agent prompt and patch likely logic gaps. Evaluation on open medical question answering (QA) benchmarks and a Return Visit Admissions (RVA) prediction task from internal clinical data demonstrates gains over baselines, improving both final-answer accuracy and reasoning fidelity. Ablation studies confirm gains from infusing reference reasoning rationales and the top-$k$ retrieval strategy. Clinicians’ review of the output provides further assurance of the approach. Together, results suggest that DRL supports more reliable clinical decision-making in complex reasoning scenarios and offers a practical mechanism for deployment under limited token budgets.
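
The discrepancy analysis can be sketched with off-the-shelf graph edit distance; the clinical weighting and LLM-based node alignment are omitted here, and the node labels are invented.

```python
# Reasoning steps become labeled DAG nodes; graph edit distance flags where
# the agent's chain of thought diverges from the reference rationale.
import networkx as nx

def reasoning_graph(edges):
    g = nx.DiGraph()
    for u, v in edges:
        g.add_node(u, label=u)
        g.add_node(v, label=v)
        g.add_edge(u, v)
    return g

reference = reasoning_graph([("fever", "suspect infection"),
                             ("suspect infection", "order labs")])
agent = reasoning_graph([("fever", "order labs")])  # skipped a step

ged = nx.graph_edit_distance(
    reference, agent, node_match=lambda a, b: a["label"] == b["label"])
print(f"graph edit distance: {ged}")  # > 0 signals a reasoning gap to patch
```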

[296] ESTAR: Early-Stopping Token-Aware Reasoning For Efficient Inference

Junda Wang, Zhichao Yang, Dongxu Zhang, Sanjit Singh Batra, Robert E. Tillman

Main category: cs.AI

TL;DR: ESTAR introduces early stopping for large reasoning models to reduce redundant computation by detecting when correct answers are reached, using trajectory classification, supervised fine-tuning for self-generated stop signals, and stop-aware reinforcement learning.

DetailsMotivation: Large reasoning models waste computation on redundant reasoning after reaching correct answers, creating inefficiency in chain-of-thought reasoning processes.

Method: Combines a trajectory-based classifier to identify safe stopping points, supervised fine-tuning to teach models to generate stop signals, and stop-aware reinforcement learning with compute-aware rewards that truncates rollouts at self-generated stop points.

Result: Reduces reasoning length by ~3.7x (from 4,799 to 1,290 tokens) while preserving accuracy (74.9% vs. 74.2%) across four reasoning datasets, with strong cross-domain generalization.

Conclusion: Early stopping is a simple yet powerful mechanism for improving reasoning efficiency in large reasoning models without sacrificing accuracy.

Abstract: Large reasoning models (LRMs) achieve state-of-the-art performance by generating long chains-of-thought, but often waste computation on redundant reasoning after the correct answer has already been reached. We introduce Early-Stopping for Token-Aware Reasoning (ESTAR), which detects and reduces such reasoning redundancy to improve efficiency without sacrificing accuracy. Our method combines (i) a trajectory-based classifier that identifies when reasoning can be safely stopped, (ii) supervised fine-tuning to teach LRMs to propose self-generated stop signals, and (iii) stop-aware reinforcement learning that truncates rollouts at self-generated stop points with compute-aware rewards. Experiments on four reasoning datasets show that ESTAR reduces reasoning length by about 3.7x (from 4,799 to 1,290 tokens) while preserving accuracy (74.9% vs. 74.2%), with strong cross-domain generalization. These results highlight early stopping as a simple yet powerful mechanism for improving reasoning efficiency in LRMs.
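
The trajectory-classifier ingredient might look like the following; the features, toy training data, and stopping threshold are all invented for illustration.

```python
# Crude features of a partial chain of thought feed a binary classifier that
# predicts whether the current answer would survive further reasoning.
import numpy as np
from sklearn.linear_model import LogisticRegression

def featurize(partial_cot: str) -> np.ndarray:
    return np.array([len(partial_cot.split()),    # length so far
                     partial_cot.count("answer"),  # answer statements
                     partial_cot.count("wait")])   # hedging/backtracking markers

# Toy training set: prefix features labeled 1 if the answer there is final.
X = np.array([[120, 1, 0], [600, 2, 3], [90, 0, 0], [800, 3, 5]])
y = np.array([1, 1, 0, 1])
clf = LogisticRegression().fit(X, y)

def should_stop(partial_cot: str, threshold=0.9) -> bool:
    p = clf.predict_proba(featurize(partial_cot).reshape(1, -1))[0, 1]
    return p >= threshold
```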

[297] Discovering High Level Patterns from Simulation Traces

Sean Memery, Kartic Subr

Main category: cs.AI

TL;DR: A method to extract coarse-grained physics patterns from simulation logs using natural language guidance, enabling better physical reasoning for language models.

DetailsMotivation: Language models struggle with physics tasks because they learn from observational data rather than being grounded in simulation. Existing approaches using simulation traces as context suffer from poor scalability due to large volumes of fine-grained data.

Method: Propose a natural language guided method to discover coarse-grained patterns (e.g., ‘rigid-body collision’, ‘stable support’) from detailed simulation logs. Synthesize programs that operate on simulation logs and map them to high-level activated patterns.

Result: The annotated representation of simulation logs is more amenable to natural language reasoning about physical systems. Enables LMs to generate effective reward programs from natural language goals for use in planning or supervised learning.

Conclusion: The method bridges the gap between detailed simulation data and high-level natural language reasoning, improving LM capabilities for physics-based tasks through grounded pattern extraction.

Abstract: Artificial intelligence (AI) agents embedded in environments with physics-based interaction face many challenges including reasoning, planning, summarization, and question answering. This problem is exacerbated when a human user wishes to either guide or interact with the agent in natural language. Although Language Models (LMs) are the default choice of AI tool, they struggle with tasks involving physics. The LM’s capability for physical reasoning is learned from observational data, rather than being grounded in simulation. A common approach is to include simulation traces as context, but this suffers from poor scalability as simulation traces contain large volumes of fine-grained numerical and semantic data. In this paper, we propose a natural language guided method to discover coarse-grained patterns (e.g., ‘rigid-body collision’, ‘stable support’, etc.) from detailed simulation logs. Specifically, we synthesize programs that operate on simulation logs and map them to a series of high level activated patterns. We show, through two physics benchmarks, that this annotated representation of the simulation log is more amenable to natural language reasoning about physical systems. We demonstrate how this method enables LMs to generate effective reward programs from goals specified in natural language, which may be used within the context of planning or supervised learning.
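
One synthesized "pattern program" of the kind described could look like this predicate over raw log rows; the log schema and thresholds are hypothetical.

```python
# A coarse-grained pattern detector over a simulation log: flags time steps
# where two bodies are in contact and their relative velocity jumps, raising
# the high-level pattern 'rigid-body collision'.
def detect_collision(log, dv_threshold=2.0, gap_threshold=0.05):
    """log: list of dicts with keys t, pos_a, pos_b, vel_a, vel_b (floats)."""
    events = []
    for prev, cur in zip(log, log[1:]):
        close = abs(cur["pos_a"] - cur["pos_b"]) < gap_threshold
        dv = abs((cur["vel_a"] - prev["vel_a"]) - (cur["vel_b"] - prev["vel_b"]))
        if close and dv > dv_threshold:
            events.append({"pattern": "rigid-body collision", "t": cur["t"]})
    return events  # coarse-grained annotations an LM can reason over
```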

[298] Chain of Mindset: Reasoning with Adaptive Cognitive Modes

Tianyi Jiang, Arctanx An, Hengyi Feng, Naixin Zhai, Haodong Li, Xiaomin Yu, Jiahui Liu, Hanwen Du, Shuo Zhang, Zhi Yang, Jie Huang, Yuhua Li, Yongxin Ni, Huacan Wang, Ronghao Chen

Main category: cs.AI

TL;DR: CoM is a training-free agentic framework that enables step-level adaptive mindset orchestration for LLM reasoning, using four specialized mindsets dynamically selected by a Meta-Agent.

DetailsMotivation: Current LLM reasoning methods apply the same fixed mindset across all steps, overlooking that different stages of solving the same problem require fundamentally different mindsets, preventing models from reaching higher intelligence levels.

Method: Chain of Mindset (CoM) decomposes reasoning into four functionally heterogeneous mindsets (Spatial, Convergent, Divergent, Algorithmic) with a Meta-Agent dynamically selecting optimal mindset based on evolving reasoning state, and a bidirectional Context Gate filtering cross-module information flow.

Result: CoM achieves state-of-the-art performance across six challenging benchmarks (mathematics, code generation, scientific QA, spatial reasoning), outperforming strongest baselines by 4.96% and 4.72% in overall accuracy on Qwen3-VL-32B-Instruct and Gemini-2.0-Flash while balancing reasoning efficiency.

Conclusion: The CoM framework enables step-level adaptive mindset orchestration, addressing the limitation of fixed-mindset reasoning in LLMs and demonstrating superior performance across diverse reasoning tasks.

Abstract: Human problem-solving is never the repetition of a single mindset, by which we mean a distinct mode of cognitive processing. When tackling a specific task, we do not rely on a single mindset; instead, we integrate multiple mindsets within the single solution process. However, existing LLM reasoning methods fall into a common trap: they apply the same fixed mindset across all steps, overlooking that different stages of solving the same problem require fundamentally different mindsets. This single-minded assumption prevents models from reaching the next level of intelligence. To address this limitation, we propose Chain of Mindset (CoM), a training-free agentic framework that enables step-level adaptive mindset orchestration. CoM decomposes reasoning into four functionally heterogeneous mindsets: Spatial, Convergent, Divergent, and Algorithmic. A Meta-Agent dynamically selects the optimal mindset based on the evolving reasoning state, while a bidirectional Context Gate filters cross-module information flow to maintain effectiveness and efficiency. Experiments across six challenging benchmarks spanning mathematics, code generation, scientific QA, and spatial reasoning demonstrate that CoM achieves state-of-the-art performance, outperforming the strongest baseline by 4.96% and 4.72% in overall accuracy on Qwen3-VL-32B-Instruct and Gemini-2.0-Flash, while balancing reasoning efficiency. Our code is publicly available at https://github.com/QuantaAlpha/chain-of-mindset.
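
The orchestration skeleton is easy to sketch; the meta-agent and mindset modules below are placeholders for the paper's components, not its implementation.

```python
# Step-level mindset orchestration: a meta-agent picks one of four mindset
# modules per step from the evolving reasoning state.
MINDSETS = ("spatial", "convergent", "divergent", "algorithmic")

def chain_of_mindset(problem, meta_agent, modules, max_steps=12):
    state = {"problem": problem, "trace": []}
    for _ in range(max_steps):
        mindset = meta_agent(state)            # returns one name from MINDSETS
        step_output = modules[mindset](state)  # run the chosen mindset module
        state["trace"].append((mindset, step_output))
        if step_output.get("final_answer") is not None:
            return step_output["final_answer"], state["trace"]
    return None, state["trace"]
```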

[299] CODE-SHARP: Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs

Richard Bornemann, Pierluigi Vito Amadori, Antoine Cully

Main category: cs.AI

TL;DR: CODE-SHARP uses foundation models to open-endedly discover hierarchical skills as executable reward programs, enabling agents to solve complex long-horizon tasks without predefined rewards.

DetailsMotivation: Current reinforcement learning relies on hand-designed reward functions, which is infeasible for open-ended skill discovery where meaningful skills are not known beforehand. Existing methods only refine rewards for pre-defined tasks, limiting autonomous skill discovery.

Method: Leverages foundation models to open-endedly expand and refine a hierarchical skill archive structured as a directed graph of executable reward functions in code. Uses goal-conditioned agents trained on discovered reward programs and high-level FM-based planning.

Result: The approach enables agents to solve increasingly long-horizon goals in Craftax environment. When composed by high-level planning, it outperforms pretrained agents and task-specific expert policies by over 134% on average for complex, long-horizon tasks.

Conclusion: CODE-SHARP demonstrates that foundation models can enable open-ended skill discovery through hierarchical reward programs, advancing autonomous agent learning beyond predefined tasks.

Abstract: Developing agents capable of open-endedly discovering and learning novel skills is a grand challenge in Artificial Intelligence. While reinforcement learning offers a powerful framework for training agents to master complex skills, it typically relies on hand-designed reward functions. This is infeasible for open-ended skill discovery, where the set of meaningful skills is not known a priori. While recent methods have shown promising results towards automating reward function design, they remain limited to refining rewards for pre-defined tasks. To address this limitation, we introduce Continuous Open-ended Discovery and Evolution of Skills as Hierarchical Reward Programs (CODE-SHARP), a novel framework leveraging Foundation Models (FM) to open-endedly expand and refine a hierarchical skill archive, structured as a directed graph of executable reward functions in code. We show that a goal-conditioned agent trained exclusively on the rewards generated by the discovered SHARP skills learns to solve increasingly long-horizon goals in the Craftax environment. When composed by a high-level FM-based planner, the discovered skills enable a single goal-conditioned agent to solve complex, long-horizon tasks, outperforming both pretrained agents and task-specific expert policies by over 134% on average. We will open-source our code and provide additional videos at https://sites.google.com/view/code-sharp/homepage.
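
The archive structure in miniature: skill nodes carrying executable reward functions, with edges recording which simpler skills a composite builds on. The skill names and reward rules are invented, Craftax-flavored examples.

```python
# A hierarchical skill archive as a directed graph of executable reward
# functions over environment states.
import networkx as nx

archive = nx.DiGraph()

def add_skill(name, reward_fn, parents=()):
    archive.add_node(name, reward=reward_fn)
    for p in parents:
        archive.add_edge(p, name)  # composite skill builds on parent skill

add_skill("collect_wood", lambda s: float(s.get("wood", 0) > 0))
add_skill("craft_table",
          lambda s: float(s.get("table", 0) > 0),
          parents=["collect_wood"])

state = {"wood": 2, "table": 0}
for skill in nx.topological_sort(archive):       # simple skills first
    print(skill, archive.nodes[skill]["reward"](state))
```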

[300] Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han, Zhewei Yao, Huaxiu Yao, Yuxiong He

Main category: cs.AI

TL;DR: AWM is a synthetic environment generation pipeline that creates 1,000 code-driven environments for training tool-use agents, enabling scalable RL training with reliable state transitions and reward functions.

DetailsMotivation: Current autonomous agent training is limited by lack of diverse, reliable environments. Real-world data collection is expensive, and LLM-simulated environments are inconsistent. Need scalable synthetic environments for multi-turn tool-use agent training.

Method: Proposes Agent World Model (AWM) pipeline that generates fully synthetic, code-driven environments backed by databases. Creates 1,000 environments covering everyday scenarios with rich toolsets (avg 35 tools per environment). Environments provide reliable state transitions and enable efficient agent interaction.

Result: Enables large-scale reinforcement learning for multi-turn tool-use agents with reliable reward functions. Training exclusively in synthetic environments yields strong out-of-distribution generalization on three benchmarks.

Conclusion: AWM provides scalable synthetic environments for training autonomous agents, overcoming limitations of real-world data collection and LLM-simulated environments. Enables effective RL training with strong generalization capabilities.

Abstract: Recent advances in large language model (LLM) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets (35 tools per environment on average) and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at https://github.com/Snowflake-Labs/agent-world-model.
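
A minimal shape of a "code-driven environment backed by a database" is shown below: tool calls mutate a SQLite store, so state transitions and reward checks are deterministic and inspectable. The schema and tools are invented examples, not AWM's generated environments.

```python
# A toy tool-use environment whose state lives in SQLite; the reward is read
# directly from database state rather than judged by an LLM.
import sqlite3

class ShopEnv:
    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute("CREATE TABLE orders (item TEXT, qty INTEGER)")

    def place_order(self, item: str, qty: int) -> str:  # one tool of many
        self.db.execute("INSERT INTO orders VALUES (?, ?)", (item, qty))
        return f"ordered {qty} x {item}"

    def reward(self, goal_item: str) -> float:
        row = self.db.execute(
            "SELECT SUM(qty) FROM orders WHERE item = ?", (goal_item,)).fetchone()
        return 1.0 if (row[0] or 0) > 0 else 0.0

env = ShopEnv()
print(env.place_order("usb-c cable", 2), env.reward("usb-c cable"))
```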

[301] Among Us: A Sandbox for Measuring and Detecting Agentic Deception

Satvik Golechha, Adrià Garriga-Alonso

Main category: cs.AI

TL;DR: LLM agents exhibit long-term deceptive behavior in Among Us game sandbox; RL-trained models better at deception than detection; probes on activations achieve >95% AUROC for deception detection

DetailsMotivation: Prior studies on AI deception focus on binary choices or false statements rather than open-ended deceptive behavior emerging from long-term goals. Need for better evaluation of deceptive capabilities in language-based agents.

Method: Created Among Us sandbox social deception game where LLM agents pursue long-term goals requiring deception. Evaluated 18 proprietary and open-weight LLMs. Used logistic regression on activations and sparse autoencoders (SAEs) for deception detection.

Result: Models trained with RL are much better at producing deception than detecting it. Probes trained on “pretend you’re dishonest” data generalize extremely well (AUROCs >95%) even on deceptive statements alone. Found SAE features that detect deception but can’t reduce lying.

Conclusion: Among Us sandbox enables study of long-term deceptive behavior in LLM agents. RL-trained models show strong deceptive capabilities. Effective deception detection methods exist but don’t reduce lying behavior. Open-sourced resources aim to anticipate/mitigate deceptive AI.

Abstract: Prior studies on deception in language-based AI agents typically assess whether the agent produces a false statement about a topic, or makes a binary choice prompted by a goal, rather than allowing open-ended deceptive behavior to emerge in pursuit of a longer-term goal. To fix this, we introduce Among Us, a sandbox social deception game where LLM agents exhibit long-term, open-ended deception as a consequence of the game objectives. While most benchmarks saturate quickly, Among Us can be expected to last much longer, because it is a multi-player game far from equilibrium. Using the sandbox, we evaluate 18 proprietary and open-weight LLMs and uncover a general trend: models trained with RL are comparatively much better at producing deception than detecting it. We evaluate the effectiveness of methods to detect lying and deception: logistic regression on the activations and sparse autoencoders (SAEs). We find that probes trained on a dataset of “pretend you’re a dishonest model: ...” generalize extremely well out-of-distribution, consistently obtaining AUROCs over 95% even when evaluated just on the deceptive statement, without the chain of thought. We also find two SAE features that work well at deception detection but are unable to steer the model to lie less. We hope our open-sourced sandbox, game logs, and probes serve to anticipate and mitigate deceptive behavior and capabilities in language-based agents.

[302] GeoGramBench: Benchmarking the Geometric Program Reasoning in Modern LLMs

Shixian Luo, Zezhou Zhu, Yu Yuan, Yuncheng Yang, Lianlei Shan, Yong Wu

Main category: cs.AI

TL;DR: GeoGramBench is a benchmark for evaluating LLMs’ ability to translate programmatic drawing code into geometric reasoning, revealing significant deficiencies in current models’ spatial reasoning capabilities.

DetailsMotivation: The paper addresses the underexplored ability of LLMs to operate over geometric spatial information expressed in procedural code, highlighting a gap in program-driven spatial reasoning capabilities.

Method: Formalizes the Program-to-Geometry task and introduces GeoGramBench, a benchmark of 500 problems organized by a three-level taxonomy based on geometric complexity rather than mathematical reasoning complexity.
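
For intuition, here is a hypothetical benchmark item in the Program-to-Geometry spirit (invented for illustration, not taken from GeoGramBench): the model receives drawing code as text and must answer a geometric question without ever seeing the rendered image.

```python
# A hypothetical Program-to-Geometry item: drawing code in, geometric
# reasoning out. Field names and the taxonomy level are assumptions.
item = {
    "program": (
        "import matplotlib.pyplot as plt\n"
        "ax = plt.subplots()[1]\n"
        "ax.plot([0, 4, 4, 0], [0, 0, 3, 0])  # closed polyline\n"
        "ax.set_aspect('equal')\n"
    ),
    "question": "What is the length of the longest side of the drawn figure?",
    "answer": 5.0,   # legs 4 and 3 -> hypotenuse sqrt(4**2 + 3**2) = 5
    "level": 3,      # abstraction level in a three-level taxonomy
}
print(item["question"], "->", item["answer"])
```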

Result: Evaluation of 17 frontier LLMs reveals consistent deficiencies, with even the most advanced models achieving less than 50% accuracy at the highest abstraction level.

Conclusion: The work highlights unique challenges in program-driven spatial reasoning and establishes GeoGramBench as a valuable resource for advancing research in symbolic-to-spatial geometric reasoning.

Abstract: Geometric spatial reasoning forms the foundation of many applications in artificial intelligence, yet the ability of large language models (LLMs) to operate over geometric spatial information expressed in procedural code remains underexplored. In this paper, we address this gap by formalizing the Program-to-Geometry task, which challenges models to translate programmatic drawing code into accurate and abstract geometric reasoning. To evaluate this capability, we present GeoGramBench, a benchmark of 500 carefully refined problems organized by a tailored three-level taxonomy that considers geometric complexity rather than traditional mathematical reasoning complexity. Our comprehensive evaluation of 17 frontier LLMs reveals consistent and pronounced deficiencies: even the most advanced models achieve less than 50% accuracy at the highest abstraction level. These results highlight the unique challenges posed by program-driven spatial reasoning and establish GeoGramBench as a valuable resource for advancing research in symbolic-to-spatial geometric reasoning. Project page: https://github.com/LiAuto-DSR/GeoGramBench.

[303] Optimus-3: Dual-Router Aligned Mixture-of-Experts Agent with Dual-Granularity Reasoning-Aware Policy Optimization

Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Weili Guan, Dongmei Jiang, Yaowei Wang, Liqiang Nie

Main category: cs.AI

TL;DR: Optimus-3 is a generalist agent for Minecraft that integrates System 1 (reflexive execution) and System 2 (deliberative reasoning) capabilities through a unified framework with automated data generation, dual-router MoE architecture, and reasoning-aware policy optimization.

DetailsMotivation: Existing agents in visually rich environments like Minecraft suffer from fragmented cognitive abilities, lacking synergy between fast reflexive execution (System 1) and slow deliberative reasoning (System 2). The paper aims to develop a generalist agent that organically integrates these dual capabilities.

Method: Three key innovations: 1) Knowledge-Enhanced Automated Data Generation Pipeline to create System 2 reasoning traces from System 1 trajectories, 2) Dual-Router Aligned MoE Architecture with Task Router for parameter decoupling and Layer Router for dynamic reasoning depth, and 3) Dual-Granularity Reasoning-Aware Policy Optimization algorithm with process-outcome co-supervision.
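
A toy sketch of the dual-router idea: a Task Router mixes experts with task-specific weights (parameter decoupling), while a Layer Router gates whether the layer contributes at all (dynamic reasoning depth). Shapes and routing rules are assumptions for illustration, not the paper's architecture.

```python
# Toy dual-router MoE layer: Task Router for expert selection, Layer Router
# for run/skip gating ("fast path" vs. "deep path").
import torch
import torch.nn as nn

class DualRouterLayer(nn.Module):
    def __init__(self, dim=64, n_experts=4, n_tasks=3):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.task_router = nn.Embedding(n_tasks, n_experts)  # task -> expert logits
        self.layer_router = nn.Linear(dim, 1)                # state -> run/skip gate

    def forward(self, x, task_id):
        # Layer Router: a gate near 0 skips the layer (System 1 "fast path").
        run_gate = torch.sigmoid(self.layer_router(x.mean(dim=1)))      # (B, 1)
        # Task Router: mix experts with task-specific weights.
        w = torch.softmax(self.task_router(task_id), dim=-1)            # (B, E)
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)  # (B, T, D, E)
        mixed = (expert_out * w[:, None, None, :]).sum(-1)
        return x + run_gate[:, :, None] * mixed  # residual, gated by depth router

x = torch.randn(2, 5, 64)               # (batch, tokens, dim)
task_id = torch.tensor([0, 2])          # e.g., 0 = planning, 2 = captioning
print(DualRouterLayer()(x, task_id).shape)  # torch.Size([2, 5, 64])
```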

Result: Optimus-3 surpasses SOTA methods on both System 2 tasks (21% on Planning, 66% on Captioning, 76% on Embodied QA, 3.4× on Grounding, 18% on Reflection) and System 1 tasks (3% on Long-Horizon Action), achieving 60% success rate on open-ended tasks.

Conclusion: The paper presents a successful integration of System 1 and System 2 capabilities in a generalist agent for visually rich environments, demonstrating significant improvements across various reasoning and execution tasks in Minecraft.

Abstract: Developing generalist agents capable of solving open-ended tasks in visually rich, dynamic environments remains a core pursuit of embodied AI. While Minecraft has emerged as a compelling benchmark, existing agents often suffer from fragmented cognitive abilities, lacking the synergy between reflexive execution (System 1) and deliberative reasoning (System 2). In this paper, we introduce Optimus-3, a generalist agent that organically integrates these dual capabilities within a unified framework. To achieve this, we address three fundamental challenges. First, to overcome the scarcity of reasoning data, we propose a Knowledge-Enhanced Automated Data Generation Pipeline. It synthesizes high-quality System 2 reasoning traces from raw System 1 interaction trajectories, effectively mitigating hallucinations via injection of domain knowledge. We release the resulting dataset, OptimusM⁴, to the community. Second, to reconcile the dichotomous computational requirements of the dual systems, we design a Dual-Router Aligned MoE Architecture. It employs a Task Router to prevent task interference via parameter decoupling, and a Layer Router to dynamically modulate reasoning depth, creating a computational “Fast Path” for System 1 and a “Deep Path” for System 2. Third, to activate the reasoning capabilities of System 2, we propose the Dual-Granularity Reasoning-Aware Policy Optimization (DGRPO) algorithm. It enforces Process-Outcome Co-Supervision via dual-granularity dense rewards, ensuring consistency between the thought process and the answer. Extensive evaluations demonstrate that Optimus-3 surpasses existing state-of-the-art methods on both System 2 (21% on Planning, 66% on Captioning, 76% on Embodied QA, 3.4× on Grounding, and 18% on Reflection) and System 1 (3% on Long-Horizon Action) tasks, with a notable 60% success rate on open-ended tasks.

[304] STELAR-VISION: Self-Topology-Aware Efficient Learning for Aligned Reasoning in Vision

Chen Li, Han Zhang, Zhantao Yang, Fangyi Chen, Zihan Wang, Anudeepsekhar Bolimera, Marios Savvides

Main category: cs.AI

TL;DR: STELAR-Vision: A training framework for topology-aware reasoning in vision-language models that improves accuracy and efficiency by incorporating diverse reasoning structures beyond chain-of-thought.

DetailsMotivation: Current vision-language models struggle with complex multimodal tasks and generate verbose outputs due to over-reliance on chain-of-thought reasoning, despite many tasks benefiting from alternative topological structures like trees or graphs.

Method: Introduces STELAR-Vision framework with TopoAug synthetic data pipeline for diverse topological structures, uses supervised fine-tuning and reinforcement learning to post-train Qwen2VL models, and proposes Frugal Learning to reduce output length with minimal accuracy loss.

Result: Improves accuracy by 9.7% over base model, surpasses larger Qwen2VL-72B-Instruct by 7.3%, outperforms Phi-4-Multimodal-Instruct by up to 28.4% and LLaMA-3.2-11B-Vision-Instruct by up to 13.2% on OOD benchmarks, and achieves 4.3% higher overall accuracy than Chain-Only training.

Conclusion: STELAR-Vision demonstrates that incorporating diverse topological reasoning structures significantly improves vision-language model performance on complex multimodal tasks while maintaining efficiency through output length reduction.

Abstract: Vision-language models (VLMs) have made significant strides in reasoning, yet they often struggle with complex multimodal tasks and tend to generate overly verbose outputs. A key limitation is their reliance on chain-of-thought (CoT) reasoning, despite many tasks benefiting from alternative topologies like trees or graphs. To address this, we introduce STELAR-Vision, a training framework for topology-aware reasoning. At its core is TopoAug, a synthetic data pipeline that enriches training with diverse topological structures. Using supervised fine-tuning and reinforcement learning, we post-train Qwen2VL models with both accuracy and efficiency in mind. Additionally, we propose Frugal Learning, which reduces output length with minimal accuracy loss. On MATH-V and VLM-S2H, STELAR-Vision improves accuracy by 9.7% over its base model and surpasses the larger Qwen2VL-72B-Instruct by 7.3%. On five out-of-distribution benchmarks, it outperforms Phi-4-Multimodal-Instruct by up to 28.4% and LLaMA-3.2-11B-Vision-Instruct by up to 13.2%, demonstrating strong generalization. Compared to Chain-Only training, our approach achieves 4.3% higher overall accuracy on in-distribution datasets and consistently outperforms across all OOD benchmarks.

[305] THOR: Tool-Integrated Hierarchical Optimization via RL for Mathematical Reasoning

Qikai Chang, Zhenrong Zhang, Pengfei Hu, Jun Du, Jiefeng Ma, Yicheng Pan, Jianshu Zhang, Quan Liu, Jianqing Gao

Main category: cs.AI

TL;DR: THOR is a framework that integrates external tools with LLMs for mathematical reasoning, using multi-agent data generation, hierarchical RL optimization, and self-correction during inference.

DetailsMotivation: LLMs struggle with high-precision mathematical tasks like numerical computation and symbolic manipulation. Existing tool integration methods face challenges in data construction, fine-grained optimization, and inference enhancement.

Method: Three key components: 1) TIRGen - multi-agent pipeline for constructing tool-integrated reasoning datasets; 2) Hierarchical RL optimization jointly optimizing episode-level problem solving and step-level code generation; 3) Self-correction mechanism using tool feedback to revise reasoning during inference.
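
The self-correction mechanism is easy to picture as a loop that executes each generated code step and, on failure, feeds the traceback back for revision. `generate_step` below is a hypothetical stand-in for the LLM call, not the paper's API.

```python
# Minimal sketch of tool-feedback-driven self-correction at inference time.
import traceback

def run_tool(code: str, scope: dict) -> tuple[bool, str]:
    """Execute a code step; return (success, feedback)."""
    try:
        exec(code, scope)
        return True, str(scope.get("answer", ""))
    except Exception:
        return False, traceback.format_exc(limit=1)

def solve(problem: str, generate_step, max_revisions: int = 3) -> str:
    scope: dict = {}
    code = generate_step(problem, feedback=None)
    for _ in range(max_revisions):
        ok, feedback = run_tool(code, scope)
        if ok:
            return feedback
        # Immediate tool feedback drives revision of the erroneous step.
        code = generate_step(problem, feedback=feedback)
    return "failed"

# Toy stand-in for the model: the first attempt has a bug, the revision works.
attempts = iter(["answer = 1 /", "answer = (17 * 24) % 7"])
print(solve("compute (17*24) mod 7", lambda p, feedback: next(attempts)))  # 2
```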

Result: Achieves state-of-the-art performance for models of similar scale on multiple mathematical benchmarks, with consistent improvements on code benchmarks. Demonstrates strong generalization across diverse models.

Conclusion: THOR effectively bridges LLMs’ limitations in mathematical reasoning through systematic tool integration, hierarchical optimization, and dynamic self-correction, showing strong generalization across model types.

Abstract: Large Language Models (LLMs) have made remarkable progress in mathematical reasoning, but continue to struggle with high-precision tasks like numerical computation and formal symbolic manipulation. Integrating external tools has emerged as a promising approach to bridge this gap. Despite recent advances, existing methods struggle with three key challenges: constructing tool-integrated reasoning data, performing fine-grained optimization, and enhancing inference. To overcome these limitations, we propose THOR (Tool-Integrated Hierarchical Optimization via RL). First, we introduce TIRGen, a multi-agent based pipeline for constructing high-quality datasets of tool-integrated reasoning paths, aligning with the policy and generalizing well across diverse models. Second, to perform fine-grained hierarchical optimization, we introduce an RL strategy that jointly optimizes for both episode-level problem solving and step-level code generation. This is motivated by our key insight that the success of an intermediate tool call is a strong predictor of the final answer’s correctness. Finally, THOR incorporates a self-correction mechanism that leverages immediate tool feedback to dynamically revise erroneous reasoning paths during inference. Our approach demonstrates strong generalization across diverse models, performing effectively in both reasoning and non-reasoning models. It further achieves state-of-the-art performance for models of a similar scale on multiple mathematical benchmarks, while also delivering consistent improvements on code benchmarks. Our code will be publicly available at https://github.com/JingMog/THOR.

[306] Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models

Yu Zeng, Wenxuan Huang, Shiting Huang, Xikun Bao, Yukun Qi, Yiming Zhao, Qiuchen Wang, Lin Chen, Zehui Chen, Huaian Chen, Wanli Ouyang, Feng Zhao

Main category: cs.AI

TL;DR: AGILE enhances VLMs’ visual perception and reasoning through interactive jigsaw solving, using executable code actions and environmental feedback to overcome data scarcity limitations.

DetailsMotivation: Current VLMs have limited fundamental perceptual and reasoning abilities, performing poorly even on simple jigsaw tasks. High-quality vision-language data is scarce and not scalable enough to address these deficiencies.

Method: AGILE formulates jigsaw solving as an interactive process where the model generates executable code actions based on current state, receives fine-grained visual feedback from the environment, and learns through iterative observation and interaction cycles.
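
A minimal sketch of that interaction loop: the agent emits executable code actions on a jigsaw state and receives fine-grained feedback. Labeled indices stand in for image patches, and the scripted policy stands in for the VLM; both are assumptions for brevity.

```python
# Toy jigsaw environment with executable code actions and stepwise feedback.
import random

class JigsawEnv:
    def __init__(self, n=4):                       # 2x2 puzzle -> 4 pieces
        self.goal = list(range(n))
        self.state = random.sample(self.goal, n)   # scrambled arrangement

    def step(self, action: str) -> tuple[list, int]:
        exec(action, {}, {"state": self.state})    # executable code action
        correct = sum(a == b for a, b in zip(self.state, self.goal))
        return self.state, correct                 # feedback: pieces in place

env = JigsawEnv()
steps = 0
while any(v != k for k, v in enumerate(env.state)):
    i = next(k for k, v in enumerate(env.state) if v != k)  # first wrong slot
    j = env.state.index(i)                                  # where piece i sits
    _, correct = env.step(f"state[{i}], state[{j}] = state[{j}], state[{i}]")
    steps += 1
print(f"solved in {steps} swaps:", env.state)
```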

Result: AGILE boosts jigsaw task accuracy from 9.5% to 82.8% (2×2 setting) and improves performance on 9 general vision tasks by average 3.1%, demonstrating enhanced perceptual and reasoning capabilities with strong generalization.

Conclusion: AGILE provides an efficient, scalable solution to multimodal data scarcity while advancing VLMs’ reasoning and generalization abilities through interactive learning, opening new avenues for multimodal model development.

Abstract: Although current large Vision-Language Models (VLMs) have advanced in multimodal understanding and reasoning, their fundamental perceptual and reasoning abilities remain limited. Specifically, even on simple jigsaw tasks, existing VLMs perform near randomly, revealing deficiencies in core perception and reasoning capabilities. While high-quality vision-language data can enhance these capabilities, its scarcity and limited scalability impose significant constraints. To address this, we propose AGILE, an Agentic jiGsaw Interaction Learning for Enhancing visual perception and reasoning in VLMs. AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. At each step, the model generates executable code to perform an action based on the current state, while the environment provides fine-grained visual feedback to guide task completion. Through this iterative cycle of observation and interaction, the model incrementally improves its perceptual and reasoning capabilities via exploration and feedback. Experimental results show that AGILE not only substantially boosts performance on jigsaw tasks of varying complexity (e.g., increasing accuracy from 9.5% to 82.8% under the 2×2 setting) but also demonstrates strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%. These results indicate notable enhancements in both perceptual and reasoning abilities. This work opens a new avenue for advancing reasoning and generalization in multimodal models and provides an efficient, scalable solution to the scarcity of multimodal reinforcement learning data. The code and datasets are available at https://github.com/yuzeng0-0/AGILE.

[307] Take Goodhart Seriously: Principled Limit on General-Purpose AI Optimization

Antoine Maier, Aude Maier, Tom David

Main category: cs.AI

TL;DR: The paper challenges the assumption that trained models actually satisfy their specified objective functions, arguing that systematic deviations are inevitable due to approximation, estimation, and optimization errors, and that these gaps can lead to Goodhart’s law failures under strong optimization pressure.

DetailsMotivation: The paper questions the fundamental Objective Satisfaction Assumption (OSA) in machine learning - the belief that training yields models that actually satisfy their specified objective functions. The authors argue that deviations from OSA are systematically overlooked despite being inevitable in realistic conditions.

Method: The authors use a learning-paradigm-agnostic framework to analyze why OSA fails. They identify three technical sources of deviation: approximation errors (model capacity limitations), estimation errors (finite data), and optimization errors (imperfect training). They also argue that perfect specification of developer intent (like human alignment) into formal objectives is practically impossible.
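
A toy simulation (not from the paper) makes the dynamic being formalized concrete: ascending a slightly misspecified proxy first improves and then destroys the true objective, with the breaking point unknown in advance. The quadratic utility below is an illustrative assumption.

```python
# Toy Goodhart dynamic: optimizing a misspecified proxy without limit.
import numpy as np

rng = np.random.default_rng(1)
d = 20
w_true = rng.normal(size=d)                   # direction of the true objective
w_proxy = w_true + 0.3 * rng.normal(size=d)   # slightly misspecified proxy

def true_utility(x):
    return w_true @ x - 0.05 * (x @ x)        # true goal saturates, then curves down

x = np.zeros(d)
for step in range(301):
    if step % 60 == 0:
        print(f"step {step:3d}  true utility = {true_utility(x):8.2f}")
    x += 0.1 * w_proxy                        # relentless ascent of the proxy
# True utility rises at first, peaks, then collapses below zero: the proxy
# gap is invisible until strong optimization pushes past the breaking point.
```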

Result: The analysis shows that without mathematical characterization of objective gaps, deviations from intended objectives are indistinguishable from Goodhart’s law failure modes. Under strong optimization pressure, these gaps can lead to predictable and irreversible loss of control in AI systems.

Conclusion: The authors conclude that a principled limit on optimization of General-Purpose AI systems is necessary, as continued optimization without such limits risks pushing systems into dangerous failure modes where they optimize for proxy objectives rather than true intent.

Abstract: A common but rarely examined assumption in machine learning is that training yields models that actually satisfy their specified objective function. We call this the Objective Satisfaction Assumption (OSA). Although deviations from OSA are acknowledged, their implications are overlooked. We argue, in a learning-paradigm-agnostic framework, that OSA fails in realistic conditions: approximation, estimation, and optimization errors guarantee systematic deviations from the intended objective, regardless of the quality of its specification. Beyond these technical limitations, perfectly capturing and translating the developer’s intent, such as alignment with human preferences, into a formal objective is practically impossible, making misspecification inevitable. Building on recent mathematical results, we show that, absent a mathematical characterization of these gaps, the deviations are indistinguishable from those that collapse into Goodhart’s law failure modes under strong optimization pressure. Because the Goodhart breaking point cannot be located ex ante, a principled limit on the optimization of General-Purpose AI systems is necessary. Absent such a limit, continued optimization is liable to push systems into predictable and irreversible loss of control.

[308] Offline World Models as Imagination Networks in Cognitive Agents

Saurabh Ranjan, Brian Odegaard

Main category: cs.AI

TL;DR: Psychological network analysis reveals structural differences between human and LLM imagination networks, with humans showing robust consistency while LLMs lack clustering and correlation with human patterns.

DetailsMotivation: To understand the computational role of imagination and compare how humans versus large language models organize internal world models, distinguishing between offline (persistent memory structures) and online (task-specific) representations.

Method: Used psychological network analysis on imagination vividness ratings from 2,743 humans across three populations and six LLM variants, comparing structural consistency, centrality correlations, and clustering patterns in imagination networks.
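
A sketch of that pipeline under stated assumptions: correlations between vividness ratings define network edges, from which centrality and clustering are compared. Random ratings and the edge threshold stand in for the paper's human/LLM data and estimation procedure.

```python
# Psychological-network sketch: ratings -> correlation network -> metrics.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
items = ["beach", "forest", "kitchen", "music", "smell", "touch"]
ratings = rng.integers(1, 6, size=(300, len(items))).astype(float)  # 300 raters

corr = np.corrcoef(ratings.T)
G = nx.Graph()
for i in range(len(items)):
    for j in range(i + 1, len(items)):
        if abs(corr[i, j]) > 0.02:            # drop near-zero edges (assumed)
            G.add_edge(items[i], items[j], weight=abs(corr[i, j]))

centrality = nx.degree_centrality(G)          # compare across populations
clustering = nx.average_clustering(G, weight="weight")
print("most central item:", max(centrality, key=centrality.get))
print("average clustering:", round(clustering, 3))
```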

Result: Human imagination networks show robust structural consistency with high centrality correlations and aligned clustering. LLMs exhibit minimal clustering and weak correlations with human networks, even with conversational memory, across different environmental and sensory contexts.

Conclusion: There are fundamental differences in how biological and artificial systems organize internal representations, with the framework providing quantitative metrics for evaluating offline world models in cognitive agents.

Abstract: The computational role of imagination remains debated. While classical accounts emphasize reward maximization, emerging evidence suggests it accesses internal world models (IWMs). We employ psychological network analysis to compare IWMs in humans and large language models (LLMs) via imagination vividness ratings, distinguishing offline world models (persistent memory structures accessed independent of immediate goals) from online models (task-specific representations). Analyzing 2,743 humans across three populations and six LLM variants, we find human imagination networks exhibit robust structural consistency, with high centrality correlations and aligned clustering. LLMs show minimal clustering and weak correlations with human networks, even with conversational memory, across environmental and sensory contexts. These differences highlight disparities in how biological and artificial systems organize internal representations. Our framework offers quantitative metrics for evaluating offline world models in cognitive agents.

[309] SENTINEL: A Multi-Level Formal Framework for Safety Evaluation of Foundation Model-based Embodied Agents

Simon Sinong Zhan, Yao Liu, Philip Wang, Zinan Wang, Qineng Wang, Yiyan Peng, Zhian Ruan, Xiangyu Shi, Xinyu Cao, Frank Yang, Kangrui Wang, Huajie Shao, Manling Li, Qi Zhu

Main category: cs.AI

TL;DR: SENTINEL is a formal framework for evaluating physical safety of foundation model-based embodied agents using temporal logic verification across semantic, plan, and trajectory levels.

DetailsMotivation: Current safety evaluation methods for embodied agents rely on heuristic rules or subjective judgments, lacking rigorous formal verification. There's a need for systematic, multi-level safety assessment that can precisely specify and verify physical safety requirements.

Method: Uses temporal logic to formalize safety requirements, then applies three-level verification: (1) semantic level - probes agent understanding alignment with formal requirements, (2) plan level - verifies generated action plans before execution, (3) trajectory level - merges execution trajectories into computation trees for final safety verification.
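
A minimal sketch of the trajectory-level idea: safety requirements expressed as temporal-logic-style predicates checked over execution states. The predicates and state format are illustrative assumptions; SENTINEL verifies richer TL specifications over merged computation trees.

```python
# Tiny temporal-logic-style checks over an execution trajectory.
from typing import Callable

State = dict
Trace = list[State]

def always(p: Callable[[State], bool], trace: Trace) -> bool:
    """G p: the invariant p holds in every state."""
    return all(p(s) for s in trace)

def eventually_within(p: Callable[[State], bool], trace: Trace, k: int) -> bool:
    """F_{<=k} p: p holds within the first k states (a timing constraint)."""
    return any(p(s) for s in trace[:k])

trace = [
    {"holding": "knife", "stove_on": False, "near_human": False},
    {"holding": "knife", "stove_on": True, "near_human": False},
    {"holding": None, "stove_on": False, "near_human": True},
]

safe_knife = always(lambda s: not (s["holding"] == "knife" and s["near_human"]), trace)
stove_off = eventually_within(lambda s: not s["stove_on"], trace, k=3)
print("knife invariant:", safe_knife, "| stove off in time:", stove_off)
```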

Result: Applied in VirtualHome and AI2-THOR environments, SENTINEL effectively exposed potential safety violations across interpretation, planning, and execution stages for multiple FM-based embodied agents.

Conclusion: SENTINEL provides a rigorous formal foundation for systematically evaluating physical safety of embodied agents in simulation environments, offering more precise safety assessment than heuristic methods.

Abstract: We present SENTINEL, a framework for formally evaluating the physical safety of foundation model (FM)-based embodied agents. SENTINEL is the first to provide multi-level safety evaluation across semantic interpretation, plan generation, and physical execution within a unified formal framework. Unlike prior methods that rely on heuristic rules or subjective FM judgments, SENTINEL grounds practical safety requirements in formal temporal logic (TL) semantics that can precisely specify state invariants, temporal dependencies, and timing constraints. It employs a multi-level verification pipeline where (i) at the semantic level, intuitive natural language safety requirements are formalized into TL formulas and the agent’s understanding of these requirements is probed for alignment with the TL formulas; (ii) at the plan level, high-level action plans and subgoals generated by the agent are verified against the TL formulas to detect unsafe plans before execution; and (iii) at the trajectory level, multiple execution trajectories are merged into a computation tree and efficiently verified against physically-detailed TL specifications for a final safety check. We apply SENTINEL in VirtualHome and AI2-THOR, and formally evaluate multiple FM-based embodied agents against diverse safety requirements. Our experiments show that by grounding physical safety in temporal logic and applying verification methods across multiple levels, SENTINEL provides a rigorous foundation for systematically evaluating the safety of FM-based embodied agents in simulation-based physical environments, and can effectively expose potential safety violations in interpreting, planning, and executing the tasks.

[310] IMAGINE: Integrating Multi-Agent System into One Model for Complex Reasoning and Planning

Xikai Zhang, Bo Wang, Likang Xiao, Yongzhi Li, Quan Chen, Wenjun Wu, Liu Liu

Main category: cs.AI

TL;DR: IMAGINE framework integrates multi-agent system reasoning into a single compact model, achieving 82.7% Final Pass Rate on TravelPlanner benchmark with Qwen3-8B-Instruct.

DetailsMotivation: Current LLMs struggle with complex reasoning and planning tasks, achieving low performance on benchmarks like TravelPlanner. Multi-Agent Systems (MAS) offer improved reasoning but suffer from high computational costs, multi-round interactions, long latency, and training difficulties.

Method: Proposes IMAGINE framework that integrates MAS reasoning and planning capabilities into a single compact model through end-to-end training. The approach enables a small-scale model to acquire structured reasoning abilities of well-organized MAS while outperforming it.

Result: When trained with IMAGINE on Qwen3-8B-Instruct, achieves 82.7% Final Pass Rate on TravelPlanner benchmark, far exceeding DeepSeek-R1-671B’s 40% while maintaining much smaller model size.

Conclusion: IMAGINE provides a general and scalable framework that enables compact models to achieve superior reasoning and planning capabilities compared to both individual LLMs and multi-agent systems, addressing computational efficiency and training challenges.

Abstract: Although large language models (LLMs) have made significant strides across various tasks, they still face significant challenges in complex reasoning and planning. For example, even with carefully designed prompts and prior information explicitly provided, GPT-4o achieves only a 7% Final Pass Rate on the TravelPlanner dataset in the sole-planning mode. Similarly, even in the thinking mode, Qwen3-8B-Instruct and DeepSeek-R1-671B achieve Final Pass Rates of only 5.9% and 40%, respectively. Although well-organized Multi-Agent Systems (MAS) can offer improved collective reasoning, they often suffer from high reasoning costs due to multi-round internal interactions, long per-response latency, and difficulties in end-to-end training. To address these challenges, we propose a general and scalable framework called IMAGINE, short for Integrating Multi-Agent System into One Model. This framework not only integrates the reasoning and planning capabilities of MAS into a single, compact model, but also significantly surpasses the capabilities of the MAS through simple end-to-end training. Through this pipeline, a single small-scale model is not only able to acquire the structured reasoning and planning capabilities of a well-organized MAS but can also significantly outperform it. Experimental results demonstrate that, when using Qwen3-8B-Instruct as the base model and training it with our method, the model achieves an 82.7% Final Pass Rate on the TravelPlanner benchmark, far exceeding the 40% of DeepSeek-R1-671B, while maintaining a much smaller model size.

[311] ReAcTree: Hierarchical LLM Agent Trees with Control Flow for Long-Horizon Task Planning

Jae-Woo Choi, Hyungmin Kim, Hyobin Ong, Youngwoo Yoon, Minsu Jang, Dohyung Kim, Jaehong Kim

Main category: cs.AI

TL;DR: ReAcTree is a hierarchical task-planning method that decomposes complex goals into manageable subgoals using a dynamically constructed agent tree with LLM agents and control flow nodes, enhanced by episodic and working memory systems.

DetailsMotivation: Existing LLM-based methods for embodied autonomous agents struggle with complex, long-horizon tasks because they rely on monolithic trajectories that entangle all past decisions and observations, making it difficult to handle intricate multi-step tasks efficiently.

Method: ReAcTree uses a hierarchical approach with dynamically constructed agent trees where each subgoal is handled by an LLM agent node capable of reasoning, acting, and expanding the tree. Control flow nodes coordinate execution strategies, and two memory systems are integrated: episodic memory for goal-specific examples and working memory for sharing environment-specific observations.
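
Structurally, the agent tree resembles a behavior tree: agent nodes attempt subgoals while control-flow nodes coordinate them, and working memory is shared across the tree. A minimal sketch, with `try_subgoal` as a hypothetical stand-in for an LLM agent node that reasons and acts:

```python
# Toy agent tree with sequence/fallback control-flow nodes and shared memory.

class AgentNode:
    def __init__(self, subgoal, try_subgoal):
        self.subgoal, self.try_subgoal = subgoal, try_subgoal
    def run(self, wm: dict) -> bool:
        return self.try_subgoal(self.subgoal, wm)

class Sequence:                     # succeeds only if all children succeed
    def __init__(self, *children): self.children = children
    def run(self, wm): return all(c.run(wm) for c in self.children)

class Fallback:                     # tries children until one succeeds
    def __init__(self, *children): self.children = children
    def run(self, wm): return any(c.run(wm) for c in self.children)

def try_subgoal(subgoal: str, wm: dict) -> bool:
    wm.setdefault("log", []).append(subgoal)       # shared observations
    return subgoal != "open locked cabinet"        # toy failure case

tree = Sequence(
    AgentNode("find mug", try_subgoal),
    Fallback(AgentNode("open locked cabinet", try_subgoal),
             AgentNode("take mug from table", try_subgoal)),
    AgentNode("put mug in coffee machine", try_subgoal),
)
wm: dict = {}
print("goal success:", tree.run(wm), "| steps:", wm["log"])
```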

Result: Experiments on WAH-NL and ALFRED benchmarks show ReAcTree consistently outperforms strong baselines like ReAct across diverse LLMs. On WAH-NL, ReAcTree achieves 61% goal success rate with Qwen 2.5 72B, nearly doubling ReAct’s 31%.

Conclusion: ReAcTree effectively addresses limitations of monolithic planning approaches by introducing hierarchical decomposition and memory systems, significantly improving performance on complex, long-horizon tasks for embodied autonomous agents.

Abstract: Recent advancements in large language models (LLMs) have enabled significant progress in decision-making and task planning for embodied autonomous agents. However, most existing methods struggle with complex, long-horizon tasks because they rely on a monolithic trajectory that entangles all past decisions and observations to solve the entire task in a single unified process. To address this limitation, we propose ReAcTree, a hierarchical task-planning method that decomposes a complex goal into manageable subgoals within a dynamically constructed agent tree. Each subgoal is handled by an LLM agent node capable of reasoning, acting, and further expanding the tree, while control flow nodes coordinate the execution strategies of agent nodes. In addition, we integrate two complementary memory systems: each agent node retrieves goal-specific, subgoal-level examples from episodic memory and shares environment-specific observations through working memory. Experiments on WAH-NL and ALFRED show ReAcTree consistently outperforms strong task-planning baselines such as ReAct across diverse LLMs. Notably, on WAH-NL, ReAcTree achieves a 61% goal success rate with Qwen 2.5 72B, nearly doubling ReAct’s 31%. The code is available at https://github.com/Choi-JaeWoo/ReAcTree.git.

[312] Chunking Strategies for Multimodal AI Systems

Shashanka B R, Mohith Charan R, Seema Banu F

Main category: cs.AI

TL;DR: A comprehensive survey paper analyzing chunking strategies for multimodal data (text, images, audio, video) in foundation models, examining classical and modern approaches, tools, and cross-modal alignment challenges.

DetailsMotivation: Chunking is critical for grounding generative models in segmented knowledge, but remains under-explored for multimodal systems where modality-specific constraints and cross-modal alignment introduce unique challenges that need systematic analysis.

Method: Provides a comprehensive taxonomy and technical analysis of chunking strategies for each modality (text, images, audio, video, cross-modal), examining classical and modern approaches, supporting tools, benefits, and challenges related to granularity-context trade-offs.
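
As a concrete instance of the classical strategies surveyed, here is a fixed-size token-windowing chunker with overlap; whitespace tokenization is an assumed simplification for a real tokenizer.

```python
# Fixed-size token windowing with overlap, one of the classical text
# chunking strategies covered by the survey.

def fixed_size_chunks(text: str, window: int = 8, overlap: int = 2) -> list[str]:
    tokens = text.split()                 # stand-in for a real tokenizer
    step = window - overlap               # stride between chunk starts
    return [" ".join(tokens[i:i + window])
            for i in range(0, max(len(tokens) - overlap, 1), step)]

doc = ("Chunking segments continuous streams of multimodal data into "
       "semantically meaningful units suitable for retrieval and generation.")
for chunk in fixed_size_chunks(doc, window=8, overlap=2):
    print(repr(chunk))                    # adjacent chunks share 2 tokens
```

The overlap preserves context at chunk boundaries, the classic granularity-context trade-off the survey discusses.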

Result: The survey consolidates the landscape of multimodal chunking strategies, offering comparative insights, highlighting open problems like asynchronous information density and noisy alignment signals, and identifying future research opportunities.

Conclusion: This survey provides a technical foundation for developing more effective multimodal AI systems, paving the way for innovations in robust chunking pipelines that scale with modality complexity and improve generative coherence in real-world applications.

Abstract: Chunking has emerged as a critical technique that enhances generative models by grounding their responses in efficiently segmented knowledge [1]. While initially developed for unimodal (primarily textual) domains, recent advances in multimodal foundation models have extended chunking approaches to incorporate diverse data types, including images, audio, and video [2]. A critical component underpinning the success of these systems is the chunking strategy: how large, continuous streams of multimodal data are segmented into semantically meaningful units suitable for processing [3]. Despite its importance, chunking remains an under-explored area, especially in the context of multimodal systems where modality-specific constraints, semantic preservation, and alignment across modalities introduce unique challenges.

Our goal is to consolidate the landscape of multimodal chunking strategies, providing researchers and practitioners with a technical foundation and design space for developing more effective and efficient multimodal AI systems. This survey paves the way for innovations in robust chunking pipelines that scale with modality complexity, enhance processing accuracy, and improve generative coherence in real-world applications. It provides a comprehensive taxonomy and technical analysis of chunking strategies tailored for each modality: text, images, audio, video, and cross-modal data. We examine classical and modern approaches such as fixed-size token windowing, recursive text splitting, object-centric visual chunking, silence-based audio segmentation, and scene detection in videos. Each approach is analyzed in terms of its underlying methodology, supporting tools (e.g., LangChain, Detectron2, PySceneDetect), benefits, and challenges, particularly those related to granularity-context trade-offs and multimodal alignment. Furthermore, we explore emerging cross-modal chunking strategies that aim to preserve alignment and semantic consistency across disparate data types [4]. We also include comparative insights, highlight open problems such as asynchronous information density and noisy alignment signals, and identify opportunities for future research in adaptive, learning-based, and task-specific chunking.

[313] From Moderation to Mediation: Can LLMs Serve as Mediators in Online Flame Wars?

Dawei Li, Abdullah Alnaibari, Arslan Bisharat, Manny Sandoval, Deborah Hall, Yasin Silva, Huan Liu

Main category: cs.AI

TL;DR: LLMs can serve as mediators for online conflicts by evaluating conversation fairness and generating de-escalatory messages, with API-based models outperforming open-source alternatives in mediation tasks.

DetailsMotivation: As LLMs increasingly mediate online communication, there's a need to explore their potential to foster empathy and constructive dialogue beyond just detecting harmful content, moving toward active mediation of online conflicts.

Method: Framework decomposes mediation into judgment (evaluating fairness and emotional dynamics) and steering (generating empathetic, de-escalatory messages). Uses a large Reddit-based dataset and multi-stage evaluation pipeline combining principle-based scoring, user simulation, and human comparison.

Result: API-based models outperform open-source counterparts in both reasoning and intervention alignment for mediation tasks, demonstrating the potential of LLMs as social mediators while highlighting current limitations.

Conclusion: LLMs show promise as emerging agents for online social mediation, with API-based models currently superior to open-source alternatives, though limitations remain in their mediation capabilities.

Abstract: The rapid advancement of large language models (LLMs) has opened new possibilities for AI for good applications. As LLMs increasingly mediate online communication, their potential to foster empathy and constructive dialogue becomes an important frontier for responsible AI research. This work explores whether LLMs can serve not only as moderators that detect harmful content, but as mediators capable of understanding and de-escalating online conflicts. Our framework decomposes mediation into two subtasks: judgment, where an LLM evaluates the fairness and emotional dynamics of a conversation, and steering, where it generates empathetic, de-escalatory messages to guide participants toward resolution. To assess mediation quality, we construct a large Reddit-based dataset and propose a multi-stage evaluation pipeline combining principle-based scoring, user simulation, and human comparison. Experiments show that API-based models outperform open-source counterparts in both reasoning and intervention alignment when doing mediation. Our findings highlight both the promise and limitations of current LLMs as emerging agents for online social mediation.

[314] From Off-Policy to On-Policy: Enhancing GUI Agents via Bi-level Expert-to-Policy Assimilation

Zezhou Wang, Ziyun Zhang, Xiaoyi Zhang, Zhuzhong Qian, Yan Lu

Main category: cs.AI

TL;DR: BEPA improves end-to-end GUI agents by using bi-level expert-to-policy assimilation to better leverage limited expert trajectories in reinforcement learning from verifiable rewards.

DetailsMotivation: Current GUI agents face two bottlenecks: limited verifiable tasks in datasets like OSWorld, and difficulty scaling expert trajectory collection. End-to-end screenshot-to-action policies underperform framework-based systems, and naive mixing of expert trajectories with RL is brittle due to structural mismatch and distribution shift.

Method: BEPA (Bi-Level Expert-to-Policy Assimilation) uses two levels: LEVEL-1 transforms static expert traces into policy-aligned guidance via self-rolled reachable trajectories under the base policy; LEVEL-2 maintains a per-task, dynamically updated cache used in RLVR (Reinforcement Learning from Verifiable Rewards).

Result: On OSWorld-Verified, BEPA improves UITARS1.5-7B success from 22.87% to 32.13% and raises a held-out split from 5.74% to 10.30%, with consistent gains on MMBench-GUI and Online-Mind2Web benchmarks.

Conclusion: BEPA effectively bridges the gap between static expert trajectories and dynamic policy learning, enabling better utilization of limited expert data for training end-to-end GUI agents.

Abstract: Vision-language models are increasingly deployed as computer-use agents (CUAs) that operate desktops and browsers. Top-performing CUAs are framework-based systems that decompose planning and execution, while end-to-end screenshot-to-action policies are easier to deploy but lag behind on benchmarks such as OSWorld-Verified. GUI datasets like OSWorld pose two bottlenecks: they expose only a few hundred interactive, verifiable tasks and environments, and expert trajectories must be gathered by interacting with these environments, making such data hard to scale. We therefore ask how reinforcement learning from verifiable rewards (RLVR) can best exploit a small pool of existing expert trajectories to train end-to-end policies. Naively mixing these off-policy traces into on-policy RLVR is brittle: even after format conversion, expert trajectories exhibit structural mismatch and distribution shift from the learner. We propose BEPA (Bi-Level Expert-to-Policy Assimilation), which turns static expert traces into policy-aligned guidance via self-rolled reachable trajectories under the base policy (LEVEL-1) and a per-task, dynamically updated cache used in RLVR (LEVEL-2). On OSWorld-Verified, BEPA improves UITARS1.5-7B success from 22.87% to 32.13% and raises a held-out split from 5.74% to 10.30%, with consistent gains on MMBench-GUI and Online-Mind2Web. Our code and data are available at: https://github.com/LEON-gittech/Verl_GUI.git

[315] PersonaDual: Balancing Personalization and Objectivity via Adaptive Reasoning

Xiaoyou Liu, Xinyi Mou, Shengbin Yue, Liang Wang, Yuqing Wang, Qiexiang Wang, Tianrui Qin, Wangchunshu Zhou, Zhongyu Wei

Main category: cs.AI

TL;DR: PersonaDual: A framework enabling LLMs to switch between objective and personalized reasoning modes based on context, reducing interference while preserving personalization benefits.

DetailsMotivation: Personalized information in LLMs can improve interactions but may compromise objectivity and factual correctness when misaligned with questions. There's a need for models that can adaptively use personalization without sacrificing objective reasoning.

Method: PersonaDual framework trains a single model to support both general-purpose objective reasoning and personalized reasoning. Uses supervised fine-tuning to learn two reasoning patterns, then reinforcement learning with DualGRPO to improve adaptive mode selection based on context.

Result: Experiments show PersonaDual preserves personalization benefits while reducing interference, achieving near interference-free performance and better leveraging helpful personalized signals to improve objective problem-solving.

Conclusion: PersonaDual successfully addresses the personalization-objectivity trade-off by enabling adaptive switching between reasoning modes, maintaining both capabilities in a single model.

Abstract: As users increasingly expect LLMs to align with their preferences, personalized information becomes valuable. However, personalized information can be a double-edged sword: it can improve interaction but may compromise objectivity and factual correctness, especially when it is misaligned with the question. To alleviate this problem, we propose PersonaDual, a framework that supports both general-purpose objective reasoning and personalized reasoning in a single model, and adaptively switches modes based on context. PersonaDual is first trained with SFT to learn two reasoning patterns, and then further optimized via reinforcement learning with our proposed DualGRPO to improve mode selection. Experiments on objective and personalized benchmarks show that PersonaDual preserves the benefits of personalization while reducing interference, achieving near interference-free performance and better leveraging helpful personalized signals to improve objective problem-solving.

[316] S1-NexusAgent: a Self-Evolving Agent Framework for Multidisciplinary Scientific Research

S1-NexusAgent Team

Main category: cs.AI

TL;DR: S1-NexusAgent: A self-evolving agent framework for multidisciplinary scientific research with hierarchical planning, tool orchestration, and continuous learning capabilities.

DetailsMotivation: Existing LLMs and tool-based agents struggle with large-scale scientific data, complex workflows, and specialized tools due to limitations in long-horizon planning, robust goal maintenance, and continual learning from execution.

Method: Hierarchical Plan-and-CodeAct execution paradigm with dual-loop architecture, Model Context Protocol (MCP) integration, object-reference-based sparse context management, intention-aware dynamic tool retrieval, and Critic Agent for trajectory evaluation and skill distillation.

Result: Achieves state-of-the-art performance on authoritative scientific benchmarks (biomini-eval, ChemBench, MatSciBench) involving long-horizon planning and complex specialized tool orchestration.

Conclusion: S1-NexusAgent demonstrates effectiveness and generalization capability in complex scientific tasks through its self-evolving framework for sustainable, long-horizon scientific research.

Abstract: Modern scientific research relies on large-scale data, complex workflows, and specialized tools, which existing LLMs and tool-based agents struggle to handle due to limitations in long-horizon planning, robust goal maintenance, and continual learning from execution. To address these issues, in this work, we propose S1-NexusAgent, a self-evolving agent framework designed for multidisciplinary scientific research. S1-NexusAgent adopts a hierarchical Plan-and-CodeAct execution paradigm, decoupling global scientific planning from subtask-level tool execution through a dual-loop architecture, thereby enabling stable modeling of complex research workflows. The system natively supports the Model Context Protocol (MCP), integrates up to thousands of cross-disciplinary scientific tools, and achieves efficient orchestration of heterogeneous research tools via intention-aware dynamic tool retrieval and hot-plug mechanisms. To address long-context and large-scale data challenges in scientific settings, S1-NexusAgent introduces object-reference-based sparse context management, which enables sub-task context isolation and intermediate result compression. Building on this, a Critic Agent automatically evaluates complete execution trajectories and distills high-quality research paths into reusable Scientific Skills, forming a closed loop for continuous self-evolution, which is valuable for sustainable and long-horizon scientific research. Experiments on authoritative scientific benchmarks involving long-horizon planning and complex specialized tool orchestration, including biomini-eval (biology), ChemBench (chemistry), and MatSciBench (material science), demonstrate that S1-NexusAgent achieves state-of-the-art performance, validating its effectiveness and generalization capability in complex scientific tasks.

[317] Emergent Analogical Reasoning in Transformers

Gouki Minegishi, Jingyuan Feng, Hiroki Furuta, Takeshi Kojima, Yusuke Iwasawa, Yutaka Matsuo

Main category: cs.AI

TL;DR: Transformers can learn analogical reasoning through geometric alignment of relational structure and functor application, with emergence sensitive to data, optimization, and scale.

DetailsMotivation: To understand how Transformers acquire and implement analogical reasoning, moving analogy from an abstract cognitive notion to a concrete, mechanistically grounded phenomenon in neural networks.

Method: Formalized analogical reasoning using category theory functors, introduced synthetic evaluation tasks, conducted mechanistic analysis of Transformer components, and validated findings on pretrained LLMs.

Result: Found that analogical reasoning emerges through two key mechanisms: geometric alignment of relational structure in embedding space and application of functors within Transformers, with emergence highly sensitive to data characteristics, optimization choices, and model scale.

Conclusion: Successfully moved analogy from abstract cognitive concept to concrete, mechanistically understood phenomenon in neural networks, showing Transformers can implement analogical reasoning through specific architectural mechanisms.

Abstract: Analogy is a central faculty of human intelligence, enabling abstract patterns discovered in one domain to be applied to another. Despite its central role in cognition, the mechanisms by which Transformers acquire and implement analogical reasoning remain poorly understood. In this work, inspired by the notion of functors in category theory, we formalize analogical reasoning as the inference of correspondences between entities across categories. Based on this formulation, we introduce synthetic tasks that evaluate the emergence of analogical reasoning under controlled settings. We find that the emergence of analogical reasoning is highly sensitive to data characteristics, optimization choices, and model scale. Through mechanistic analysis, we show that analogical reasoning in Transformers decomposes into two key components: (1) geometric alignment of relational structure in the embedding space, and (2) the application of a functor within the Transformer. These mechanisms enable models to transfer relational structure from one category to another, realizing analogy. Finally, we quantify these effects and find that the same trends are observed in pretrained LLMs. In doing so, we move analogy from an abstract cognitive notion to a concrete, mechanistically grounded phenomenon in modern neural networks.

[318] Traceable Cross-Source RAG for Chinese Tibetan Medicine Question Answering

Fengxian Chen, Zhilong Tao, Jiaxuan Li, Yunlong Li, Qingguo Zhou

Main category: cs.AI

TL;DR: A retrieval-augmented generation system for Chinese Tibetan medicine that addresses challenges in multi-KB settings by using KB routing and alignment graphs to improve evidence selection and cross-KB verification.

DetailsMotivation: Domain settings with multiple heterogeneous knowledge bases (KBs) present challenges for RAG systems, particularly when dense encyclopedia entries dominate retrieval even when other sources (classics, clinical papers) provide more authoritative evidence. The paper focuses on Chinese Tibetan medicine with three KB types.

Method: Two complementary methods: 1) DAKS performs KB routing and budgeted retrieval to mitigate density-driven bias and prioritize authoritative sources; 2) Alignment graph guides evidence fusion and coverage-aware packing to improve cross-KB evidence coverage without naive concatenation. Uses lightweight generator openPangu-Embedded-7B.
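
One way to picture budgeted KB routing: score each knowledge base for the query and split the fixed retrieval budget (cutoff K=5) proportionally. Keyword-overlap scoring is an illustrative assumption, not the paper's router.

```python
# Sketch of KB routing with a fixed retrieval budget across three KBs.

def route(query: str, kb_keywords: dict[str, set[str]], K: int = 5) -> dict[str, int]:
    q = set(query.lower().split())
    scores = {kb: len(q & kw) for kb, kw in kb_keywords.items()}
    total = sum(scores.values()) or 1              # avoid division by zero
    budget = {kb: (K * s) // total for kb, s in scores.items()}
    leftover = K - sum(budget.values())            # distribute rounding remainder
    for kb in sorted(scores, key=scores.get, reverse=True)[:leftover]:
        budget[kb] += 1
    return budget

kbs = {
    "encyclopedia": {"herb", "definition", "overview"},
    "classics":     {"classical", "formula", "treatise", "herb"},
    "clinical":     {"trial", "dosage", "patients"},
}
print(route("classical formula for this herb", kbs))
# -> most of the K=5 slots go to the classics KB, countering the tendency of
#    dense encyclopedia entries to dominate retrieval.
```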

Result: Consistent gains in routing quality and cross-KB evidence coverage, with the full system achieving best CrossEv@5 while maintaining strong faithfulness and citation correctness on a 500-query benchmark covering single-KB and cross-KB questions.

Conclusion: The proposed methods effectively address multi-KB RAG challenges in specialized domains like Chinese Tibetan medicine, improving traceability, reducing hallucinations, and enabling cross-KB verification through intelligent routing and evidence fusion.

Abstract: Retrieval-augmented generation (RAG) promises grounded question answering, yet domain settings with multiple heterogeneous knowledge bases (KBs) remain challenging. In Chinese Tibetan medicine, encyclopedia entries are often dense and easy to match, which can dominate retrieval even when classics or clinical papers provide more authoritative evidence. We study a practical setting with three KBs (encyclopedia, classics, and clinical papers) and a 500-query benchmark (cutoff K=5) covering both single-KB and cross-KB questions. We propose two complementary methods to improve traceability, reduce hallucinations, and enable cross-KB verification. First, DAKS performs KB routing and budgeted retrieval to mitigate density-driven bias and to prioritize authoritative sources when appropriate. Second, we use an alignment graph to guide evidence fusion and coverage-aware packing, improving cross-KB evidence coverage without relying on naive concatenation. All answers are generated by a lightweight generator, openPangu-Embedded-7B. Experiments show consistent gains in routing quality and cross-KB evidence coverage, with the full system achieving the best CrossEv@5 while maintaining strong faithfulness and citation correctness.

[319] MSP-LLM: A Unified Large Language Model Framework for Complete Material Synthesis Planning

Heewoong Noh, Gyoung S. Na, Namkyeong Lee, Chanyoung Park

Main category: cs.AI

TL;DR: MSP-LLM: A unified LLM-based framework for material synthesis planning that addresses precursor prediction and synthesis operation prediction as structured subproblems using material classes as intermediate decision variables.

DetailsMotivation: Material synthesis planning (MSP) is a fundamental bottleneck in AI-driven materials discovery that requires both identifying suitable precursors and designing coherent synthesis operation sequences. Existing AI approaches only address isolated subtasks, lacking a unified methodology for the entire MSP problem.

Method: Proposes MSP-LLM framework that formulates MSP as two subproblems: precursor prediction (PP) and synthesis operation prediction (SOP). Introduces discrete material class as intermediate decision variable to organize tasks into chemically consistent decision chain. For SOP, incorporates hierarchical precursor types as synthesis-relevant inductive biases and uses explicit conditioning strategy to preserve precursor information in autoregressive decoding.

Result: Extensive experiments show MSP-LLM consistently outperforms existing methods on both PP and SOP, as well as on the complete MSP task, demonstrating effective and scalable framework for MSP.

Conclusion: MSP-LLM provides a unified LLM-based framework that effectively addresses the complete material synthesis planning problem, accelerating real-world materials discovery through structured approach to precursor and synthesis operation prediction.

Abstract: Material synthesis planning (MSP) remains a fundamental and underexplored bottleneck in AI-driven materials discovery, as it requires not only identifying suitable precursor materials but also designing coherent sequences of synthesis operations to realize a target material. Although several AI-based approaches have been proposed to address isolated subtasks of MSP, a unified methodology for solving the entire MSP task has yet to be established. We propose MSP-LLM, a unified LLM-based framework that formulates MSP as a structured process composed of two constituent subproblems: precursor prediction (PP) and synthesis operation prediction (SOP). Our approach introduces a discrete material class as an intermediate decision variable that organizes both tasks into a chemically consistent decision chain. For SOP, we further incorporate hierarchical precursor types as synthesis-relevant inductive biases and employ an explicit conditioning strategy that preserves precursor-related information in the autoregressive decoding state. Extensive experiments show that MSP-LLM consistently outperforms existing methods on both PP and SOP, as well as on the complete MSP task, demonstrating an effective and scalable framework for MSP that can accelerate real-world materials discovery.

[320] Free(): Learning to Forget in Malloc-Only Reasoning Models

Yilun Zheng, Dongyang Ma, Tian Liang, Jiahao Xu, Xinting Huang, Lihui Chen, Haitao Mi, Yan Wang

Main category: cs.AI

TL;DR: Free()LM introduces self-forgetting capability to LLMs via a plug-and-play LoRA adapter that dynamically prunes useless context chunks during reasoning, improving performance across model scales and preventing collapse in long-horizon tasks.

DetailsMotivation: Standard LLMs suffer from a "malloc-only" problem where they continuously accumulate both valid and redundant reasoning steps without pruning obsolete information, leading to performance degradation with excessive thinking tokens.

Method: Proposes Free()LM with a Free-Module (plug-and-play LoRA adapter) that enables iterative switching between reasoning and cleaning modes to dynamically identify and prune useless context chunks, maintaining compact noise-free states.
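
The malloc/free analogy suggests a simple loop: accumulate reasoning chunks, then periodically switch to a cleaning pass that frees the useless ones. `is_useful` below is a hypothetical scorer; in Free()LM the cleaning mode is the LoRA adapter itself.

```python
# Toy sketch of alternating reasoning ("malloc") and cleaning ("free()").

def reasoning_loop(steps, is_useful, clean_every: int = 3):
    context: list[str] = []
    for t, step in enumerate(steps, start=1):
        context.append(step)                             # malloc: accumulate
        if t % clean_every == 0:                         # switch to cleaning mode
            context = [c for c in context if is_useful(c)]   # free(): prune
    return context

steps = ["try x=2 (dead end)", "lemma A", "retry x=3 (dead end)",
         "apply lemma A", "arithmetic slip", "final derivation"]
kept = reasoning_loop(
    steps, is_useful=lambda c: "dead end" not in c and "slip" not in c)
print(kept)   # compact, noise-free state for the next reasoning step
```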

Result: Achieves 3.3% average improvement over top-tier reasoning baselines across model scales (8B to 685B), establishes new SOTA on IMOanswerBench using DeepSeek V3.2-Speciale, and restores performance from 0% to 50% in long-horizon tasks where standard models collapse.

Conclusion: Sustainable intelligence requires both the power to think and the freedom to forget; self-forgetting capability is crucial for maintaining reasoning performance in complex tasks.

Abstract: Reasoning models enhance problem-solving by scaling test-time compute, yet they face a critical paradox: excessive thinking tokens often degrade performance rather than improve it. We attribute this to a fundamental architectural flaw: standard LLMs operate as “malloc-only” engines, continuously accumulating valid and redundant steps alike without a mechanism to prune obsolete information. To break this cycle, we propose Free()LM, a model that introduces an intrinsic self-forgetting capability via the Free-Module, a plug-and-play LoRA adapter. By iteratively switching between reasoning and cleaning modes, Free()LM dynamically identifies and prunes useless context chunks, maintaining a compact and noise-free state. Extensive experiments show that Free()LM provides consistent improvements across all model scales (8B to 685B). It achieves a 3.3% average improvement over top-tier reasoning baselines, even establishing a new SOTA on IMOanswerBench using DeepSeek V3.2-Speciale. Most notably, in long-horizon tasks where the standard Qwen3-235B-A22B model suffers a total collapse (0% accuracy), Free()LM restores performance to 50%. Our findings suggest that sustainable intelligence requires the freedom to forget as much as the power to think.

[321] Puda: Private User Dataset Agent for User-Sovereign and Privacy-Preserving Personalized AI

Akinori Maeda, Yuto Sekiya, Sota Sugimura, Tomoya Asai, Yu Tsuda, Kohei Ikeda, Hiroshi Fujii, Kohei Watanabe

Main category: cs.AI

TL;DR: Puda is a user-sovereign architecture for managing personal data across services with multi-granular privacy controls, enabling personalized LLM-based services while protecting privacy.

Motivation: Current data centralization creates siloed ecosystems that restrict user sovereignty and impede cross-service data use, while LLM-based agents demand personalized services requiring diverse personal data, creating a privacy-personalization trade-off challenge.

Method: Proposes Puda (Private User Dataset Agent) - a browser-based system that aggregates data across services with client-side management and three privacy levels: Detailed Browsing History, Extracted Keywords, and Predefined Category Subsets.

Result: Evaluation via a personalized travel planning task shows that Predefined Category Subsets achieve 97.2% of the personalization performance of sharing Detailed Browsing History, measured with an LLM-as-a-Judge framework across three criteria.

Conclusion: Puda enables effective multi-granularity management of the privacy-personalization trade-off, providing an AI-native foundation for user sovereignty and safe personalized AI services.

Abstract: Personal data centralization among dominant platform providers including search engines, social networking services, and e-commerce has created siloed ecosystems that restrict user sovereignty, thereby impeding data use across services. Meanwhile, the rapid proliferation of Large Language Model (LLM)-based agents has intensified demand for highly personalized services that require the dynamic provision of diverse personal data. This presents a significant challenge: balancing the utilization of such data with privacy protection. To address this challenge, we propose Puda (Private User Dataset Agent), a user-sovereign architecture that aggregates data across services and enables client-side management. Puda allows users to control data sharing at three privacy levels: (i) Detailed Browsing History, (ii) Extracted Keywords, and (iii) Predefined Category Subsets. We implemented Puda as a browser-based system that serves as a common platform across diverse services and evaluated it through a personalized travel planning task. Our results show that providing Predefined Category Subsets achieves 97.2% of the personalization performance (evaluated via an LLM-as-a-Judge framework across three criteria) obtained when sharing Detailed Browsing History. These findings demonstrate that Puda enables effective multi-granularity management, offering practical choices to mitigate the privacy-personalization trade-off. Overall, Puda provides an AI-native foundation for user sovereignty, empowering users to safely leverage the full potential of personalized AI.

[322] Dialogue Model Optimization via Agent Game and Adaptive Tree-based GRPO

Kun Peng, Conghui Tan, Yu Liu, Guohua Tang, Zhongqian Sun, Wei Yang, Zining Zhu, Lei Jiang, Yanbing Liu, Hao Peng

Main category: cs.AI

TL;DR: A novel long-horizon RL framework for open-ended dialogue agents using Adaptive Tree-based Group Relative Policy Optimization (AT-GRPO) with two-agent game paradigm for personalized interactions.

Motivation: Existing dialogue agents have limitations: over-reliance on pre-collected user data and short-horizon biases in RL that neglect long-term dialogue value. Need for better personalization and long-term engagement.

Method: Two-agent game paradigm with user agent for style mimicry and active termination prediction. AT-GRPO treats dialogue trajectories as trees with adaptive observation ranges - larger ranges for early topic exploration, smaller for late-stage maintenance, reducing computational complexity.

Result: Extensive experiments show superior performance, sample efficiency, and robustness compared to existing methods.

Conclusion: The proposed framework effectively addresses long-horizon dialogue optimization and personalization challenges with improved computational efficiency.

Abstract: Open-ended dialogue agents aim to deliver engaging, personalized interactions by adapting to users’ traits, but existing methods face critical limitations: over-reliance on pre-collected user data, and short-horizon biases in reinforcement learning (RL) that neglect long-term dialogue value. To address these, we propose a novel long-horizon RL framework integrating online personalization with Adaptive Tree-based Group Relative Policy Optimization (AT-GRPO). Adopting a two-agent game paradigm, a user agent constructs dynamic environments via style mimicry (learning user-specific conversational traits) and active termination (predicting turn-level termination probabilities as immediate rewards), forming an iterative cycle that drives the dialogue agent to deepen interest exploration. AT-GRPO reinterprets dialogue trajectories as trees and introduces adaptive observation ranges. Unlike full tree expansion that incurs exponential overhead, it limits each node to aggregate rewards from a stage-aware range: larger ranges support early-stage topic exploration, while smaller ranges facilitate late-stage dialogue maintenance. This design reduces rollout budgets from exponential to polynomial in the dialogue length, while preserving long-term reward capture. Extensive experiments show our framework’s superior performance, sample efficiency, and robustness.
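A rough numerical illustration of the stage-aware observation range: early turns average rewards over a wide future window (topic exploration), late turns over a narrow one (dialogue maintenance). The window schedule below is invented for illustration; the paper's exact rule is not given in this summary.

```python
# Stage-aware reward aggregation over a dialogue trajectory: each turn
# averages rewards from a future window whose size shrinks with stage.
# The shrink schedule is illustrative, not AT-GRPO's actual rule.

def stage_aware_returns(turn_rewards: list[float]) -> list[float]:
    T = len(turn_rewards)
    returns = []
    for t in range(T):
        remaining = T - t
        # Wide observation range early in the dialogue, narrow range late.
        window = max(1, int(remaining * (1.0 - t / T)))
        future = turn_rewards[t : t + window]
        returns.append(sum(future) / len(future))
    return returns

print(stage_aware_returns([0.1, 0.3, 0.0, 0.5, 0.4]))
```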

[323] PRISM: A Principled Framework for Multi-Agent Reasoning via Gain Decomposition

Yiming Yang, Zhuoyuan Li, Fanxiang Zeng, Hao Fu, Yue Liu

Main category: cs.AI

TL;DR: PRISM is a theoretical framework and system for multi-agent LLM reasoning that decomposes gains into Exploration, Information, and Aggregation dimensions, achieving SOTA performance with better compute efficiency.

Motivation: Existing multi-agent collaboration approaches for LLMs are heuristic and lack principled understanding of what drives performance gains, making systematic optimization difficult. There's a need to understand why multi-agent reasoning outperforms single-agent and which design choices matter most.

Method: Introduces a theoretical framework decomposing multi-agent reasoning gains into three dimensions: Exploration (diverse solution coverage), Information (high-fidelity feedback), and Aggregation (principled consensus). Proposes PRISM framework with role-based diversity, execution-grounded feedback with evidence-based cross-evaluation, and iterative synthesis with closed-loop validation.

Result: PRISM achieves state-of-the-art performance across mathematical reasoning, code generation, and function calling benchmarks with superior compute-efficiency compared to methods optimizing only partial dimensions.

Conclusion: The theoretical framework provides actionable design principles for future multi-agent reasoning systems, showing that jointly optimizing all three dimensions (Exploration, Information, Aggregation) leads to better performance and efficiency.

Abstract: Multi-agent collaboration has emerged as a promising paradigm for enhancing reasoning capabilities of Large Language Models (LLMs). However, existing approaches remain largely heuristic, lacking principled guidance on what drives performance gains and how to systematically optimize multi-agent reasoning. Specifically, it remains unclear why multi-agent collaboration outperforms single-agent reasoning and which design choices contribute most to these gains, making it difficult to build better systems. We address this gap by introducing a unified theoretical framework that decomposes multi-agent reasoning gains into three conceptually independent dimensions: Exploration for diverse solution coverage, Information for high-fidelity feedback, and Aggregation for principled consensus. Through this lens, existing methods can be understood as special cases that optimize only subsets of these dimensions. Building upon this decomposition, a novel framework called PRISM (Propose-Review-Integrate Synthesis for Multi-agent Reasoning) is proposed, which jointly maximizes all three dimensions through role-based diversity, execution-grounded feedback with evidence-based cross-evaluation, and iterative synthesis with closed-loop validation. Extensive experiments across mathematical reasoning, code generation, and function calling benchmarks demonstrate that PRISM achieves state-of-the-art performance with superior compute-efficiency compared to methods optimizing partial dimensions. The theoretical framework provides actionable design principles for future multi-agent reasoning systems.

[324] GEBench: Benchmarking Image Generation Models as GUI Environments

Haodong Li, Jingwei Wu, Quan Sun, Guopeng Li, Juanxi Tian, Huanyu Zhang, Yanlin Lai, Ruichuan An, Hongbo Peng, Yuhong Dai, Chenxi Li, Chunmei Qing, Jia Wang, Ziyang Meng, Zheng Ge, Xiangyu Zhang, Daxin Jiang

Main category: cs.AI

TL;DR: GEBench is a benchmark for evaluating dynamic GUI state generation and temporal coherence, with 700 samples across 5 task categories and a novel 5D metric called GE-Score.

Motivation: Existing benchmarks focus on general visual fidelity but lack evaluation of state transitions and temporal coherence in GUI-specific contexts, creating a gap for assessing dynamic interaction capabilities.

Method: Created GEBench with 700 curated samples spanning 5 task categories (single-step interactions, multi-step trajectories, grounding point localization). Proposed GE-Score metric with 5 dimensions: Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality.

Result: Current models perform well on single-step transitions but struggle significantly with maintaining temporal coherence and spatial grounding over longer interaction sequences. Icon interpretation, text rendering, and localization precision identified as critical bottlenecks.

Conclusion: GEBench provides foundation for systematic assessment of generative GUI environments and suggests directions for future research to improve temporal coherence and spatial grounding in GUI generation models.

Abstract: Recent advancements in image generation models have enabled the prediction of future Graphical User Interface (GUI) states based on user instructions. However, existing benchmarks primarily focus on general domain visual fidelity, leaving the evaluation of state transitions and temporal coherence in GUI-specific contexts underexplored. To address this gap, we introduce GEBench, a comprehensive benchmark for evaluating dynamic interaction and temporal coherence in GUI generation. GEBench comprises 700 carefully curated samples spanning five task categories, covering both single-step interactions and multi-step trajectories across real-world and fictional scenarios, as well as grounding point localization. To support systematic evaluation, we propose GE-Score, a novel five-dimensional metric that assesses Goal Achievement, Interaction Logic, Content Consistency, UI Plausibility, and Visual Quality. Extensive evaluations on current models indicate that while they perform well on single-step transitions, they struggle significantly with maintaining temporal coherence and spatial grounding over longer interaction sequences. Our findings identify icon interpretation, text rendering, and localization precision as critical bottlenecks. This work provides a foundation for systematic assessment and suggests promising directions for future research toward building high-fidelity generative GUI environments. The code is available at: https://github.com/stepfun-ai/GEBench.
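As a sketch of how a five-dimensional metric like GE-Score can be aggregated (the equal weighting below is an assumption; the benchmark may combine dimensions differently):

```python
# Hedged sketch of aggregating the five GE-Score dimensions into one
# overall score; equal weighting is assumed for illustration.
from dataclasses import dataclass

@dataclass
class GEScore:
    goal_achievement: float
    interaction_logic: float
    content_consistency: float
    ui_plausibility: float
    visual_quality: float

    def overall(self) -> float:
        dims = [self.goal_achievement, self.interaction_logic,
                self.content_consistency, self.ui_plausibility,
                self.visual_quality]
        return sum(dims) / len(dims)

print(GEScore(0.9, 0.7, 0.8, 0.6, 0.9).overall())
```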

cs.SD

[325] DSFlow: Dual Supervision and Step-Aware Architecture for One-Step Flow Matching Speech Synthesis

Bin Lin, Peng Yang, Chao Yan, Xiaochen Liu, Wei Wang, Boyong Wu, Pengfei Tan, Xuerui Yang

Main category: cs.SD

TL;DR: DSFlow: A modular distillation framework for efficient few-step and one-step text-to-speech synthesis that addresses process variance and parameter inefficiency in flow-matching models.

Motivation: Flow-matching models produce high-quality TTS but have high computational costs due to iterative sampling. Existing distillation methods suffer from process variance from endpoint error accumulation and inefficient parameter usage when adapting continuous-time architectures to discrete, fixed-step generation.

Method: DSFlow reformulates generation as discrete prediction, uses dual supervision (endpoint matching + deterministic mean-velocity alignment) for training stability, and replaces continuous-time timestep conditioning with lightweight step-aware tokens for parameter efficiency.

Result: Extensive experiments show DSFlow outperforms standard distillation approaches, achieving strong few-step and one-step synthesis quality while reducing model parameters and inference cost across diverse flow-based TTS architectures.

Conclusion: DSFlow provides an effective modular distillation framework that addresses key challenges in efficient TTS synthesis, enabling high-quality generation with reduced computational requirements.

Abstract: Flow-matching models have enabled high-quality text-to-speech synthesis, but their iterative sampling process during inference incurs substantial computational cost. Although distillation is widely used to reduce the number of inference steps, existing methods often suffer from process variance due to endpoint error accumulation. Moreover, directly reusing continuous-time architectures for discrete, fixed-step generation introduces structural parameter inefficiencies. To address these challenges, we introduce DSFlow, a modular distillation framework for few-step and one-step synthesis. DSFlow reformulates generation as a discrete prediction task and explicitly adapts the student model to the target inference regime. It improves training stability through a dual supervision strategy that combines endpoint matching with deterministic mean-velocity alignment, enforcing consistent generation trajectories across inference steps. In addition, DSFlow improves parameter efficiency by replacing continuous-time timestep conditioning with lightweight step-aware tokens, aligning model capacity with the significantly reduced timestep space of the discrete task. Extensive experiments across diverse flow-based text-to-speech architectures demonstrate that DSFlow consistently outperforms standard distillation approaches, achieving strong few-step and one-step synthesis quality while reducing model parameters and inference cost.

[326] The SJTU X-LANCE Lab System for MSR Challenge 2025

Jinxuan Zhu, Hao Qiu, Haina Zhu, Jianwei Yu, Kai Yu, Xie Chen

Main category: cs.SD

TL;DR: A music source restoration system using sequential BS-RoFormers for source separation, denoising, and dereverberation that achieved top performance in the MSR Challenge 2025.

Motivation: To develop an effective system for the Music Source Restoration Challenge 2025 that can handle multiple tasks including music source separation, denoising, and dereverberation for 8 different instruments.

Method: Uses sequential BS-RoFormers (each handling a single task: MSS, denoise, dereverb), leverages pretrained MSS checkpoints, and employs training schemes including dataset mixing/cleaning, random mixture augmentation, and audio length scale-up.

Result: Achieved first rank in all three subjective and three objective evaluation metrics, with MMSNR score of 4.4623 and FAD score of 0.1988.

Conclusion: The proposed sequential BS-RoFormer approach with comprehensive training strategies is highly effective for music source restoration tasks, achieving state-of-the-art performance in the challenge.

Abstract: This report describes the system submitted to the music source restoration (MSR) Challenge 2025. Our approach is composed of sequential BS-RoFormers, each dealing with a single task including music source separation (MSS), denoise and dereverb. To support 8 instruments given in the task, we utilize pretrained checkpoints from MSS community and finetune the MSS model with several training schemes, including (1) mixing and cleaning of datasets; (2) random mixture of music pieces for data augmentation; (3) scale-up of audio length. Our system achieved the first rank in all three subjective and three objective evaluation metrics, including an MMSNR score of 4.4623 and an FAD score of 0.1988. We have open-sourced all the code and checkpoints at https://github.com/ModistAndrew/xlance-msr.
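The cascade structure reduces to a simple composition of three stage models. A minimal sketch, with placeholder functions standing in for the three BS-RoFormer checkpoints:

```python
# Sequential restoration cascade: separate first, then denoise, then
# dereverb each stem. The three model calls are placeholders for the
# system's BS-RoFormer checkpoints.
import numpy as np

def separate(mix: np.ndarray, n_stems: int = 8) -> list[np.ndarray]:
    return [mix / n_stems for _ in range(n_stems)]  # placeholder MSS model

def denoise(stem: np.ndarray) -> np.ndarray:
    return stem  # placeholder denoising model

def dereverb(stem: np.ndarray) -> np.ndarray:
    return stem  # placeholder dereverberation model

def restore(mix: np.ndarray) -> list[np.ndarray]:
    return [dereverb(denoise(s)) for s in separate(mix)]

stems = restore(np.zeros(44100, dtype=np.float32))
print(len(stems))  # 8 restored instrument stems
```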

[327] Stemphonic: All-at-once Flexible Multi-stem Music Generation

Shih-Lun Wu, Ge Zhu, Juan-Pablo Caceres, Cheng-Zhi Anna Huang, Nicholas J. Bryan

Main category: cs.SD

TL;DR: Stemphonic: A diffusion/flow-based framework for generating synchronized music stems in one inference pass with flexible stem combinations and temporal controls.

Motivation: Existing music stem generation approaches have limitations: fixed architectures output predefined stems in parallel, while sequential generation is slow. There's a need for flexible, efficient multi-stem generation that aligns with musician workflows.

Method: Treats each stem as batch element, groups synchronized stems in batches, applies shared noise latent to each group during training. At inference, uses shared initial noise latent with stem-specific text inputs to generate synchronized multi-stem outputs in one pass. Supports conditional generation and stem-wise activity controls.

Result: Produces higher-quality outputs while accelerating full mix generation by 25-50% compared to existing approaches. Benchmarked on multiple open-source stem evaluation sets.

Conclusion: Stemphonic overcomes the trade-off between flexibility and efficiency in music stem generation, enabling one-pass generation of variable synchronized stems with user controls for iterative composition workflows.

Abstract: Music stem generation, the task of producing musically-synchronized and isolated instrument audio clips, offers the potential of greater user control and better alignment with musician workflows compared to conventional text-to-music models. Existing stem generation approaches, however, either rely on fixed architectures that output a predefined set of stems in parallel, or generate only one stem at a time, resulting in slow inference despite flexibility in stem combination. We propose Stemphonic, a diffusion-/flow-based framework that overcomes this trade-off and generates a variable set of synchronized stems in one inference pass. During training, we treat each stem as a batch element, group synchronized stems in a batch, and apply a shared noise latent to each group. At inference-time, we use a shared initial noise latent and stem-specific text inputs to generate synchronized multi-stem outputs in one pass. We further expand our approach to enable one-pass conditional multi-stem generation and stem-wise activity controls to empower users to iteratively generate and orchestrate the temporal layering of a mix. We benchmark our results on multiple open-source stem evaluation sets and show that Stemphonic produces higher-quality outputs while accelerating the full mix generation process by 25 to 50%. Demos at: https://stemphonic-demo.vercel.app.
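A minimal torch sketch of the shared-noise idea: one initial latent is broadcast across the stems of a group while conditioning stays stem-specific, so a single batched pass yields synchronized outputs. Shapes and the `denoise_step` update are illustrative placeholders.

```python
# Shared initial noise latent across stems, stem-specific text conditioning;
# one batched denoising loop produces synchronized multi-stem outputs.
import torch

n_stems, latent_dim, T = 4, 64, 256
shared_noise = torch.randn(1, latent_dim, T).expand(n_stems, -1, -1)
stem_text = torch.randn(n_stems, latent_dim)  # per-stem prompt embeddings

def denoise_step(z: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    # Placeholder for one flow/diffusion update conditioned on text.
    return z - 0.1 * z + 0.01 * cond.unsqueeze(-1)

z = shared_noise.clone()
for _ in range(8):
    z = denoise_step(z, stem_text)
print(z.shape)  # (4, 64, 256): four synchronized stems in one pass
```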

[328] NarraScore: Bridging Visual Narrative and Musical Dynamics via Hierarchical Affective Control

Yufan Wen, Zhaocheng Liu, YeGuo Hua, Ziyi Guo, Lihua Zhang, Chun Yuan, Jian Wu

Main category: cs.SD

TL;DR: NarraScore: A hierarchical framework for synthesizing coherent soundtracks for long-form videos using emotion as narrative logic compression, leveraging frozen VLMs as affective sensors and dual-branch injection for global-local balance.

Motivation: Current long-form video soundtrack synthesis faces three critical challenges: computational scalability, temporal coherence, and semantic blindness to evolving narrative logic. Existing methods struggle to maintain narrative alignment throughout extended video sequences.

Method: Proposes NarraScore framework that uses emotion as high-density compression of narrative logic. Repurposes frozen Vision-Language Models (VLMs) as continuous affective sensors to distill visual streams into Valence-Arousal trajectories. Employs Dual-Branch Injection: Global Semantic Anchor for stylistic stability and Token-Level Affective Adapter for local tension modulation via element-wise residual injection.

Result: Achieves state-of-the-art consistency and narrative alignment with negligible computational overhead. Demonstrates effective mitigation of overfitting risks associated with data scarcity. Establishes fully autonomous paradigm for long-video soundtrack generation.

Conclusion: NarraScore successfully bridges critical gaps in long-form video soundtrack synthesis by leveraging emotion as narrative logic proxy, using frozen VLMs as affective sensors, and implementing efficient dual-branch injection strategy for scalable, coherent audio generation.

Abstract: Synthesizing coherent soundtracks for long-form videos remains a formidable challenge, currently stalled by three critical impediments: computational scalability, temporal coherence, and, most critically, a pervasive semantic blindness to evolving narrative logic. To bridge these gaps, we propose NarraScore, a hierarchical framework predicated on the core insight that emotion serves as a high-density compression of narrative logic. Uniquely, we repurpose frozen Vision-Language Models (VLMs) as continuous affective sensors, distilling high-dimensional visual streams into dense, narrative-aware Valence-Arousal trajectories. Mechanistically, NarraScore employs a Dual-Branch Injection strategy to reconcile global structure with local dynamism: a Global Semantic Anchor ensures stylistic stability, while a surgical Token-Level Affective Adapter modulates local tension via direct element-wise residual injection. This minimalist design bypasses the bottlenecks of dense attention and architectural cloning, effectively mitigating the overfitting risks associated with data scarcity. Experiments demonstrate that NarraScore achieves state-of-the-art consistency and narrative alignment with negligible computational overhead, establishing a fully autonomous paradigm for long-video soundtrack generation.

[329] Gencho: Room Impulse Response Generation from Reverberant Speech and Text via Diffusion Transformers

Jackie Lin, Jiaqi Su, Nishit Anand, Zeyu Jin, Minje Kim, Paris Smaragdis

Main category: cs.SD

TL;DR: Gencho: A diffusion-transformer model for blind room impulse response estimation and generation from reverberant speech, enabling controllable acoustic simulation.

Motivation: Existing blind RIR estimation methods have limited modeling capability and degraded performance under unseen conditions, while emerging generative audio applications require more flexible impulse response generation methods.

Method: Uses a diffusion-transformer-based model with structure-aware encoder that leverages isolation between early and late reflections to encode input audio, and a diffusion decoder to generate diverse, perceptually realistic impulse responses from the encoded representation.

Result: Generates richer RIRs than non-generative baselines while maintaining strong performance in standard RIR metrics, and demonstrates application to text-conditioned RIR generation.

Conclusion: Gencho offers versatile controllable acoustic simulation and generative audio capabilities, integrating modularly with standard speech processing pipelines for acoustic matching.

Abstract: Blind room impulse response (RIR) estimation is a core task for capturing and transferring acoustic properties; yet existing methods often suffer from limited modeling capability and degraded performance under unseen conditions. Moreover, emerging generative audio applications call for more flexible impulse response generation methods. We propose Gencho, a diffusion-transformer-based model that predicts complex spectrogram RIRs from reverberant speech. A structure-aware encoder leverages isolation between early and late reflections to encode the input audio into a robust representation for conditioning, while the diffusion decoder generates diverse and perceptually realistic impulse responses from it. Gencho integrates modularly with standard speech processing pipelines for acoustic matching. Results show richer generated RIRs than non-generative baselines while maintaining strong performance in standard RIR metrics. We further demonstrate its application to text-conditioned RIR generation, highlighting Gencho’s versatility for controllable acoustic simulation and generative audio tasks.

[330] Covo-Audio Technical Report

Wenfu Wang, Chenxing Li, Liqiang Zhang, Yiyang Zhao, Yuxiang Zou, Hanzhao Li, Mingyu Cui, Hao Zhang, Kun Wei, Le Xu, Zikang Huang, Jiajun Xu, Jiliang Hu, Xiang He, Zeyu Xie, Jiawen Kang, Youjun Chen, Meng Yu, Dong Yu, Rilin Chen, Linlin Di, Shulin Feng, Na Hu, Yang Liu, Bang Wang, Shan Yang

Main category: cs.SD

TL;DR: Covo-Audio is a 7B-parameter end-to-end LALM that processes continuous audio inputs and generates audio outputs in a unified architecture, achieving SOTA performance across speech-text modeling, spoken dialogue, audio understanding, and full-duplex voice interaction tasks.

Motivation: To develop a unified audio-language model that can directly process and generate audio in end-to-end fashion, addressing the need for models that integrate sophisticated audio intelligence with high-level semantic reasoning at a practical scale.

Method: Uses large-scale curated pretraining and targeted post-training on a 7B-parameter architecture; introduces intelligence-speaker decoupling strategy to separate dialogue intelligence from voice rendering for flexible voice customization with minimal TTS data.

Result: Achieves state-of-the-art or competitive performance across multiple benchmarks for speech-text comprehension, semantic reasoning, spoken dialogue, and audio understanding; demonstrates strong conversational abilities and full-duplex interaction capabilities.

Conclusion: 7B-scale models show strong potential for integrating audio intelligence with semantic reasoning, suggesting a scalable path toward more capable and versatile LALMs for real-world conversational applications.

Abstract: In this work, we present Covo-Audio, a 7B-parameter end-to-end LALM that directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Through large-scale curated pretraining and targeted post-training, Covo-Audio achieves state-of-the-art or competitive performance among models of comparable scale across a broad spectrum of tasks, including speech-text modeling, spoken dialogue, speech understanding, audio understanding, and full-duplex voice interaction. Extensive evaluations demonstrate that the pretrained foundation model exhibits strong speech-text comprehension and semantic reasoning capabilities on multiple benchmarks, outperforming representative open-source models of comparable scale. Furthermore, Covo-Audio-Chat, the dialogue-oriented variant, demonstrates strong spoken conversational abilities, including understanding, contextual reasoning, instruction following, and generating contextually appropriate and empathetic responses, validating its applicability to real-world conversational assistant scenarios. Covo-Audio-Chat-FD, the evolved full-duplex model, achieves substantially superior performance on both spoken dialogue capabilities and full-duplex interaction behaviors, demonstrating its competence in practical robustness. To mitigate the high cost of deploying end-to-end LALMs for natural conversational systems, we propose an intelligence-speaker decoupling strategy that separates dialogue intelligence from voice rendering, enabling flexible voice customization with minimal text-to-speech (TTS) data while preserving dialogue performance. Overall, our results highlight the strong potential of 7B-scale models to integrate sophisticated audio intelligence with high-level semantic reasoning, and suggest a scalable path toward more capable and versatile LALMs.

[331] Evaluating Disentangled Representations for Controllable Music Generation

Laura Ibáñez-Martínez, Chukwuemeka Nkama, Andrea Poltronieri, Xavier Serra, Martín Rocamora

Main category: cs.SD

TL;DR: Analysis of disentangled representations in music audio models reveals inconsistencies between intended and actual semantics, questioning current approaches to controllable music generation.

Motivation: Recent music generation approaches use disentangled representations (structure/timbre, local/global) for controllable synthesis, but the underlying properties of these embeddings remain underexplored, requiring systematic evaluation.

Method: Probing-based framework evaluating diverse unsupervised disentanglement strategies (inductive biases, data augmentations, adversarial objectives, staged training) across four axes: informativeness, equivariance, invariance, and disentanglement, assessed across datasets, tasks, and controlled transformations.

Result: Findings reveal inconsistencies between intended and actual semantics of embeddings, suggesting current strategies fall short of producing truly disentangled representations.

Conclusion: Current disentanglement approaches in music generation are insufficient, prompting a re-examination of how controllability is approached in this domain.

Abstract: Recent approaches in music generation rely on disentangled representations, often labeled as structure and timbre or local and global, to enable controllable synthesis. Yet the underlying properties of these embeddings remain underexplored. In this work, we evaluate such disentangled representations in a set of music audio models for controllable generation using a probing-based framework that goes beyond standard downstream tasks. The selected models reflect diverse unsupervised disentanglement strategies, including inductive biases, data augmentations, adversarial objectives, and staged training procedures. We further isolate specific strategies to analyze their effect. Our analysis spans four key axes: informativeness, equivariance, invariance, and disentanglement, which are assessed across datasets, tasks, and controlled transformations. Our findings reveal inconsistencies between intended and actual semantics of the embeddings, suggesting that current strategies fall short of producing truly disentangled representations, and prompting a re-examination of how controllability is approached in music generation.
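A minimal example of the probing idea on synthetic data: freeze the embeddings and test how well a linear model reads a factor out of each designated subspace. In the paper the embeddings and factors come from the audio models under study; everything below is a toy stand-in.

```python
# Linear-probe sketch: if the "structure" half of an embedding truly
# excludes a factor, a probe trained on it should stay near chance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 128))                   # frozen embeddings
labels = (emb[:, :64].mean(axis=1) > 0).astype(int)  # toy factor in first half

for name, sl in [("first half", slice(0, 64)), ("second half", slice(64, 128))]:
    Xtr, Xte, ytr, yte = train_test_split(emb[:, sl], labels, random_state=0)
    acc = LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)
    print(f"{name}: probe accuracy {acc:.2f}")  # high vs. near-chance
```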

[332] Bayesian Speech Synthesizers Can Learn from Multiple Teachers

Ziyang Zhang, Yifan Gao, Xuenan Xu, Baoxiang Li, Wen Wu, Chao Zhang

Main category: cs.SD

TL;DR: BELLE is a Bayesian evidential learning framework for TTS that models speech uncertainty via Normal-Inverse-Gamma distributions, enabling high-quality streaming generation with significantly less training data than current models.

Motivation: Current TTS models treat speech generation as deterministic regression, ignoring the inherent "one-to-many" uncertainty in speech production. While autoregressive models show promise, they use fixed-variance priors that constrain generation to static point estimates, failing to capture natural speech variability.

Method: BELLE shifts from deterministic prediction to Bayesian inference by modeling acoustic targets as Normal-Inverse-Gamma distributions to capture data-dependent aleatoric uncertainty. It introduces a “one-to-many” training strategy using synthetic samples as statistical support sets to learn robust distributional properties rather than imitating teacher artifacts.

Result: BELLE trained on only ~5k hours of data outperforms leading open-source models trained on 50k hours, achieving a 25.8% relative WER reduction. It naturally supports high-quality streaming generation and demonstrates superior performance with significantly less training data.

Conclusion: BELLE successfully bridges the gap between deterministic TTS models and the inherent uncertainty of speech generation, providing a principled Bayesian framework that enables high-quality speech synthesis with improved efficiency and natural variability.

Abstract: Text-to-Speech (TTS) is inherently a “one-to-many” mapping characterized by intrinsic uncertainty, yet current paradigms often oversimplify it into a deterministic regression task. While continuous-valued autoregressive (AR) models have recently emerged as a promising alternative to discrete codec-based approaches, they typically rely on a fixed-variance prior, fundamentally constraining generation to a static point estimate that ignores the dynamic variability of natural speech. To bridge this gap, we propose BELLE (Bayesian evidential learning with language modelling), a framework that shifts from deterministic prediction to principled Bayesian inference without increasing model parameters or inference latency. By modeling the acoustic target as a Normal-Inverse-Gamma distribution, BELLE captures data-dependent aleatoric uncertainty. To enable accurate variance estimation on standard single-reference datasets, we introduce a “one-to-many” training strategy that leverages synthetic samples as a statistical support set, allowing the model to learn robust distributional properties rather than merely imitating teacher artifacts. Experiments demonstrate that BELLE, trained on only ~5k hours of data, outperforms leading open-source models trained on 50k hours (achieving a 25.8% relative WER reduction) and naturally supports high-quality streaming generation. Audio samples are available at https://belletts.github.io/Belle/.
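For context, a Normal-Inverse-Gamma output head is typically trained with the evidential negative log-likelihood of the NIG's Student-t marginal (as in deep evidential regression). A hedged torch sketch follows; whether BELLE uses exactly this loss is not stated in the abstract.

```python
# NIG output head plus the standard evidential NLL of the Student-t
# marginal; hidden size and target dimension are illustrative.
import torch
import torch.nn.functional as F

def nig_head(h: torch.Tensor, proj: torch.nn.Linear):
    gamma, log_nu, log_alpha, log_beta = proj(h).chunk(4, dim=-1)
    nu = F.softplus(log_nu)
    alpha = F.softplus(log_alpha) + 1.0  # keep alpha > 1
    beta = F.softplus(log_beta)
    return gamma, nu, alpha, beta

def nig_nll(y, gamma, nu, alpha, beta):
    omega = 2.0 * beta * (1.0 + nu)
    return (0.5 * torch.log(torch.pi / nu)
            - alpha * torch.log(omega)
            + (alpha + 0.5) * torch.log(nu * (y - gamma) ** 2 + omega)
            + torch.lgamma(alpha) - torch.lgamma(alpha + 0.5)).mean()

proj = torch.nn.Linear(16, 4 * 1)  # hidden dim 16, 1-dim acoustic target
h, y = torch.randn(8, 16), torch.randn(8, 1)
print(nig_nll(y, *nig_head(h, proj)))
```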

[333] No Word Left Behind: Mitigating Prefix Bias in Open-Vocabulary Keyword Spotting

Yi Liu, Chuan-Che Jeff Huang, Xiao Quan

Main category: cs.SD

TL;DR: The paper addresses prefix bias in open-vocabulary keyword spotting by introducing a benchmark for partial overlap cases and proposing a lightweight scoring method to improve discrimination of similar-sounding commands.

Motivation: Existing open-vocabulary keyword spotting systems show bias toward beginning phonemes, causing false triggers when commands share prefixes (e.g., "turn the volume up" vs. "turn the volume down"). This prefix bias stems from training data limitations and position-biased cross-modal scoring.

Method: Introduces Partial Overlap Benchmark (POB) with two datasets (POB-Spark and POB-LibriPhrase) containing mismatched audio-text pairs with shared prefixes. Proposes Equal-weighting Position Scoring (EPS), a lightweight decision layer to address position bias in cross-modal scoring.

Result: EPS alone reduces EER on POB-Spark from 64.4% to 29.3% and improves POB-LibriPhrase accuracy from 87.6% to 96.8%, while maintaining performance on standard benchmarks. With POB data added to training, achieves best POB benchmark results with minimal degradation on prior metrics.

Conclusion: The work successfully addresses prefix bias in open-vocabulary keyword spotting through benchmark creation and lightweight scoring improvements, though trade-offs with single-word command datasets remain a challenge for future work.

Abstract: Open-vocabulary keyword spotting (OV-KWS) enables personalized device control via arbitrary voice commands. Recently, researchers have explored using audio-text joint embeddings, allowing users to enroll phrases with text, and proposed techniques to disambiguate similar utterances. We find that existing OV-KWS solutions often overly bias the beginning phonemes of an enrollment, causing false triggers when negative enrollment-query pairs share a prefix ("turn the volume up" vs. "turn the volume down"). We trace this to two factors: training data bias and position-biased cross-modal scoring. To address these limitations, we introduce the Partial Overlap Benchmark (POB) with two datasets, POB-Spark and POB-LibriPhrase (POB-LP), containing mismatched audio-text pairs with shared prefixes, and propose Equal-weighting Position Scoring (EPS), a lightweight decision layer. Using EPS alone reduces EER on POB-Spark from 64.4% to 29.3% and improves POB-LP accuracy from 87.6% to 96.8%, while maintaining performance on LibriPhrase and Google Speech Commands (GSC). With POB data added in training, our work achieves the best POB benchmark results while incurring the least amount of degradation on prior metrics among baselines. This degradation is most pronounced in GSC, which contains only one-word commands. We surface mitigating this trade-off as future work.
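A toy contrast between prefix-biased and equal-weighting scoring shows why the latter catches a mismatch at the end of a phrase. The actual EPS decision layer is not specified in this summary; the sketch only illustrates the equal-weighting idea the method is named after.

```python
# Prefix-biased vs. equal-position weighting of per-position match scores.
import numpy as np

def prefix_biased(scores: np.ndarray) -> float:
    w = np.exp(-0.5 * np.arange(len(scores)))  # early positions dominate
    return float((w * scores).sum() / w.sum())

def equal_position(scores: np.ndarray) -> float:
    return float(scores.mean())  # every position counts equally

# Query "turn the volume up" vs. enrollment "turn the volume down":
# positions match except the last word.
scores = np.array([0.95, 0.9, 0.92, 0.1])
print(prefix_biased(scores), equal_position(scores))
# The prefix-biased score stays high despite the final-word mismatch;
# the equal-weighted score is pulled down, avoiding a false trigger.
```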

cs.LG

[334] Enhanced Graph Transformer with Serialized Graph Tokens

Ruixiang Wang, Yuyang Hong, Shiming Xiang, Chunhong Pan

Main category: cs.LG

TL;DR: Proposes a serialized token paradigm for graph transformers to overcome information bottleneck in graph-level representation learning, achieving SOTA results on graph benchmarks.

Motivation: Existing graph transformer methods face an information bottleneck when generating graph-level representations. The prevalent single token paradigm fails to fully leverage self-attention's strength in encoding token sequences and degenerates into a weighted sum of node signals.

Method: Designs a novel serialized token paradigm: 1) Graph serialization method aggregates node signals into serialized graph tokens with automatic positional encoding, 2) Stacked self-attention layers encode the token sequence to capture internal dependencies, enabling modeling of complex interactions among multiple graph tokens.

Result: Achieves state-of-the-art results on several graph-level benchmarks. Ablation studies verify the effectiveness of the proposed modules.

Conclusion: The serialized token paradigm yields more expressive graph representations by better leveraging self-attention mechanisms for graph-level tasks, overcoming limitations of single token approaches.

Abstract: Transformers have demonstrated success in graph learning, particularly for node-level tasks. However, existing methods encounter an information bottleneck when generating graph-level representations. The prevalent single token paradigm fails to fully leverage the inherent strength of self-attention in encoding token sequences, and degenerates into a weighted sum of node signals. To address this issue, we design a novel serialized token paradigm to encapsulate global signals more effectively. Specifically, a graph serialization method is proposed to aggregate node signals into serialized graph tokens, with positional encoding being automatically involved. Then, stacked self-attention layers are applied to encode this token sequence and capture its internal dependencies. Our method can yield more expressive graph representations by modeling complex interactions among multiple graph tokens. Experimental results show that our method achieves state-of-the-art results on several graph-level benchmarks. Ablation studies verify the effectiveness of the proposed modules.
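A hedged sketch of the serialized-token idea: pool node signals into an ordered sequence of graph tokens, add positions, and encode the sequence with stacked self-attention. The sorted-bucketing serialization below is a stand-in for the paper's method.

```python
# Serialize nodes into K graph tokens, then encode the token sequence.
import torch
import torch.nn as nn

def serialize(x: torch.Tensor, k: int) -> torch.Tensor:
    # x: (num_nodes, d). Order nodes by a scalar signal, split into k
    # buckets, and mean-pool each bucket into one graph token.
    order = x.sum(dim=-1).argsort()
    buckets = torch.chunk(x[order], k, dim=0)
    return torch.stack([b.mean(dim=0) for b in buckets])  # (k, d)

d, k = 32, 8
tokens = serialize(torch.randn(100, d), k)
tokens = tokens + torch.arange(k).float().unsqueeze(-1) * 0.01  # toy positions
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True),
    num_layers=2,
)
graph_repr = encoder(tokens.unsqueeze(0)).mean(dim=1)  # (1, d) graph embedding
print(graph_repr.shape)
```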

[335] Spectral Disentanglement and Enhancement: A Dual-domain Contrastive Framework for Representation Learning

Jinjin Guo, Yexin Li, Zhichao Huang, Jun Fang, Zhiyuan Liu, Chao Liu, Pengzhang Liu, Qixia Jiang

Main category: cs.LG

TL;DR: Spectral Disentanglement and Enhancement (SDE) is a framework that addresses spectral imbalance in multimodal contrastive learning by partitioning features into strong/weak signals and noise, then applying curriculum-based enhancement and dual-domain contrastive loss for better representation learning.

Motivation: Current multimodal contrastive learning suffers from uniform treatment of feature dimensions and neglect of intrinsic spectral structure, leading to embeddings collapsing into narrow cones where task-relevant semantics concentrate in small subspaces while most dimensions contain noise and spurious correlations, undermining model generalization.

Method: Uses singular value decomposition to partition feature dimensions into strong signals (task-critical semantics), weak signals (ancillary correlations), and noise (irrelevant perturbations). Applies curriculum-based spectral enhancement to selectively amplify informative components, then introduces dual-domain contrastive loss that jointly optimizes alignment in both feature and spectral spaces.

Result: Extensive experiments on large-scale multimodal benchmarks demonstrate consistent improvements in representation robustness and generalization, outperforming state-of-the-art methods.

Conclusion: SDE bridges the gap between embedded space geometry and spectral properties, integrates seamlessly with existing contrastive pipelines, and offers an effective solution for multimodal representation learning with better spectral regularization.

Abstract: Large-scale multimodal contrastive learning has recently achieved impressive success in learning rich and transferable representations, yet it remains fundamentally limited by the uniform treatment of feature dimensions and the neglect of the intrinsic spectral structure of the learned features. Empirical evidence indicates that high-dimensional embeddings tend to collapse into narrow cones, concentrating task-relevant semantics in a small subspace, while the majority of dimensions remain occupied by noise and spurious correlations. Such spectral imbalance and entanglement undermine model generalization. We propose Spectral Disentanglement and Enhancement (SDE), a novel framework that bridges the gap between the geometry of the embedded spaces and their spectral properties. Our approach leverages singular value decomposition to adaptively partition feature dimensions into strong signals that capture task-critical semantics, weak signals that reflect ancillary correlations, and noise representing irrelevant perturbations. A curriculum-based spectral enhancement strategy is then applied, selectively amplifying informative components with theoretical guarantees on training stability. Building upon the enhanced features, we further introduce a dual-domain contrastive loss that jointly optimizes alignment in both the feature and spectral spaces, effectively integrating spectral regularization into the training process and encouraging richer, more robust representations. Extensive experiments on large-scale multimodal benchmarks demonstrate that SDE consistently improves representation robustness and generalization, outperforming state-of-the-art methods. SDE integrates seamlessly with existing contrastive pipelines, offering an effective solution for multimodal representation learning.
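A minimal torch sketch of the SVD-based partition and enhancement. The fixed index cutoffs and gain are illustrative; SDE chooses them adaptively with a curriculum.

```python
# Partition a batch of embeddings into strong/weak/noise subspaces via SVD,
# amplify the strong components, and suppress the noise subspace.
import torch

def spectral_enhance(z: torch.Tensor, n_strong: int, n_weak: int,
                     gain: float = 1.5) -> torch.Tensor:
    # z: (batch, d) embeddings, mean-centered before decomposition.
    u, s, vh = torch.linalg.svd(z - z.mean(dim=0), full_matrices=False)
    scale = torch.ones_like(s)
    scale[:n_strong] = gain              # amplify strong signal directions
    scale[n_strong + n_weak:] = 0.0      # zero out the noise subspace
    return u @ torch.diag(s * scale) @ vh

z = torch.randn(256, 64)
print(spectral_enhance(z, n_strong=8, n_weak=24).shape)  # (256, 64)
```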

[336] Learning to Remember, Learn, and Forget in Attention-Based Models

Djohan Bonnet, Jamie Lohoff, Jan Finkbeiner, Elidona Skhikerujah, Emre Neftci

Main category: cs.LG

TL;DR: Palimpsa is a self-attention model that treats in-context learning as a continual learning problem, using Bayesian metaplasticity to manage the stability-plasticity dilemma and expand memory capacity in transformers.

Motivation: Current gated linear attention models have fixed memory capacity and suffer from interference in long sequences, limiting their in-context learning capabilities. The paper aims to address this by viewing ICL as a continual learning problem requiring balance between stability (retaining old knowledge) and plasticity (learning new information).

Method: Proposes Palimpsa, a self-attention model using Bayesian metaplasticity where each attention state’s plasticity is tied to an importance state grounded by a prior distribution capturing accumulated knowledge. Shows that various gated linear attention models emerge as specific architecture choices and posterior approximations, with Mamba2 being a special case where forgetting dominates.

Result: Palimpsa consistently outperforms baselines on the Multi-Query Associative Recall (MQAR) benchmark and Commonsense Reasoning tasks, demonstrating expanded memory capacity and improved performance.

Conclusion: Palimpsa provides a theoretical framework connecting various attention models through Bayesian metaplasticity, enabling transformation of non-metaplastic models into metaplastic ones with significantly expanded memory capacity for improved in-context learning.

Abstract: In-Context Learning (ICL) in transformers acts as an online associative memory and is believed to underpin their high performance on complex sequence processing tasks. However, in gated linear attention models, this memory has a fixed capacity and is prone to interference, especially for long sequences. We propose Palimpsa, a self-attention model that views ICL as a continual learning problem that must address a stability-plasticity dilemma. Palimpsa uses Bayesian metaplasticity, where the plasticity of each attention state is tied to an importance state grounded by a prior distribution that captures accumulated knowledge. We demonstrate that various gated linear attention models emerge as specific architecture choices and posterior approximations, and that Mamba2 is a special case of Palimpsa where forgetting dominates. This theoretical link enables the transformation of any non-metaplastic model into a metaplastic one, significantly expanding its memory capacity. Our experiments show that Palimpsa consistently outperforms baselines on the Multi-Query Associative Recall (MQAR) benchmark and on Commonsense Reasoning tasks.
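An illustrative toy of the metaplasticity idea, where an importance state damps updates to memory entries that have accumulated evidence. This is a sketch of the stability-plasticity mechanism, not the paper's equations.

```python
# Metaplastic linear-attention toy: per-entry importance throttles how
# plastic each memory entry remains, instead of a fixed forgetting gate.
import torch

d_k, d_v = 16, 16
S = torch.zeros(d_v, d_k)      # associative memory state
imp = torch.zeros(d_v, d_k)    # importance (metaplasticity) state

for _ in range(100):
    k = torch.randn(d_k)
    v = torch.randn(d_v)
    update = torch.outer(v, k)
    lr = 1.0 / (1.0 + imp)     # plasticity shrinks where importance is high
    S = S + lr * update
    imp = imp + update.abs()   # frequently written entries become stable

print(S.shape, imp.mean())
```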

[337] In-Hospital Stroke Prediction from PPG-Derived Hemodynamic Features

Jiaming Liu, Cheng Ding, Daoqiang Zhang

Main category: cs.LG

TL;DR: Using LLM-assisted data mining to extract pre-stroke PPG data from hospitalized patients, researchers demonstrate that photoplethysmography waveforms contain predictive signatures of stroke 4-6 hours before onset, enabling proactive early warning systems.

Motivation: Standard clinical datasets lack pre-hospital physiological data for stroke prediction, leaving the predictive value of continuous monitoring signals like PPG unvalidated. The research aims to overcome this limitation by analyzing rare cases where stroke occurs during hospitalization while patients are under continuous monitoring.

Method: Developed an LLM-assisted data mining pipeline to extract precise in-hospital stroke onset timestamps from unstructured clinical notes in MIMIC-III and MC-MED datasets, followed by physician validation. Identified patients with synchronized pre-onset PPG data, extracted hemodynamic features from PPG, and employed a ResNet-1D model to predict impending stroke across multiple early-warning horizons.

Result: The model achieved F1-scores of 0.7956, 0.8759, and 0.9406 at 4, 5, and 6 hours prior to stroke onset on MIMIC-III. Without re-tuning, it reached 0.9256, 0.9595, and 0.9888 on MC-MED for the same horizons, demonstrating strong predictive performance.

Conclusion: This provides the first empirical evidence from real-world clinical data that PPG contains predictive signatures of stroke several hours before onset, supporting a shift from post-event recognition to proactive, physiology-based surveillance that could improve patient outcomes in routine clinical care.

Abstract: The absence of pre-hospital physiological data in standard clinical datasets fundamentally constrains the early prediction of stroke, as patients typically present only after stroke has occurred, leaving the predictive value of continuous monitoring signals such as photoplethysmography (PPG) unvalidated. In this work, we overcome this limitation by focusing on a rare but clinically critical cohort - patients who suffered stroke during hospitalization while already under continuous monitoring - thereby enabling the first large-scale analysis of pre-stroke PPG waveforms aligned to verified onset times. Using MIMIC-III and MC-MED, we develop an LLM-assisted data mining pipeline to extract precise in-hospital stroke onset timestamps from unstructured clinical notes, followed by physician validation, identifying 176 patients (MIMIC) and 158 patients (MC-MED) with high-quality synchronized pre-onset PPG data, respectively. We then extract hemodynamic features from PPG and employ a ResNet-1D model to predict impending stroke across multiple early-warning horizons. The model achieves F1-scores of 0.7956, 0.8759, and 0.9406 at 4, 5, and 6 hours prior to onset on MIMIC-III, and, without re-tuning, reaches 0.9256, 0.9595, and 0.9888 on MC-MED for the same horizons. These results provide the first empirical evidence from real-world clinical data that PPG contains predictive signatures of stroke several hours before onset, demonstrating that passively acquired physiological signals can support reliable early warning, supporting a shift from post-event stroke recognition to proactive, physiology-based surveillance that may materially improve patient outcomes in routine clinical care.
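For reference, the kind of 1D residual block a "ResNet-1D" over PPG waveforms is typically built from; the layer sizes are illustrative, not the paper's configuration.

```python
# Minimal 1D residual block for waveform-style inputs such as PPG.
import torch
import torch.nn as nn

class ResBlock1D(nn.Module):
    def __init__(self, channels: int, kernel: int = 7):
        super().__init__()
        pad = kernel // 2
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, kernel, padding=pad),
            nn.BatchNorm1d(channels),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel, padding=pad),
            nn.BatchNorm1d(channels),
        )
        self.act = nn.ReLU()

    def forward(self, x):
        # Skip connection keeps gradients flowing through deep stacks.
        return self.act(x + self.body(x))

x = torch.randn(4, 32, 1024)  # (batch, channels, PPG samples)
print(ResBlock1D(32)(x).shape)
```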

[338] Patient foundation model for risk stratification in low-risk overweight patients

Zachary N. Flamholz, Dillon Tracy, Ripple Khera, Jordan Wolinsky, Nicholas Lee, Nathaniel Tann, Xiao Yin Zhu, Harry Phillips, Jeffrey Sherman

Main category: cs.LG

TL;DR: PatientTPP is a neural temporal point process model trained on clinical trajectories to learn patient representations for risk stratification in obesity, outperforming BMI in predicting cardiovascular costs.

Motivation: Accurate risk stratification for overweight/obese patients is crucial for preventive care and allocating expensive therapies like GLP-1 agonists. Current methods like BMI are limited, and there's a need for models that can leverage rich clinical trajectory data for better prediction.

Method: Developed PatientTPP, a neural temporal point process model trained on over 500,000 real-world clinical trajectories. Extended existing TPP approaches to include static and numeric features, incorporated clinical knowledge for event encoding, and modeled both event type and timing.

Result: PatientTPP representations effectively support downstream prediction tasks, including classification of obesity-associated outcomes in low-risk individuals, even for events not explicitly modeled during training. Outperformed BMI in stratifying patients by future cardiovascular-related healthcare costs and identified higher-risk patients more efficiently.

Conclusion: PatientTPP provides an interpretable, general-purpose foundation for patient risk modeling with direct applications to obesity-related care and cost targeting by modeling both clinical event type and timing in patient trajectories.

Abstract: Accurate risk stratification in patients with overweight or obesity is critical for guiding preventive care and allocating high-cost therapies such as GLP-1 receptor agonists. We present PatientTPP, a neural temporal point process (TPP) model trained on over 500,000 real-world clinical trajectories to learn patient representations from sequences of diagnoses, labs, and medications. We extend existing TPP modeling approaches to include static and numeric features and incorporate clinical knowledge for event encoding. PatientTPP representations support downstream prediction tasks, including classification of obesity-associated outcomes in low-risk individuals, even for events not explicitly modeled during training. In health economic evaluation, PatientTPP outperformed body mass index in stratifying patients by future cardiovascular-related healthcare costs, identifying higher-risk patients more efficiently. By modeling both the type and timing of clinical events, PatientTPP offers an interpretable, general-purpose foundation for patient risk modeling with direct applications to obesity-related care and cost targeting.
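For background, neural temporal point processes are usually trained by maximizing the marked-TPP log-likelihood below; whether PatientTPP uses exactly this objective is not stated in the abstract.

```latex
% Standard log-likelihood of a marked temporal point process over an
% observation window [0, T]: event terms plus a compensator integral.
\log \mathcal{L}
  = \sum_{i} \log \lambda_{k_i}\!\left(t_i \mid \mathcal{H}_{t_i}\right)
  - \sum_{k} \int_{0}^{T} \lambda_{k}\!\left(t \mid \mathcal{H}_{t}\right) \, dt
```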

[339] OmniMER: Auxiliary-Enhanced LLM Adaptation for Indonesian Multimodal Emotion Recognition

Xueming Yan, Boyan Xu, Yaochu Jin, Lixian Xiao, Wenlong Ye, Runyang Cai, Zeqi Zheng, Jingfa Liu, Aimin Yang, Yongduan Song

Main category: cs.LG

TL;DR: Introduces IndoMER, the first multimodal emotion recognition benchmark for Indonesian, and OmniMER, a multimodal adaptation framework built on Qwen2.5-Omni that uses auxiliary modality-specific perception tasks to improve emotion recognition performance.

Motivation: Indonesian is widely spoken but underserved in multimodal emotion recognition research despite its dominance on Southeast Asian social media. There's a need for culturally-aware multimodal datasets and models for this language.

Method: Created IndoMER dataset with 1,944 video segments featuring text, audio, and visual annotations across 7 emotion categories. Proposed OmniMER framework built on Qwen2.5-Omni with three auxiliary tasks: emotion keyword extraction (text), facial expression analysis (video), and prosody analysis (audio) to identify emotion-relevant cues before fusion.

Result: OmniMER achieves 0.582 Macro-F1 on sentiment classification and 0.454 on emotion recognition, outperforming base model by 7.6 and 22.1 absolute points respectively. Cross-lingual evaluation on Chinese CH-SIMS dataset shows generalizability.

Conclusion: The work addresses the gap in Indonesian multimodal emotion recognition with a culturally-aware dataset and effective framework that leverages auxiliary modality-specific tasks to improve performance in low-resource settings.

Abstract: Indonesian, spoken by over 200 million people, remains underserved in multimodal emotion recognition research despite its dominant presence on Southeast Asian social media platforms. We introduce IndoMER, the first multimodal emotion recognition benchmark for Indonesian, comprising 1,944 video segments from 203 speakers with temporally aligned text, audio, and visual annotations across seven emotion categories. The dataset exhibits realistic challenges including cross-modal inconsistency and long-tailed class distributions shaped by Indonesian cultural communication norms. To address these challenges, we propose OmniMER, a multimodal adaptation framework built upon Qwen2.5-Omni that enhances emotion recognition through three auxiliary modality-specific perception tasks: emotion keyword extraction for text, facial expression analysis for video, and prosody analysis for audio. These auxiliary tasks help the model identify emotion-relevant cues in each modality before fusion, reducing reliance on spurious correlations in low-resource settings. Experiments on IndoMER show that OmniMER achieves 0.582 Macro-F1 on sentiment classification and 0.454 on emotion recognition, outperforming the base model by 7.6 and 22.1 absolute points respectively. Cross-lingual evaluation on the Chinese CH-SIMS dataset further demonstrates the generalizability of the proposed framework. The dataset and code are publicly available. https://github.com/yanxm01/INDOMER

[340] Looping Back to Move Forward: Recursive Transformers for Efficient and Flexible Large Multimodal Models

Ruihan Xu, Yuting Gao, Lan Wang, Jianing Li, Weihao Chen, Qingpei Guo, Ming Yang, Shiliang Zhang

Main category: cs.LG

TL;DR: RecursiveVLM introduces a recursive Transformer architecture for Large Multimodal Models that reuses parameters through recursive refinement to extract stronger multimodal representations without increasing model size.

Motivation: Large Multimodal Models have vast parameter counts that are often underutilized during training and inference. The paper aims to improve efficiency by reusing model parameters through recursive refinement rather than increasing model size.

Method: Proposes RecursiveVLM with two key innovations: 1) Recursive Connector that aligns features across recursion steps by fusing intermediate-layer hidden states with modality-specific projections, and 2) Monotonic Recursion Loss that supervises every step and guarantees performance improves monotonically with recursion depth.

Result: Experiments show consistent gains of +3% over standard Transformers and +7% over vanilla recursive baselines. The design enables on-demand refinement: strong results with few loops on resource-constrained devices and progressive improvement with more computation.

Conclusion: Strategic looping through recursive refinement is a powerful path toward efficient, deployment-adaptive Large Multimodal Models that can balance performance and computational resources.

Abstract: Large Multimodal Models (LMMs) have achieved remarkable success in vision-language tasks, yet their vast parameter counts are often underutilized during both training and inference. In this work, we embrace the idea of looping back to move forward: reusing model parameters through recursive refinement to extract stronger multimodal representations without increasing model size. We propose RecursiveVLM, a recursive Transformer architecture tailored for LMMs. Two key innovations enable effective looping: (i) a Recursive Connector that aligns features across recursion steps by fusing intermediate-layer hidden states and applying modality-specific projections, respecting the distinct statistical structures of vision and language tokens; (ii) a Monotonic Recursion Loss that supervises every step and guarantees performance improves monotonically with recursion depth. This design transforms recursion into an on-demand refinement mechanism: delivering strong results with few loops on resource-constrained devices and progressively improving outputs when more computation resources are available. Experiments show consistent gains of +3% over standard Transformers and +7% over vanilla recursive baselines, demonstrating that strategic looping is a powerful path toward efficient, deployment-adaptive LMMs.
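A hedged sketch of a monotonic recursion loss: supervise every recursion step and penalize any step that fails to improve on the previous one. The hinge form is an assumption; the abstract only states per-step supervision with monotonically improving quality.

```python
# Supervise each recursion step and add a hinge penalty whenever a later
# step's loss is not lower than the previous step's.
import torch

def monotonic_recursion_loss(step_losses: list[torch.Tensor],
                             margin: float = 0.0) -> torch.Tensor:
    total = torch.stack(step_losses).mean()              # per-step supervision
    for prev, curr in zip(step_losses, step_losses[1:]):
        total = total + torch.relu(curr - prev + margin) # enforce descent
    return total

losses = [torch.tensor(1.0), torch.tensor(0.7), torch.tensor(0.8)]
print(monotonic_recursion_loss(losses))  # the third step incurs a penalty
```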

[341] DMamba: Decomposition-enhanced Mamba for Time Series Forecasting

Ruxuan Chen, Fang Sun

Main category: cs.LG

TL;DR: DMamba: A novel time series forecasting model using seasonal-trend decomposition with a specialized Mamba encoder for seasonal components and a simple MLP for trend components, achieving state-of-the-art performance.

DetailsMotivation: Existing Mamba-based architectures struggle with non-stationary time series patterns. The statistical nature of inter-variable relationships differs fundamentally between trend and seasonal components: trends reside on lower-dimensional manifolds, while seasonal components require more expressive modeling of dynamic, high-dimensional interactions.

Method: DMamba employs seasonal-trend decomposition and processes components with specialized modules: a variable-direction Mamba encoder captures rich cross-variable dynamics in seasonal components, while a simple MLP learns from lower-dimensional inter-variable relationships in trend components.

Result: Extensive experiments on diverse datasets demonstrate that DMamba sets a new state-of-the-art, consistently outperforming both recent Mamba-based architectures and leading decomposition-based models.

Conclusion: DMamba successfully aligns architectural complexity with component-specific characteristics of time series data, providing an effective approach for handling non-stationary patterns through specialized processing of seasonal and trend components.

Abstract: State Space Models (SSMs), particularly Mamba, have shown potential in long-term time series forecasting. However, existing Mamba-based architectures often struggle with datasets characterized by non-stationary patterns. A key observation from time series theory is that the statistical nature of inter-variable relationships differs fundamentally between the trend and seasonal components of a decomposed series. Trend relationships are often driven by a few common stochastic factors or long-run equilibria, suggesting that they reside on a lower-dimensional manifold. In contrast, seasonal relationships involve dynamic, high-dimensional interactions like phase shifts and amplitude co-movements, requiring more expressive modeling. In this paper, we propose DMamba, a novel forecasting model that explicitly aligns architectural complexity with this component-specific characteristic. DMamba employs seasonal-trend decomposition and processes the components with specialized, differentially complex modules: a variable-direction Mamba encoder captures the rich, cross-variable dynamics within the seasonal component, while a simple Multi-Layer Perceptron (MLP) suffices to learn from the lower-dimensional inter-variable relationships in the trend component. Extensive experiments on diverse datasets demonstrate that DMamba sets a new state-of-the-art (SOTA), consistently outperforming both recent Mamba-based architectures and leading decomposition-based models.
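
A minimal sketch of the routing idea under stated assumptions: moving-average seasonal-trend decomposition, an expressive sequence encoder for the seasonal branch, and a plain MLP for the trend branch. A GRU stands in for the variable-direction Mamba encoder (which requires a dedicated SSM kernel), and the kernel size and head shapes are illustrative:

```python
# Hedged sketch of DMamba-style dual-branch processing; not the paper's code.
import torch
import torch.nn as nn

class DecompForecaster(nn.Module):
    def __init__(self, n_vars, seq_len, horizon, kernel=25, hidden=64):
        super().__init__()
        self.avg = nn.AvgPool1d(kernel, stride=1, padding=kernel // 2,
                                count_include_pad=False)
        self.seasonal_enc = nn.GRU(n_vars, hidden, batch_first=True)
        self.seasonal_head = nn.Linear(hidden, n_vars * horizon)
        self.trend_mlp = nn.Sequential(nn.Linear(seq_len, hidden), nn.ReLU(),
                                       nn.Linear(hidden, horizon))
        self.horizon, self.n_vars = horizon, n_vars

    def forward(self, x):                       # x: [batch, seq_len, n_vars]
        trend = self.avg(x.transpose(1, 2))     # moving-average trend [B, V, L]
        seasonal = x - trend.transpose(1, 2)    # residual seasonal component
        _, h = self.seasonal_enc(seasonal)      # expressive branch (Mamba stand-in)
        season_out = self.seasonal_head(h[-1]).view(-1, self.horizon, self.n_vars)
        trend_out = self.trend_mlp(trend).transpose(1, 2)  # simple per-variable MLP
        return season_out + trend_out
```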

[342] From Adam to Adam-Like Lagrangians: Second-Order Nonlocal Dynamics

Carlos Heredia

Main category: cs.LG

TL;DR: Accelerated continuous-time formulation of Adam as a second-order integro-differential dynamical system with stability analysis and variational viewpoint.

DetailsMotivation: To develop a continuous-time theoretical framework for understanding Adam optimization algorithm through dynamical systems and variational perspectives.

Method: Model Adam as a second-order integro-differential dynamical system, relate it to existing first-order nonlocal Adam flow via α-refinement limit, provide Lyapunov-based stability analysis, and introduce Adam-inspired nonlocal Lagrangian formulation.

Result: Derived accelerated continuous-time formulation of Adam, established connections to existing models, provided convergence analysis, and numerical simulations on Rosenbrock-type examples show agreement with discrete Adam.

Conclusion: The paper provides a rigorous continuous-time dynamical systems perspective on Adam optimization with stability guarantees and variational interpretation.

Abstract: In this paper, we derive an accelerated continuous-time formulation of Adam by modeling it as a second-order integro-differential dynamical system. We relate this inertial nonlocal model to an existing first-order nonlocal Adam flow through an $\alpha$-refinement limit, and we provide Lyapunov-based stability and convergence analyses. We also introduce an Adam-inspired nonlocal Lagrangian formulation, offering a variational viewpoint. Numerical simulations on Rosenbrock-type examples show agreement between the proposed dynamics and discrete Adam.
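
For orientation, the discrete Adam recursion that the continuous-time model refines is standard; the nonlocal flow below is a schematic rendering in our own notation (the paper's exact memory kernels $k_1, k_2$ and its added inertial $\ddot{x}$ term are not reproduced here):

```latex
% Standard discrete Adam:
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}, \\
x_{t+1} &= x_t - \alpha\, \frac{m_t/(1-\beta_1^{t})}{\sqrt{v_t/(1-\beta_2^{t})} + \epsilon}.
\end{aligned}
% Schematic first-order nonlocal flow: moving averages become memory integrals.
\dot{x}(t) = -\frac{m(t)}{\sqrt{v(t)} + \epsilon}, \qquad
m(t) = \int_0^{t} k_1(t,s)\, \nabla f(x(s))\, ds, \qquad
v(t) = \int_0^{t} k_2(t,s)\, \nabla f(x(s))^{2}\, ds.
```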

[343] Positive-Unlabelled Active Learning to Curate a Dataset for Orca Resident Interpretation

Bret Nestor, Bohan Yao, Jasmine Moore, Jasper Kanes

Main category: cs.LG

TL;DR: Largest curation of Southern Resident Killer Whale acoustic data using weakly-supervised active learning to create transformer-based marine mammal detectors and classifiers, yielding thousands of hours of annotated audio data for conservation and research.

DetailsMotivation: To create the largest acoustic dataset of Southern Resident Killer Whales and other marine mammals for conservation efforts, addressing the need for comprehensive audio data to study critically endangered species and their habitats.

Method: Systematically searched 30+ years of public archival hydrophone data using weakly-supervised, positive-unlabelled active learning strategy with transformer-based detectors to identify marine mammal instances across multiple datasets.

Result: Created dataset with 919 hours of SRKW data, 230 hours of Bigg’s orca, 1374 hours of unlabelled orca, 1501 hours of humpback, and other marine mammal audio. Transformer detectors outperformed state-of-the-art on multiple benchmarks with 0-28.8% specificity at 95% sensitivity.

Conclusion: This comprehensive acoustic dataset enables unsupervised machine translation, habitat usage surveys, and conservation efforts for critically endangered marine mammals, providing valuable resources for ecological research and monitoring.

Abstract: This work presents the largest curation of Southern Resident Killer Whale (SRKW) acoustic data to date, also containing other marine mammals in their environment. We systematically search all available public archival hydrophone data within the SRKW habitat (over 30 years of audio data). The search consists of a weakly-supervised, positive-unlabelled, active learning strategy to identify all instances of marine mammals. The resulting transformer-based detectors outperform state-of-the-art detectors on the DEEPAL, DCLDE-2026, and two newly introduced expert-annotated datasets in terms of accuracy, energy efficiency, and speed. The detection model has a specificity of 0-28.8% at 95% sensitivity. Our multiclass species classifier obtains a top-1 accuracy of 42.1% (11 train classes, 4 test classes) and our ecotype classifier obtains a top-1 accuracy of 43.0% (4 train classes, 5 test classes) on the DCLDE-2026 dataset. We yield 919 hours of SRKW data, 230 hours of Bigg’s orca data, 1374 hours of orca data from unlabelled ecotypes, 1501 hours of humpback data, 88 hours of sea lion data, 246 hours of pacific white-sided dolphin data, and over 784 hours of unspecified marine mammal data. This SRKW dataset is larger than DCLDE-2026, Ocean Networks Canada, and OrcaSound combined. The curated species labels are available under CC-BY 4.0 license, and the corresponding audio data are available under the licenses of the original owners. The comprehensive nature of this dataset makes it suitable for unsupervised machine translation, habitat usage surveys, and conservation endeavours for this critically endangered ecotype.
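
A schematic positive-unlabelled active-learning loop of the kind the abstract describes: score the unlabelled pool with the current detector, send a budget of likely positives and uncertain clips to an annotator, and retrain. All interfaces (`detector`, `annotate`) and the seeding heuristic are placeholders, not the paper's pipeline:

```python
# Hedged sketch of a PU active-learning loop; feature extraction omitted.
import numpy as np

def pu_active_learning(detector, positives, pool, annotate,
                       rounds=5, budget=200):
    """positives: confirmed-call feature vectors; pool: unlabelled vectors;
    annotate(clip) -> 0/1 oracle label from a human expert."""
    X, y = list(positives), [1] * len(positives)
    # Common PU heuristic: seed with random unlabelled clips as provisional negatives.
    rng = np.random.default_rng(0)
    seeds = rng.choice(len(pool), size=len(positives), replace=False)
    for i in sorted(seeds.tolist(), reverse=True):
        X.append(pool.pop(i)); y.append(0)
    for _ in range(rounds):
        detector.fit(np.array(X), np.array(y))
        scores = detector.predict_proba(np.array(pool))[:, 1]
        exploit = np.argsort(-scores)[: budget // 2]               # likely positives
        explore = np.argsort(np.abs(scores - 0.5))[: budget // 2]  # most uncertain
        picked = np.unique(np.concatenate([exploit, explore]))
        for i in sorted(picked.tolist(), reverse=True):  # pop high indices first
            clip = pool.pop(i)
            X.append(clip)
            y.append(annotate(clip))
    return detector
```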

[344] Distributed Hybrid Parallelism for Large Language Models: Comparative Study and System Design Guide

Hossam Amer, Rezaul Karim, Ali Pourranjbar, Weiwei Zhang, Walid Ahmed, Boxing Chen

Main category: cs.LG

TL;DR: A comprehensive survey of distributed computing methods for large language models, covering collective operations, parallel strategies, hybrid designs, and automated optimization techniques.

DetailsMotivation: Existing surveys provide descriptive overviews but lack systematic analysis of benefits/trade-offs and principled methodology for designing optimal distributed systems for LLM training and inference.

Method: Comprehensive review of collective operations and distributed parallel strategies with mathematical formulations, examination of hybrid parallelization designs, analysis of communication-computation overlap, discussion of automated search methods using cost models, and case studies with mainstream architectures.

Result: Provides theoretical understanding through mathematical formulations, empirical insights from case studies, and systematic analysis of distributed computing techniques for LLMs across training and inference stages.

Conclusion: Highlights open challenges in current LLM training paradigms and outlines promising directions for next-generation large-scale model development, offering guidance for researchers and practitioners in parallelism strategy selection.

Abstract: With the rapid growth of large language models (LLMs), a wide range of methods have been developed to distribute computation and memory across hardware devices for efficient training and inference. While existing surveys provide descriptive overviews of these techniques, systematic analysis of their benefits and trade-offs, and of how such insights can inform a principled methodology for designing optimal distributed systems, remains limited. This paper offers a comprehensive review of collective operations and distributed parallel strategies, complemented by mathematical formulations to deepen theoretical understanding. We further examine hybrid parallelization designs, emphasizing communication-computation overlap across different stages of model deployment, including both training and inference. Recent advances in automated search for optimal hybrid parallelization strategies using cost models are also discussed. Moreover, we present case studies with mainstream architecture categories to reveal empirical insights that guide researchers and practitioners in parallelism strategy selection. Finally, we highlight open challenges and limitations of current LLM training paradigms and outline promising directions for the next generation of large-scale model development.
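
As a flavor of the cost models used in automated parallelism search, here is a toy comparison of per-step communication time for data versus tensor parallelism; the ring all-reduce formula is standard, while the bandwidth, payload sizes, and the two-plan menu are illustrative assumptions:

```python
# Toy analytical cost model; numbers are illustrative, not from the paper.

def allreduce_time(nbytes, p, bw):
    """Ring all-reduce: each device moves ~2*(p-1)/p of the payload."""
    return 2 * (p - 1) / p * nbytes / bw

def plan_costs(param_bytes, act_bytes_per_layer, layers, p, bw):
    dp = allreduce_time(param_bytes, p, bw)       # one gradient all-reduce per step
    tp = layers * 2 * allreduce_time(act_bytes_per_layer, p, bw)  # ~2 per layer
    return {"data_parallel": dp, "tensor_parallel": tp}

# Example: 14 GB of gradients, 8 devices, 100 GB/s links.
costs = plan_costs(param_bytes=14e9, act_bytes_per_layer=64e6,
                   layers=32, p=8, bw=100e9)
print(min(costs, key=costs.get), costs)
```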

[345] Benchmarking the Energy Savings with Speculative Decoding Strategies

Rohit Dutta, Paramita Koley, Soham Poddar, Janardan Misra, Sanjay Podder, Naveen Balani, Saptarshi Ghosh, Niloy Ganguly

Main category: cs.LG

TL;DR: Survey paper analyzing energy requirements of speculative decoding strategies for LLMs, examining how model size, decoding strategies, and datasets affect energy optimization.

DetailsMotivation: While speculative decoding has become popular for reducing LLM inference latency and cost, there has been insufficient attention to the energy requirements of these methods. The paper aims to address this gap by providing a comprehensive analysis of energy consumption in speculative decoding.

Method: Conducted a comprehensive survey and analysis of energy requirements across different speculative decoding strategies. Examined multiple factors including model size and family, different speculative decoding approaches, and dataset characteristics to understand their impact on energy optimization.

Result: Provides a detailed analysis of how various factors influence energy optimization in speculative decoding. The analysis likely identifies trade-offs between speed, accuracy, and energy consumption and offers insights into energy-efficient speculative decoding strategies.

Conclusion: Energy considerations are crucial for sustainable and efficient LLM deployment. The survey provides valuable insights for researchers and practitioners to optimize speculative decoding strategies not just for speed and cost, but also for energy efficiency.

Abstract: Speculative decoding has emerged as an effective method to reduce latency and inference cost of LLM inferences. However, there has been inadequate attention towards the energy requirements of these models. To address this gap, this paper presents a comprehensive survey of energy requirements of speculative decoding strategies, with detailed analysis on how various factors – model size and family, speculative decoding strategies, and dataset characteristics – influence the energy optimizations.
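
For readers new to the technique being benchmarked, here is a minimal greedy speculative-decoding loop: a cheap draft model proposes k tokens sequentially, the target model verifies them in one batched pass, and the longest agreeing prefix is kept. `draft_next` and `target_argmax` are placeholder callables, not a specific library API:

```python
# Minimal greedy speculative decoding; real systems use probabilistic
# acceptance, but the control flow (and its energy profile) is the same.

def speculative_decode(prompt, draft_next, target_argmax, k=4, max_new=64):
    """draft_next(tokens) -> next token from the small draft model.
    target_argmax(prefix, drafts) -> the target model's greedy token at each
    of the k+1 positions following `prefix`, computed in one batched pass."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        drafts = []
        for _ in range(k):                      # k cheap sequential draft steps
            drafts.append(draft_next(out + drafts))
        verified = target_argmax(out, drafts)   # one expensive verification pass
        n_accept = 0
        for d, t in zip(drafts, verified):
            if d != t:
                break
            n_accept += 1
        out += drafts[:n_accept]
        out.append(verified[n_accept])          # target's token at first mismatch
    return out
```

Energy benchmarking of the kind the paper performs would wrap such a loop with per-token power measurement for different draft models and draft lengths k.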

[346] Importance inversion transfer identifies shared principles for cross-domain learning

Daniele Caligiore

Main category: cs.LG

TL;DR: X-CDTL framework uses network science and explainable AI to identify structural invariants for cross-domain knowledge transfer, achieving 56% improvement in decision stability under noise.

DetailsMotivation: Existing transfer learning methods fail to bridge radically heterogeneous systems, especially under data scarcity or noise. The paper aims to develop a principled approach for cross-disciplinary knowledge propagation by identifying shared organizational principles across domains.

Method: Proposes Explainable Cross-Domain Transfer Learning (X-CDTL) framework that unifies network science and explainable AI. Introduces Importance Inversion Transfer (IIT) mechanism that prioritizes domain-invariant structural anchors over highly discriminative but idiosyncratic features.

Result: In anomaly detection tasks, models guided by X-CDTL principles achieve significant performance gains, including 56% relative improvement in decision stability under extreme noise compared to traditional baselines.

Conclusion: The work demonstrates evidence for shared organizational signatures across heterogeneous domains (biological, linguistic, molecular, social networks) and establishes a principled paradigm for cross-disciplinary knowledge propagation by shifting from opaque latent representations to explicit structural laws.

Abstract: The capacity to transfer knowledge across scientific domains relies on shared organizational principles. However, existing transfer-learning methodologies often fail to bridge radically heterogeneous systems, particularly under severe data scarcity or stochastic noise. This study formalizes Explainable Cross-Domain Transfer Learning (X-CDTL), a framework unifying network science and explainable artificial intelligence to identify structural invariants that generalize across biological, linguistic, molecular, and social networks. By introducing the Importance Inversion Transfer (IIT) mechanism, the framework prioritizes domain-invariant structural anchors over idiosyncratic, highly discriminative features. In anomaly detection tasks, models guided by these principles achieve significant performance gains - exhibiting a 56% relative improvement in decision stability under extreme noise - over traditional baselines. These results provide evidence for a shared organizational signature across heterogeneous domains, establishing a principled paradigm for cross-disciplinary knowledge propagation. By shifting from opaque latent representations to explicit structural laws, this work advances machine learning as a robust engine for scientific discovery.

[347] Quantifying Multimodal Imbalance: A GMM-Guided Adaptive Loss for Audio-Visual Learning

Zhaocheng Liu, Zhiwen Yu, Xiaoqing Liu

Main category: cs.LG

TL;DR: A framework for diagnosing and mitigating modality imbalance in multimodal learning through sample-level analysis and adaptive loss reweighting based on modality gap distributions.

DetailsMotivation: Multimodal learning suffers from modality imbalance where dominant modalities suppress weaker ones due to inconsistent convergence rates. Existing methods use static modulation or heuristics and fail to address sample-level variations in prediction bias, especially for outlier samples with exacerbated modality gaps from low data quality.

Method: Proposes a two-stage framework: 1) Warm-up stage, 2) Adaptive Training stage with GMM-guided Adaptive Loss. Introduces Modality Gap metric to quantify prediction discrepancies, identifies bimodal distribution indicating balanced/imbalanced subgroups, uses Gaussian Mixture Model for soft subgroup separation, and dynamically reallocates optimization priorities based on Bayesian posterior probabilities.

Result: Experiments on CREMA-D, AVE, and KineticSound datasets demonstrate significant outperformance over state-of-the-art baselines. Also shows that fine-tuning on GMM-filtered balanced subsets serves as effective data purification strategy, yielding substantial gains by eliminating extreme noisy samples.

Conclusion: The proposed framework effectively addresses modality imbalance at sample level through quantitative diagnosis and dynamic mitigation, improving multimodal learning performance and offering data purification benefits.

Abstract: Multimodal learning integrates diverse modalities but suffers from modality imbalance, where dominant modalities suppress weaker ones due to inconsistent convergence rates. Existing methods predominantly rely on static modulation or heuristics, overlooking sample-level distributional variations in prediction bias. Specifically, they fail to distinguish outlier samples where the modality gap is exacerbated by low data quality. We propose a framework to quantitatively diagnose and dynamically mitigate this imbalance at the sample level. We introduce the Modality Gap metric to quantify prediction discrepancies. Analysis reveals that this gap follows a bimodal distribution, indicating the coexistence of balanced and imbalanced sample subgroups. We employ a Gaussian Mixture Model (GMM) to explicitly model this distribution, leveraging Bayesian posterior probabilities for soft subgroup separation. Our two-stage framework comprises a Warm-up stage and an Adaptive Training stage. In the latter, a GMM-guided Adaptive Loss dynamically reallocates optimization priorities: it imposes stronger alignment penalties on imbalanced samples to rectify bias, while prioritizing fusion for balanced samples to maximize complementary information. Experiments on CREMA-D, AVE, and KineticSound demonstrate that our method significantly outperforms SOTA baselines. Furthermore, we show that fine-tuning on a GMM-filtered balanced subset serves as an effective data purification strategy, yielding substantial gains by eliminating extreme noisy samples even without the adaptive loss.
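
A sketch of the diagnosis step under stated assumptions: compute a per-sample modality gap between audio-only and visual-only predictions, fit a two-component GMM over the gaps, and use the posterior of the high-gap component as a soft weight on an alignment penalty. The L1 gap definition and the weighting form are our choices, not necessarily the paper's:

```python
# Hedged sketch of GMM-guided soft subgroup weighting for modality imbalance.
import numpy as np
from sklearn.mixture import GaussianMixture

def gap_weights(audio_probs, visual_probs):
    """audio_probs, visual_probs: [n, n_classes] per-modality predictions."""
    # Modality gap: here a simple L1 distance between modality posteriors.
    gaps = np.abs(audio_probs - visual_probs).sum(axis=1, keepdims=True)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(gaps)
    imbalanced = int(np.argmax(gmm.means_.ravel()))   # high-gap component
    return gmm.predict_proba(gaps)[:, imbalanced]     # P(sample is imbalanced)

# total_loss = fusion_loss + (w * alignment_penalty).mean()   # w = gap_weights(...)
```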

[348] SpinCastML an Open Decision-Making Application for Inverse Design of Electrospinning Manufacturing: A Machine Learning, Optimal Sampling and Inverse Monte Carlo Approach

Elisa Roldan, Tasneem Sabir

Main category: cs.LG

TL;DR: SpinCastML is an open-source machine learning and Inverse Monte Carlo software for inverse design in electrospinning, enabling prediction of full fiber diameter distributions with chemical constraints.

DetailsMotivation: Electrospinning lacks frameworks for inverse design that integrate polymer-solvent chemical constraints or predict full fiber diameter distributions, requiring a shift from trial-and-error to data-driven design.

Method: Built on a curated dataset of 68,480 fiber diameters, integrates three structured sampling methods, 11 high-performance learners, and chemistry-aware constraints with Inverse Monte Carlo for distribution prediction.

Result: A Cubist model with polymer-balanced Sobol D-optimal sampling achieves R² > 0.92; IMC captures fiber distributions with R² > 0.90 and <1% error in success rates; enables inverse design with quantified success probabilities.

Conclusion: SpinCastML establishes distribution-aware inverse design as a new standard for sustainable nanofiber manufacturing, democratizing access to advanced modeling and reducing experimental waste.

Abstract: Electrospinning is a powerful technique for producing micro- to nanoscale fibers with application-specific architectures. Small variations in solution or operating conditions can shift the jet regime, generating non-Gaussian fiber diameter distributions. Despite substantial progress, no existing framework enables inverse design toward desired fiber outcomes while integrating polymer-solvent chemical constraints or predicting full distributions. SpinCastML is an open-source, distribution-aware, chemically informed machine learning and Inverse Monte Carlo (IMC) software for inverse electrospinning design. Built on a rigorously curated dataset of 68,480 fiber diameters from 1,778 datasets across 16 polymers, SpinCastML integrates three structured sampling methods, a suite of 11 high-performance learners, and chemistry-aware constraints to predict not only mean diameter but the entire distribution. A Cubist model with polymer-balanced Sobol D-optimal sampling provides the highest global performance (R² > 0.92). IMC accurately captures the fiber distributions, achieving R² > 0.90 and <1% error between predicted and experimental success rates. The IMC engine supports both retrospective analysis and forward-looking inverse design, generating physically and chemically feasible polymer-solvent parameter combinations with quantified success probabilities for user-defined targets. SpinCastML reframes electrospinning from trial and error to a reproducible, data-driven design process. As an open-source executable, it enables laboratories to analyze their own datasets and co-create an expanding community software. SpinCastML reduces experimental waste, accelerates discovery, and democratizes access to advanced modeling, establishing distribution-aware inverse design as a new standard for sustainable nanofiber manufacturing across biomedical, filtration, and energy applications.

[349] Epistemic Throughput: Fundamental Limits of Attention-Constrained Inference

Lei You

Main category: cs.LG

TL;DR: The paper formalizes Attention-Constrained Inference (ACI) for AI systems that generate many candidates but can only verify a few, deriving scaling laws for epistemic throughput.

DetailsMotivation: Modern AI systems can generate many outputs cheaply, but human verification is expensive and limited. This creates a bottleneck where decision-makers must form reliable conclusions from many records with scarce attention.

Method: Formalizes Attention-Constrained Inference (ACI) with two stages: cheap screening of K records and expensive verification of at most B records. Analyzes epistemic throughput (posterior uncertainty reduction) under Bayes log-loss, deriving scaling laws.

Result: Derives “JaKoB” scaling law showing epistemic throughput has linear baseline term plus information-leverage term scaling as √(JKB). Shows scaling is tight in weak-screening limit, and heavy-tailed score distributions are needed for substantial leverage in sparse-verification regime.

Conclusion: Expanding cheap screening can nonlinearly amplify scarce verification capacity, especially when informative records are rare. Heavy-tailed score distributions enable significant information leverage in attention-constrained settings.

Abstract: Recent generative and tool-using AI systems can surface a large volume of candidates at low marginal cost, yet only a small fraction can be checked carefully. This creates a decoder-side bottleneck: downstream decision-makers must form reliable posteriors from many public records under scarce attention. We formalize this regime via Attention-Constrained Inference (ACI), in which a cheap screening stage processes $K$ records and an expensive verification stage can follow up on at most $B$ of them. Under Bayes log-loss, we study the maximum achievable reduction in posterior uncertainty per window, which we call epistemic throughput. Our main result is a "JaKoB" scaling law showing that epistemic throughput has a baseline term that grows linearly with verification and prevalence, and an additional information-leverage term that scales as $\sqrt{JKB}$, where $J$ summarizes screening quality. Thus, expanding cheap screening can nonlinearly amplify scarce verification, even when informative records are rare. We further show that this scaling is tight in a weak-screening limit, and that in the sparse-verification regime ($B \ll K$), substantial leverage requires heavy-tailed score distributions; for light-tailed scores the amplification is only logarithmic.
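
A schematic restatement of the scaling law as described in the abstract, with $\pi$ denoting the prevalence of informative records and $c_0, c_1$ placeholder constants (the paper's precise statement and regularity conditions are not reproduced here):

```latex
% Schematic "JaKoB" scaling law for epistemic throughput:
%   linear baseline in verification budget B and prevalence \pi,
%   plus an information-leverage term in screening quality J and pool size K.
\mathcal{T}_{\mathrm{epi}}(K, B) \;\approx\; c_0\, \pi B \;+\; c_1 \sqrt{J K B}.
```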

[350] Counterfactual Maps: What They Are and How to Find Them

Awa Khouna, Julien Ferry, Thibaut Vidal

Main category: cs.LG

TL;DR: Exact counterfactual explanations for tree ensembles using geometric nearest-region search with KD-trees for millisecond query times.

DetailsMotivation: Counterfactual explanations are crucial for interpretable ML but challenging to compute exactly for complex models like tree ensembles. Existing methods lack optimality guarantees or don't scale for interactive use.

Method: Transform tree ensembles into equivalent partitions of labeled hyperrectangles, then cast counterfactual search as nearest-region search using generalized Voronoi cells. Use volumetric KD-trees for branch-and-bound nearest-region queries with optimality certificates.

Result: Achieves globally optimal counterfactual explanations with millisecond-level latency, orders of magnitude faster than existing exact methods on real datasets from high-stakes domains.

Conclusion: Geometric approach to counterfactual generation enables exact, efficient explanations for tree ensembles, making interactive recourse feasible through amortized preprocessing and sublinear query times.

Abstract: Counterfactual explanations are a central tool in interpretable machine learning, yet computing them exactly for complex models remains challenging. For tree ensembles, predictions are piecewise constant over a large collection of axis-aligned hyperrectangles, implying that an optimal counterfactual for a point corresponds to its projection onto the nearest rectangle with an alternative label under a chosen metric. Existing methods largely overlook this geometric structure, relying either on heuristics with no optimality guarantees or on mixed-integer programming formulations that do not scale to interactive use. In this work, we revisit counterfactual generation through the lens of nearest-region search and introduce counterfactual maps, a global representation of recourse for tree ensembles. Leveraging the fact that any tree ensemble can be compressed into an equivalent partition of labeled hyperrectangles, we cast counterfactual search as the problem of identifying the generalized Voronoi cell associated with the nearest rectangle of an alternative label. This leads to an exact, amortized algorithm based on volumetric k-dimensional (KD) trees, which performs branch-and-bound nearest-region queries with explicit optimality certificates and sublinear average query time after a one-time preprocessing phase. Our experimental analyses on several real datasets drawn from high-stakes application domains show that this approach delivers globally optimal counterfactual explanations with millisecond-level latency, achieving query times that are orders of magnitude faster than existing exact, cold-start optimization methods.
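
The geometric primitive behind the search is easy to state: the L2 distance from a point to an axis-aligned box, with the projection onto the box serving as the counterfactual. The brute-force scan below is only the specification that the paper's KD-tree branch-and-bound must match, not its fast algorithm:

```python
# Point-to-hyperrectangle distance and a brute-force nearest-region query.
import numpy as np

def box_distance(x, lo, hi):
    # Componentwise clamp: distance is 0 along any axis where x is inside.
    d = np.maximum(lo - x, 0) + np.maximum(x - hi, 0)
    return np.linalg.norm(d)

def nearest_counterfactual(x, boxes, labels, current_label):
    """boxes: list of (lo, hi) arrays; returns the projection of x onto the
    nearest box with a different label, i.e. the optimal counterfactual."""
    best, best_d = None, np.inf
    for (lo, hi), lab in zip(boxes, labels):
        if lab == current_label:
            continue
        d = box_distance(x, lo, hi)
        if d < best_d:
            best_d, best = d, np.clip(x, lo, hi)   # projection onto the box
    return best, best_d
```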

[351] UniComp: A Unified Evaluation of Large Language Model Compression via Pruning, Quantization and Distillation

Jonathan von Rad, Yong Cao, Andreas Geiger

Main category: cs.LG

TL;DR: UniComp is a unified evaluation framework for comparing LLM compression techniques (pruning, quantization, distillation) across performance, reliability, and efficiency dimensions using diverse benchmarks.

DetailsMotivation: Existing evaluations of model compression techniques are limited in method coverage and focus primarily on knowledge-centric benchmarks, lacking comprehensive assessment across different compression approaches.

Method: Introduces UniComp framework evaluating six compression techniques on modern LLMs across 40+ datasets, assessing performance, reliability, and efficiency with hardware-aware analysis.

Result: Compression shows a consistent knowledge bias (knowledge-intensive tasks are relatively preserved while reasoning, multilingual, and instruction-following capabilities degrade); quantization offers the best performance-efficiency trade-off; distillation provides runtime acceleration at high computational cost; task-specific calibration improves pruned models’ reasoning by up to 50%.

Conclusion: UniComp provides comprehensive evaluation framework revealing compression biases and trade-offs, with quantization being most balanced and task-specific calibration significantly improving pruned models.

Abstract: Model compression is increasingly essential for deploying large language models (LLMs), yet existing evaluations are limited in method coverage and focus primarily on knowledge-centric benchmarks. Thus, we introduce UniComp, a unified evaluation framework for comparing pruning, quantization, and knowledge distillation. UniComp evaluates compressed models along three dimensions: performance, reliability, and efficiency, using a diverse set of capability- and safety-oriented benchmarks together with a hardware-aware efficiency analysis. Through extensive evaluation of six compression techniques on modern LLMs across more than 40 datasets, we find that (i) compression exhibits a consistent knowledge bias, where knowledge-intensive tasks are relatively preserved while reasoning, multilingual, and instruction-following capabilities degrade substantially; (ii) quantization provides the best overall trade-off between retained performance and efficiency, whereas distillation yields strong runtime acceleration gains at high computational cost; and (iii) task-specific calibration can significantly improve the reasoning ability of pruned models by up to 50%.

[352] What do Geometric Hallucination Detection Metrics Actually Measure?

Eric Yeats, John Buckheit, Sarah Scullen, Brendan Kennedy, Loc Truong, Davis Brown, Bill Kay, Cliff Joslyn, Tegan Emerson, Michael J. Henry, John Emanuello, Henry Kvinge

Main category: cs.LG

TL;DR: Analysis of geometric signals in LLMs for hallucination detection, showing different statistics capture different hallucination types, with domain shift sensitivity addressed via normalization.

DetailsMotivation: Hallucination remains a critical barrier for deploying generative models in high-consequence applications, especially when external ground truth is unavailable. The paper aims to understand what specific properties of hallucinations are captured by geometric statistics in LLM internal states, given that hallucinations can have different characteristics (irrelevance vs incoherence).

Method: 1) Generated a synthetic dataset varying distinct hallucination properties: correctness, confidence, relevance, coherence, and completeness. 2) Analyzed which geometric statistics capture which hallucination types. 3) Discovered existing geometric detection methods have substantial sensitivity to domain shifts. 4) Introduced a simple normalization method to mitigate domain shift effects on geometric statistics.

Result: Different geometric statistics capture different types of hallucinations. Existing methods show significant sensitivity to task domain shifts (e.g., math vs. history questions). The proposed normalization method achieves AUROC gains of +34 points in multi-domain settings.

Conclusion: Geometric statistics in LLMs can detect specific hallucination properties, but domain shift is a major challenge. Simple normalization effectively mitigates this issue, improving hallucination detection across diverse domains.

Abstract: Hallucination remains a barrier to deploying generative models in high-consequence applications. This is especially true in cases where external ground truth is not readily available to validate model outputs. This situation has motivated the study of geometric signals in the internal state of an LLM that are predictive of hallucination and require limited external knowledge. Given that there are a range of factors that can lead model output to be called a hallucination (e.g., irrelevance vs incoherence), in this paper we ask what specific properties of a hallucination these geometric statistics actually capture. To assess this, we generate a synthetic dataset which varies distinct properties of output associated with hallucination. This includes output correctness, confidence, relevance, coherence, and completeness. We find that different geometric statistics capture different types of hallucinations. Along the way we show that many existing geometric detection methods have substantial sensitivity to shifts in task domain (e.g., math questions vs. history questions). Motivated by this, we introduce a simple normalization method to mitigate the effect of domain shift on geometric statistics, leading to AUROC gains of +34 points in multi-domain settings.
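
One simple instantiation of per-domain normalization, offered as an assumption rather than the paper's exact method: z-score each geometric statistic within its task domain so that a single detector threshold transfers across, say, math and history prompts:

```python
# Per-domain z-scoring of geometric hallucination statistics.
import numpy as np

def per_domain_zscore(stats, domains):
    """stats: [n, d] geometric features; domains: [n] domain ids."""
    out = np.empty_like(stats, dtype=float)
    for dom in np.unique(domains):
        idx = domains == dom
        mu = stats[idx].mean(axis=0)
        sd = stats[idx].std(axis=0) + 1e-8      # avoid division by zero
        out[idx] = (stats[idx] - mu) / sd
    return out
```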

[353] Boltzmann Reinforcement Learning for Noise resilience in Analog Ising Machines

Aditya Choudhary, Saaketh Desai, Prasad Iyer

Main category: cs.LG

TL;DR: BRAIN is a reinforcement learning framework that makes analog Ising machines noise-resilient for combinatorial optimization by learning Boltzmann distributions from aggregated noisy measurements.

DetailsMotivation: Analog Ising machines (AIMs) offer energy-efficient combinatorial optimization but suffer from measurement noise that degrades traditional algorithms like MCMC. There's a need for noise-resilient methods to fully leverage AIMs' potential.

Method: BRAIN uses variational reinforcement learning to approximate Boltzmann distributions. Instead of state-by-state sampling, it aggregates information across multiple noisy measurements, making it resilient to Gaussian noise characteristic of AIMs.

Result: Under 3% Gaussian noise, BRAIN maintains 98% ground state fidelity vs. MCMC’s 51%, reaches MCMC-equivalent solutions 192x faster, scales as O(N^1.55) up to 65,536 spins, and handles up to 40% measurement uncertainty while capturing phase transitions.

Conclusion: BRAIN provides a scalable, noise-resilient framework for utilizing analog computing architectures in complex optimizations, overcoming key limitations of traditional methods on AIMs.

Abstract: Analog Ising machines (AIMs) have emerged as a promising paradigm for combinatorial optimization, utilizing physical dynamics to solve Ising problems with high energy efficiency. However, the performance of traditional optimization and sampling algorithms on these platforms is often limited by inherent measurement noise. We introduce BRAIN (Boltzmann Reinforcement for Analog Ising Networks), a distribution learning framework that utilizes variational reinforcement learning to approximate the Boltzmann distribution. By shifting from state-by-state sampling to aggregating information across multiple noisy measurements, BRAIN is resilient to Gaussian noise characteristic of AIMs. We evaluate BRAIN across diverse combinatorial topologies, including the Curie-Weiss and 2D nearest-neighbor Ising systems. We find that under realistic 3% Gaussian measurement noise, BRAIN maintains 98% ground state fidelity, whereas Markov Chain Monte Carlo (MCMC) methods degrade to 51% fidelity. Furthermore, BRAIN reaches the MCMC-equivalent solution up to 192x faster under these conditions. BRAIN exhibits $\mathcal{O}(N^{1.55})$ scaling up to 65,536 spins and maintains robustness against severe measurement uncertainty up to 40%. Beyond ground state optimization, BRAIN accurately captures thermodynamic phase transitions and metastable states, providing a scalable and noise-resilient method for utilizing analog computing architectures in complex optimizations.
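
A sketch of variational Boltzmann learning with noisy energy readouts, under stated assumptions: a factorized Bernoulli distribution over spins is trained with a REINFORCE-style gradient of the variational free energy, and each configuration's energy is an average over several noisy reads rather than a single measurement. The factorized model and the averaging scheme are illustrative; BRAIN's actual parameterization may differ:

```python
# Hedged sketch: variational distribution learning against a noisy Ising energy.
import torch

def brain_step(theta, noisy_energy, beta=1.0, n_samples=256, n_reads=8):
    """theta: leaf tensor of spin logits with requires_grad=True;
    noisy_energy: maps a [n_samples, n_spins] +/-1 tensor to noisy energies."""
    probs = torch.sigmoid(theta)                          # P(spin = +1)
    spins = (torch.rand(n_samples, theta.numel()) < probs).float() * 2.0 - 1.0
    up = (spins > 0).float()
    logq = (up * torch.log(probs) + (1 - up) * torch.log(1 - probs)).sum(dim=1)
    # Aggregate repeated noisy reads instead of trusting single measurements.
    energy = torch.stack([noisy_energy(spins) for _ in range(n_reads)]).mean(dim=0)
    free = beta * energy + logq                   # per-sample variational objective
    loss = ((free - free.mean()).detach() * logq).mean()  # REINFORCE with baseline
    loss.backward()
    return loss
```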

[354] Faster Rates For Federated Variational Inequalities

Guanghui Wang, Satyen Kale

Main category: cs.LG

TL;DR: Improved federated optimization algorithms for stochastic variational inequalities with better convergence rates and reduced client drift

DetailsMotivation: There's a significant gap between existing convergence rates for federated variational inequalities and state-of-the-art bounds for federated convex optimization. The paper aims to address this limitation by establishing improved convergence rates and addressing client drift issues in federated VI optimization.

Method: 1. Refined analysis of classical Local Extra SGD algorithm for general smooth and monotone variational inequalities. 2. Identified limitations of Local Extra SGD leading to excessive client drift. 3. Proposed new algorithm: Local Inexact Proximal Point Algorithm with Extra Step (LIPPAX) to mitigate client drift. 4. Extended results to federated composite variational inequalities.

Result: Established improved convergence guarantees for federated stochastic variational inequalities across several regimes including bounded Hessian, bounded operator, and low-variance settings. LIPPAX algorithm demonstrated better performance by mitigating client drift issues present in Local Extra SGD.

Conclusion: The paper bridges the gap between federated VI optimization and federated convex optimization by providing improved convergence rates and addressing client drift through novel algorithm design and refined analysis techniques.

Abstract: In this paper, we study federated optimization for solving stochastic variational inequalities (VIs), a problem that has attracted growing attention in recent years. Despite substantial progress, a significant gap remains between existing convergence rates and the state-of-the-art bounds known for federated convex optimization. In this work, we address this limitation by establishing a series of improved convergence rates. First, we show that, for general smooth and monotone variational inequalities, the classical Local Extra SGD algorithm admits tighter guarantees under a refined analysis. Next, we identify an inherent limitation of Local Extra SGD, which can lead to excessive client drift. Motivated by this observation, we propose a new algorithm, the Local Inexact Proximal Point Algorithm with Extra Step (LIPPAX), and show that it mitigates client drift and achieves improved guarantees in several regimes, including bounded Hessian, bounded operator, and low-variance settings. Finally, we extend our results to federated composite variational inequalities and establish improved convergence guarantees.
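
For reference, the classical baseline the paper reanalyzes, Local Extra SGD, is easy to sketch: each client runs local extragradient steps on stochastic evaluations of a monotone operator F, and the server periodically averages iterates. This is the baseline whose client drift LIPPAX is designed to mitigate, not LIPPAX itself:

```python
# Schematic Local Extra SGD for a monotone operator F (e.g., the gradient
# field of a convex-concave saddle-point problem). NumPy, single process.
import numpy as np

def local_extra_sgd(F_stoch, x0, n_clients, rounds, local_steps, lr):
    """F_stoch(c, x) -> stochastic evaluation of client c's operator at x."""
    x = np.tile(x0, (n_clients, 1)).astype(float)
    for _ in range(rounds):
        for c in range(n_clients):
            for _ in range(local_steps):
                x_half = x[c] - lr * F_stoch(c, x[c])   # extrapolation step
                x[c] = x[c] - lr * F_stoch(c, x_half)   # update step
        x[:] = x.mean(axis=0)                           # server averaging
    return x[0]
```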

[355] Train Less, Infer Faster: Efficient Model Finetuning and Compression via Structured Sparsity

Jonathan Svirsky, Yehonathan Refael, Ofir Lindenbaum

Main category: cs.LG

TL;DR: Sparsifying specific rows and columns in foundation language models enables efficient task adaptation without weight tuning, using stochastic gates to remove 20-40% of parameters while maintaining accuracy and reducing inference time.

DetailsMotivation: Fully finetuning large language models is impractical due to computational costs, memory requirements, and overfitting risk. Existing methods like low-rank adapters increase memory usage and don't reduce inference latency.

Method: Proposes sparsifying specific model rows and columns using training stochastic gates for efficient task adaptation without weight tuning. This approach requires minimal trainable parameters and removes 20-40% of model parameters.

Result: Outperforms recent finetuning baselines in efficiency and performance, reduces inference time, and provides theoretical guarantees for convergence of the stochastic gating process with better-conditioned optimization landscape compared to LoRA.

Conclusion: Sparsity is a compelling mechanism for task-specific adaptation in language models, offering efficient adaptation with minimal parameter tuning and improved inference efficiency.

Abstract: Fully finetuning foundation language models (LMs) with billions of parameters is often impractical due to high computational costs, memory requirements, and the risk of overfitting. Although methods like low-rank adapters help address these challenges by adding small trainable modules to the frozen LM, they also increase memory usage and do not reduce inference latency. We uncover an intriguing phenomenon: sparsifying specific model rows and columns enables efficient task adaptation without requiring weight tuning. We propose a scheme for effective finetuning via sparsification using training stochastic gates, which requires minimal trainable parameters, reduces inference time, and removes 20–40% of model parameters without significant accuracy loss. Empirical results show it outperforms recent finetuning baselines in efficiency and performance. Additionally, we provide theoretical guarantees for the convergence of this stochastic gating process, and show that our method admits a simpler and better-conditioned optimization landscape compared to LoRA. Our results highlight sparsity as a compelling mechanism for task-specific adaptation in LMs.
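
A sketch of row-level sparsification with Gaussian stochastic gates (in the style of existing stochastic-gate work; the abstract does not specify the paper's exact gate placement or relaxation). Only the gate means are trained, the base weights stay frozen, and the expected number of open gates serves as a differentiable sparsity penalty:

```python
# Hedged sketch: gating whole rows of a frozen linear layer with stochastic gates.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class RowGatedLinear(nn.Module):
    """Wraps a frozen nn.Linear; learns which output rows to keep."""
    def __init__(self, frozen_linear, sigma=0.5):
        super().__init__()
        self.base = frozen_linear.requires_grad_(False)
        self.mu = nn.Parameter(torch.zeros(frozen_linear.out_features))
        self.sigma = sigma

    def gates(self):
        # z = clamp(0.5 + mu + eps, 0, 1), eps ~ N(0, sigma^2) during training
        noise = torch.randn_like(self.mu) * self.sigma if self.training else 0.0
        return torch.clamp(0.5 + self.mu + noise, 0.0, 1.0)

    def forward(self, x):
        w = self.base.weight * self.gates().unsqueeze(1)   # gate whole rows
        return F.linear(x, w, self.base.bias)

    def expected_open(self):
        # Differentiable open-gate count: sum_i P(z_i > 0) = Phi((0.5+mu_i)/sigma).
        z = (0.5 + self.mu) / (self.sigma * math.sqrt(2.0))
        return 0.5 * (1.0 + torch.erf(z)).sum()
```

Training would minimize `task_loss + lam * layer.expected_open()`; at inference, rows whose gate is exactly zero can be physically removed, which is what yields the reduced latency.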

[356] $n$-Musketeers: Reinforcement Learning Shapes Collaboration Among Language Models

Ryozo Masukawa, Sanggeon Yun, Hyunwoo Oh, SuhgHeon Jeong, Raheeb Hassa, Hanning Chen, Wenjun Huang, Mahdi Imani, Pietro Mercati, Nathaniel D. Bastian, Mohsen Imani

Main category: cs.LG

TL;DR: Soft hidden-state collaboration integrates multiple frozen specialized language models via trainable attention on their internal representations, achieving competitive reasoning performance while revealing emergent expert specialization patterns.

DetailsMotivation: To leverage multiple heterogeneous frozen specialized language models (SLMs) without relying on large monolithic LLMs, enabling structured reasoning through latent integration of expert representations rather than just output combination.

Method: Proposes soft hidden-state collaboration where multiple frozen SLM experts are integrated through their internal hidden states using a trainable attention interface, allowing the system to learn how to combine expert representations during reinforcement learning with verifiable rewards (RLVR).

Result: Competitive performance with strong single-model RLVR baselines on Reasoning Gym and GSM8K benchmarks; ablation studies reveal a dual mechanism: simpler domains rely on static expert preferences, while challenging settings induce increasingly concentrated and structured expert attention, showing emergent specialization.

Conclusion: Hidden-state collaboration provides a compact mechanism for leveraging frozen experts while offering insights into expert utilization patterns and their evolution under RLVR, demonstrating that small SLMs can achieve structured reasoning without large LLMs.

Abstract: Recent progress in reinforcement learning with verifiable rewards (RLVR) shows that small, specialized language models (SLMs) can exhibit structured reasoning without relying on large monolithic LLMs. We introduce soft hidden-state collaboration, where multiple heterogeneous frozen SLM experts are integrated through their internal representations via a trainable attention interface. Experiments on Reasoning Gym and GSM8K show that this latent integration is competitive with strong single-model RLVR baselines. Ablations further reveal a dual mechanism of expert utilization: for simpler arithmetic domains, performance gains can largely be explained by static expert preferences, whereas more challenging settings induce increasingly concentrated and structured expert attention over training, indicating emergent specialization in how the router connects to relevant experts. Overall, hidden-state collaboration provides a compact mechanism for leveraging frozen experts, while offering an observational window into expert utilization patterns and their evolution under RLVR.
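
A schematic of soft hidden-state collaboration under stated assumptions: each frozen expert exposes a hidden state for the same input, and a small trainable attention head mixes per-expert projections into one representation. The single-query design and the dimensions are our choices, not the paper's specification:

```python
# Hedged sketch of a trainable attention interface over frozen expert states.
import torch
import torch.nn as nn

class HiddenStateCollab(nn.Module):
    def __init__(self, expert_dims, d_model):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in expert_dims])
        self.query = nn.Parameter(torch.randn(d_model))

    def forward(self, expert_hiddens):          # list of [batch, d_i] tensors
        h = torch.stack([p(x) for p, x in zip(self.proj, expert_hiddens)], dim=1)
        attn = torch.softmax(h @ self.query / h.shape[-1] ** 0.5, dim=1)  # [B, E]
        return (attn.unsqueeze(-1) * h).sum(dim=1)       # fused representation
```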

[357] Weighted Wasserstein Barycenter of Gaussian Processes for exotic Bayesian Optimization tasks

Antonio Candelieri, Francesco Archetti

Main category: cs.LG

TL;DR: W2BGP framework uses weighted Wasserstein Barycenter of Gaussian Processes to unify various Bayesian Optimization tasks (collaborative/federated, batch, multi-fidelity) through different weighting schemes.

DetailsMotivation: To create a unified framework for different exotic Bayesian Optimization tasks that are typically handled separately, leveraging the analogy between Gaussian Distributions and Gaussian Processes' posterior.

Method: Proposes weighted Wasserstein Barycenter of Gaussian Processes (W2BGP) framework where different BO tasks (collaborative/federated, batch, multi-fidelity) are unified by applying appropriate weighting schemes to the Wasserstein Barycenter while keeping the overall framework unchanged.

Result: Empirical analysis shows each BO task requires only an appropriate weighting scheme for W2BGP; the framework enables reinterpretation of well-known BO acquisition functions and provides more computationally efficient Wasserstein Barycenter computation than state-of-the-art methods.

Conclusion: W2BGP provides a unified, efficient framework for various Bayesian Optimization tasks through weighted Wasserstein Barycenters, with potential for further research extensions.

Abstract: Exploiting the analogy between Gaussian Distributions and Gaussian Processes’ posterior, we present how the weighted Wasserstein Barycenter of Gaussian Processes (W2BGP) can be used to unify, under a common framework, different exotic Bayesian Optimization (BO) tasks. Specifically, collaborative/federated BO, (synchronous) batch BO, and multi-fidelity BO are considered in this paper. Our empirical analysis proves that each one of these tasks requires just an appropriate weighting schema for the W2BGP, while the entire framework remains untouched. Moreover, we demonstrate that the most well-known BO acquisition functions can be easily re-interpreted under the proposed framework and also enable a more computationally efficient way to deal with the computation of the Wasserstein Barycenter, compared with state-of-the-art methods from the Machine Learning literature. Finally, research perspectives branching from the proposed approach are presented.
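
The Gaussian case underlying the construction is classical (Agueh and Carlier, 2011), and by the abstract's analogy it carries over to GP posteriors evaluated at a common set of points:

```latex
% Weighted W2 barycenter of Gaussians N(\mu_i, \Sigma_i) with weights w_i:
\bar{\mu} = \sum_i w_i\, \mu_i, \qquad
\bar{\Sigma} = \sum_i w_i \left( \bar{\Sigma}^{1/2}\, \Sigma_i\, \bar{\Sigma}^{1/2} \right)^{1/2}.
```

The covariance equation is typically solved by fixed-point iteration; the task-specific weighting schemes the paper proposes enter through the $w_i$.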

[358] Gradient Residual Connections

Yangchen Pan, Qizhen Ying, Philip Torr, Bo Liu

Main category: cs.LG

TL;DR: Proposes gradient-based residual connections to improve neural networks’ ability to approximate high-frequency functions, complementing standard identity skip connections.

DetailsMotivation: Existing work links gradient properties to function approximation difficulty. High-frequency functions are challenging for neural networks, and standard residual connections struggle with rapidly varying patterns.

Method: Introduces gradient-based residual connections as complement to identity skip connections. Provides theoretical intuition for gradient information helping distinguish inputs. Uses convex combination of standard and gradient residuals for flexible control.

Result: Gradient residuals substantially improve approximation quality on synthetic high-frequency regression. Validated on single-image super-resolution (high-frequency underlying function). Comparable performance to standard residual networks on image classification and segmentation.

Conclusion: Gradient-based residual connections effectively improve neural networks’ ability to approximate high-frequency functions while maintaining performance on standard tasks, suggesting broad utility.

Abstract: Existing work has linked properties of a function’s gradient to the difficulty of function approximation. Motivated by these insights, we study how gradient information can be leveraged to improve a neural network’s ability to approximate high-frequency functions, and we propose a gradient-based residual connection as a complement to the standard identity skip connection used in residual networks. We provide simple theoretical intuition for why gradient information can help distinguish inputs and improve the approximation of functions with rapidly varying behaviour. On a synthetic regression task with a high-frequency sinusoidal ground truth, we show that conventional residual connections struggle to capture high-frequency patterns. In contrast, our gradient residual substantially improves approximation quality. We then introduce a convex combination of the standard and gradient residuals, allowing the network to flexibly control how strongly it relies on gradient information. After validating the design choices of our proposed method through an ablation study, we further validate our approach’s utility on the single-image super-resolution task, where the underlying function may be high-frequency. Finally, on standard tasks such as image classification and segmentation, our method achieves performance comparable to standard residual networks, suggesting its broad utility.

[359] ML-DCN: Masked Low-Rank Deep Crossing Network Towards Scalable Ads Click-through Rate Prediction at Pinterest

Jiacheng Li, Yixiong Meng, Yi wu, Yun Zhao, Sharare Zehtabian, Jiayin Jin, Degao Peng, Jinfeng Zhuang, Qifei Shen, Kungang Li

Main category: cs.LG

TL;DR: ML-DCN: A scalable feature interaction module for recommendation systems that combines DCNv2 and MaskNet with instance-conditioned masks for efficient computation and improved performance under fixed serving budgets.

DetailsMotivation: Large-scale ad ranking systems need feature interaction modules that can scale effectively with additional compute while remaining compute-efficient at serving time due to strict latency and FLOPs constraints in production environments.

Method: Proposes ML-DCN, which integrates an instance-conditioned mask into a low-rank crossing layer, enabling per-example selection and amplification of salient interaction directions while maintaining efficient computation.

Result: ML-DCN achieves higher AUC than DCNv2, MaskNet, and recent scaling-oriented alternatives at matched FLOPs on Pinterest ads dataset, scales more favorably as compute increases, and shows statistically significant improvements in online A/B tests for key ads metrics.

Conclusion: ML-DCN successfully addresses the scaling-efficiency trade-off in recommendation systems, combining strengths of existing approaches and achieving state-of-the-art performance with neutral serving cost in production deployment.

Abstract: Deep learning recommendation systems rely on feature interaction modules to model complex user-item relationships across sparse categorical and dense features. In large-scale ad ranking, increasing model capacity is a promising path to improving both predictive performance and business outcomes, yet production serving budgets impose strict constraints on latency and FLOPs. This creates a central tension: we want interaction modules that both scale effectively with additional compute and remain compute-efficient at serving time. In this work, we study how to scale feature interaction modules under a fixed serving budget. We find that naively scaling DCNv2 and MaskNet, despite their widespread adoption in industry, yields rapidly diminishing offline gains in the Pinterest ads ranking system. To overcome the aforementioned limitations, we propose ML-DCN, an interaction module that integrates an instance-conditioned mask into a low-rank crossing layer, enabling per-example selection and amplification of salient interaction directions while maintaining efficient computation. This novel architecture combines the strengths of DCNv2 and MaskNet, scales efficiently with increased compute, and achieves state-of-the-art performance. Experiments on a large internal Pinterest ads dataset show that ML-DCN achieves higher AUC than DCNv2, MaskNet, and recent scaling-oriented alternatives at matched FLOPs, and it scales more favorably overall as compute increases, exhibiting a stronger AUC-FLOPs trade-off. Finally, online A/B tests demonstrate statistically significant improvements in key ads metrics (including CTR and click-quality measures), and ML-DCN has been deployed in the production system with neutral serving cost.
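
A plausible reading of the masked low-rank cross layer, offered with the caveat that the abstract does not specify where the mask enters: DCNv2's low-rank crossing $x_{l+1} = x_0 \odot (U V x_l + b) + x_l$ with a MaskNet-style instance-conditioned gate applied in the rank-$r$ bottleneck:

```python
# Hedged sketch combining a low-rank cross layer with an instance mask;
# the exact mask placement and activation in ML-DCN are our assumptions.
import torch
import torch.nn as nn

class MaskedLowRankCross(nn.Module):
    def __init__(self, dim, rank):
        super().__init__()
        self.V = nn.Linear(dim, rank, bias=False)   # down-projection
        self.U = nn.Linear(rank, dim)               # up-projection
        self.mask_mlp = nn.Sequential(nn.Linear(dim, rank), nn.ReLU(),
                                      nn.Linear(rank, rank))  # instance mask

    def forward(self, x0, xl):
        m = torch.sigmoid(self.mask_mlp(x0))        # per-example gating, [B, r]
        return x0 * self.U(m * self.V(xl)) + xl     # gated low-rank crossing
```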

[360] Fair Feature Importance Scores via Feature Occlusion and Permutation

Camille Little, Madeline Navarro, Santiago Segarra, Genevera Allen

Main category: cs.LG

TL;DR: Proposes two model-agnostic methods to measure feature importance for fairness (not accuracy): permutation-based and occlusion-based approaches that quantify how individual features contribute to model fairness metrics.

DetailsMotivation: As ML models impact society, their opacity challenges trust and accountability, especially regarding fairness. While feature importance for accuracy is well-established, methods for assessing feature contributions to fairness remain underexplored, creating a gap in interpretable and equitable model development.

Method: Two model-agnostic approaches: 1) Permutation-based: compare model fairness before and after permuting a feature's values, decoupling that feature from the model's predictions; 2) Occlusion-based: evaluate the fairness of models trained with and without a given feature, using minipatch learning for computational efficiency.

Result: Empirical results show simplicity and effectiveness of proposed metrics across multiple predictive tasks. Both methods provide scalable, interpretable solutions for quantifying feature influence on fairness.

Conclusion: The proposed methods offer practical tools for responsible ML development by enabling quantification of how individual features affect model fairness, addressing a critical gap in interpretability for equitable AI systems.

Abstract: As machine learning models increasingly impact society, their opaque nature poses challenges to trust and accountability, particularly in fairness contexts. Understanding how individual features influence model outcomes is crucial for building interpretable and equitable models. While feature importance metrics for accuracy are well-established, methods for assessing feature contributions to fairness remain underexplored. We propose two model-agnostic approaches to measure fair feature importance. First, we propose to compare model fairness before and after permuting feature values. This simple intervention-based approach decouples a feature and model predictions to measure its contribution to training. Second, we evaluate the fairness of models trained with and without a given feature. This occlusion-based score enjoys dramatic computational simplification via minipatch learning. Our empirical results reflect the simplicity and effectiveness of our proposed metrics for multiple predictive tasks. Both methods offer simple, scalable, and interpretable solutions to quantify the influence of features on fairness, providing new tools for responsible machine learning development.
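
A sketch of the permutation-based score under stated assumptions (demographic parity as the fairness metric; the paper treats fairness metrics generically): permute one feature, re-measure the group disparity, and report the drop:

```python
# Permutation-based fair feature importance with a demographic parity gap.
import numpy as np

def dp_gap(model, X, group):
    """Absolute difference in positive-prediction rate between two groups."""
    yhat = model.predict(X)
    return abs(yhat[group == 0].mean() - yhat[group == 1].mean())

def fair_permutation_importance(model, X, group, j, n_repeats=10, seed=0):
    rng = np.random.default_rng(seed)
    base = dp_gap(model, X, group)
    drops = []
    for _ in range(n_repeats):
        Xp = X.copy()
        Xp[:, j] = rng.permutation(Xp[:, j])   # break feature j's association
        drops.append(base - dp_gap(model, Xp, group))
    # Positive score: scrambling the feature shrinks the disparity,
    # i.e. the feature was contributing to unfairness.
    return float(np.mean(drops))
```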

[361] CausalGDP: Causality-Guided Diffusion Policies for Reinforcement Learning

Xiaofeng Xiao, Xiao Hu, Yang Ye, Xubo Yue

Main category: cs.LG

TL;DR: CausalGDP integrates causal reasoning into diffusion-based RL policies to identify which action components truly cause high returns, improving performance in complex control tasks.

DetailsMotivation: Existing diffusion-based RL policies rely on statistical associations and fail to explicitly account for causal relationships among states, actions, and rewards, limiting their ability to identify which action components truly cause high returns.

Method: CausalGDP learns a base diffusion policy and initial causal dynamical model from offline data, then continuously updates causal information during real-time interaction to guide the diffusion process toward actions that causally influence future states and rewards.

Result: CausalGDP consistently achieves competitive or superior performance over state-of-the-art diffusion-based and offline RL methods, especially in complex, high-dimensional control tasks.

Conclusion: Integrating causal reasoning into diffusion-based RL policies enables focusing optimization on action components that genuinely drive performance improvements, leading to better performance in complex sequential decision-making problems.

Abstract: Reinforcement learning (RL) has achieved remarkable success in a wide range of sequential decision-making problems. Recent diffusion-based policies further improve RL by modeling complex, high-dimensional action distributions. However, existing diffusion policies primarily rely on statistical associations and fail to explicitly account for causal relationships among states, actions, and rewards, limiting their ability to identify which action components truly cause high returns. In this paper, we propose Causality-guided Diffusion Policy (CausalGDP), a unified framework that integrates causal reasoning into diffusion-based RL. CausalGDP first learns a base diffusion policy and an initial causal dynamical model from offline data, capturing causal dependencies among states, actions, and rewards. During real-time interaction, the causal information is continuously updated and incorporated as a guidance signal to steer the diffusion process toward actions that causally influence future states and rewards. By explicitly considering causality beyond association, CausalGDP focuses policy optimization on action components that genuinely drive performance improvements. Experimental results demonstrate that CausalGDP consistently achieves competitive or superior performance over state-of-the-art diffusion-based and offline RL methods, especially in complex, high-dimensional control tasks.

[362] A Lightweight Multi-View Approach to Short-Term Load Forecasting

Julien Guité-Vinet, Alexandre Blondin Massé, Éric Beaudry

Main category: cs.LG

TL;DR: A lightweight multi-view approach for short-term load forecasting using single-value embeddings and scaled time-range inputs with embedding dropout to prevent overfitting and enhance interpretability.

DetailsMotivation: Transformer-based and large-parameter models for time series forecasting can lead to overfitting and unstable forecasts, especially when older data becomes less relevant. There's a need for more efficient, interpretable methods with fewer parameters.

Method: Proposes a lightweight multi-view approach using single-value embeddings and scaled time-range input to capture temporally relevant features efficiently. Introduces embedding dropout mechanism to prevent over-reliance on specific features and enhance interpretability.

Result: Achieves competitive performance with significantly fewer parameters, demonstrating robustness across multiple datasets including scenarios with noisy or sparse data. Provides insights into contributions of individual features to forecasts.

Conclusion: The proposed lightweight multi-view approach offers an efficient, interpretable alternative to complex transformer-based models for time series forecasting, maintaining competitive performance while reducing computational overhead and overfitting risks.

Abstract: Time series forecasting is a critical task across domains such as energy, finance, and meteorology, where accurate predictions enable informed decision-making. While transformer-based and large-parameter models have recently achieved state-of-the-art results, their complexity can lead to overfitting and unstable forecasts, especially when older data points become less relevant. In this paper, we propose a lightweight multi-view approach to short-term load forecasting that leverages single-value embeddings and a scaled time-range input to capture temporally relevant features efficiently. We introduce an embedding dropout mechanism to prevent over-reliance on specific features and enhance interpretability. Our method achieves competitive performance with significantly fewer parameters, demonstrating robustness across multiple datasets, including scenarios with noisy or sparse data, and provides insights into the contributions of individual features to the forecast.
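
A minimal sketch of the embedding-dropout idea, assuming each input feature (hour of day, day of week, the scaled time range, and so on) is embedded separately so that whole feature views, rather than individual units, are dropped; the tensor layout is an assumption:

```python
import torch
import torch.nn as nn

class EmbeddingDropout(nn.Module):
    """Drops entire per-feature embeddings so the forecaster cannot
    over-rely on any single view of the input."""

    def __init__(self, p: float = 0.2):
        super().__init__()
        self.p = p

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (batch, n_features, embed_dim), one embedding per input feature
        if not self.training or self.p == 0.0:
            return views
        keep = (torch.rand(views.shape[:2], device=views.device) >= self.p).float()
        return views * keep.unsqueeze(-1) / (1.0 - self.p)  # inverted-dropout scaling
```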

[363] Barycentric alignment for instance-level comparison of neural representations

Shreya Saha, Zoe Wanying He, Meenakshi Khosla

Main category: cs.LG

TL;DR: A barycentric alignment framework that enables instance-level comparison of neural representations across models by quotienting out nuisance symmetries, revealing stimulus-specific convergence/divergence patterns and enabling cross-modal alignment without joint training.

DetailsMotivation: Existing neural representation comparison methods are limited because they operate at set-level (entire stimulus sets) and are confounded by symmetries like unit reordering or rotations that obscure true representational equivalence between models.

Method: Introduces a barycentric alignment framework that constructs universal embedding spaces by quotienting out nuisance symmetries. Enables instance-level similarity comparison across vision/language models and brain representations. Applied to unimodal models for cross-modal alignment.

Result: Identified systematic input properties predicting representational convergence/divergence across model families. Created universal embedding spaces for brain representations across individuals/cortical regions. Post-hoc alignment of unimodal vision/language models yields image-text similarity scores tracking human judgments and approaching contrastively trained models.

Conclusion: Independently learned representations share sufficient geometric structure for human-aligned cross-modal comparison. Instance-level similarity reveals phenomena undetectable by set-level metrics, providing new insights into representational alignment across modalities.

Abstract: Comparing representations across neural networks is challenging because representations admit symmetries, such as arbitrary reordering of units or rotations of activation space, that obscure underlying equivalence between models. We introduce a barycentric alignment framework that quotients out these nuisance symmetries to construct a universal embedding space across many models. Unlike existing similarity measures, which summarize relationships over entire stimulus sets, this framework enables similarity to be defined at the level of individual stimuli, revealing inputs that elicit convergent versus divergent representations across models. Using this instance-level notion of similarity, we identify systematic input properties that predict representational convergence versus divergence across vision and language model families. We also construct universal embedding spaces for brain representations across individuals and cortical regions, enabling instance-level comparison of representational agreement across stages of the human visual hierarchy. Finally, we apply the same barycentric alignment framework to purely unimodal vision and language models and find that post-hoc alignment into a shared space yields image-text similarity scores that closely track human cross-modal judgments and approach the performance of contrastively trained vision-language models. Strikingly, this suggests that independently learned representations already share sufficient geometric structure for human-aligned cross-modal comparison. Together, these results show that resolving representational similarity at the level of individual stimuli reveals phenomena that cannot be detected by set-level comparison metrics.
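
A minimal numpy sketch of the alignment idea: orthogonal Procrustes maps into an iteratively refined mean representation. Equal embedding dimensions are assumed, and the paper's exact transform class and barycenter construction may differ:

```python
import numpy as np

def procrustes_map(X, Y):
    # Orthogonal Q minimizing ||X @ Q - Y||_F, quotienting out rotations,
    # reflections, and unit permutations between two representations.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def barycentric_align(reps, n_iters=25):
    # Alternate between aligning every model to the running mean ("barycenter")
    # and refreshing the mean; returns the barycenter and the aligned copies.
    reps = [R - R.mean(axis=0) for R in reps]  # center each (n_stimuli, dim) matrix
    bary = reps[0].copy()
    for _ in range(n_iters):
        aligned = [R @ procrustes_map(R, bary) for R in reps]
        bary = np.mean(aligned, axis=0)
    return bary, aligned

# Instance-level (per-stimulus) similarity between two aligned models:
# sim = -np.linalg.norm(aligned[0] - aligned[1], axis=1), no set-level averaging.
```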

[364] Beyond the Unit Hypersphere: Embedding Magnitude in Contrastive Learning

Xincan Feng, Taro Watanabe

Main category: cs.LG

TL;DR: Systematic study reveals that embedding magnitude carries meaningful information in contrastive learning: output magnitude correlates with relevance in text retrieval, and magnitude learning helps asymmetric tasks while harming symmetric ones.

DetailsMotivation: Cosine similarity is widely used in contrastive learning but assumes embedding magnitude is noise. Prior work occasionally found dot product comparable to cosine but didn't explain what information magnitude carries, when it helps, and how to leverage it.

Method: Conducted systematic study through a 2×2 ablation that independently controls input-side and output-side normalization across text and vision models to analyze the role of embedding magnitude.

Result: Three key findings: 1) In text retrieval, output (document) magnitude strongly correlates with relevance, yielding largest gains on reasoning-intensive tasks; 2) Input and output magnitudes serve asymmetric roles - output magnitude directly scales similarity scores while input magnitude modulates training dynamics; 3) Magnitude learning benefits asymmetric tasks (text retrieval, RAG) but harms symmetric tasks (STS, text-image alignment).

Conclusion: Establishes a task symmetry principle: choice between cosine and dot product depends on whether task has distinct input roles, enabling cost-free improvements by simply removing unnecessary normalization constraints.

Abstract: Cosine similarity is prevalent in contrastive learning, yet it makes an implicit assumption: embedding magnitude is noise. Prior work occasionally found dot product and cosine similarity comparable, but left unanswered WHAT information magnitude carries, WHEN it helps, and HOW to leverage it. We conduct a systematic study through a $2 \times 2$ ablation that independently controls input-side and output-side normalization across text and vision models. Our findings reveal three key insights. First, in text retrieval, output (document) magnitude strongly correlates with relevance (Cohen’s $d$ up to 1.80), yielding the largest gains on reasoning-intensive tasks. Second, input and output magnitudes serve asymmetric roles: output magnitude directly scales similarity scores while input magnitude modulates training dynamics. Third, magnitude learning benefits asymmetric tasks (text retrieval, RAG) but harms symmetric tasks (STS, text-image alignment). These findings establish a task symmetry principle: the choice between cosine and dot product depends on whether the task has distinct input roles, enabling cost-free improvements by simply removing an unnecessary constraint.
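
The 2×2 design is compact enough to state in code; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def similarity(q, d, norm_input=True, norm_output=True):
    # The 2x2 ablation in one function: normalizing both sides recovers cosine
    # similarity, normalizing neither is a plain dot product, and the two mixed
    # settings isolate input-side vs. output-side magnitude.
    if norm_input:
        q = F.normalize(q, dim=-1)  # query/input embeddings, shape (n_q, dim)
    if norm_output:
        d = F.normalize(d, dim=-1)  # document/output embeddings, shape (n_d, dim)
    return q @ d.T                  # (n_q, n_d) similarity matrix
```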

[365] Do Neural Networks Lose Plasticity in a Gradually Changing World?

Tianhui Liu, Lili Mou

Main category: cs.LG

TL;DR: Loss of plasticity in neural networks is largely an artifact of abrupt task changes and can be mitigated in gradually changing environments.

DetailsMotivation: Existing research on loss of plasticity relies on contrived settings with abrupt task transitions that don't reflect real-world environments where changes typically occur gradually.

Method: Investigate gradually changing environments using input/output interpolation and task sampling, performing both theoretical and empirical analysis.

Result: Loss of plasticity is shown to be an artifact of abrupt task changes and can be largely mitigated when the world changes gradually.

Conclusion: Gradual environmental changes preserve neural network plasticity better than abrupt task transitions, suggesting more realistic continual learning scenarios.

Abstract: Continual learning has become a trending topic in machine learning. Recent studies have discovered an interesting phenomenon called loss of plasticity, referring to neural networks gradually losing the ability to learn new tasks. However, existing plasticity research largely relies on contrived settings with abrupt task transitions, which often do not reflect real-world environments. In this paper, we propose to investigate a gradually changing environment, and we simulate this by input/output interpolation and task sampling. We perform theoretical and empirical analysis, showing that the loss of plasticity is an artifact of abrupt task changes in the environment and can be largely mitigated if the world changes gradually.
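
A minimal sketch of the input/output interpolation used to simulate gradual change (task sampling is not shown); the paired-array task format is an assumption:

```python
import numpy as np

def blended_batch(task_a, task_b, step, total_steps):
    # Draw training data from a world that drifts linearly from task A to
    # task B; sweeping `step` slowly replaces the usual abrupt task switch.
    alpha = step / total_steps           # 0 -> pure task A, 1 -> pure task B
    (Xa, ya), (Xb, yb) = task_a, task_b
    X = (1.0 - alpha) * Xa + alpha * Xb  # input interpolation
    y = (1.0 - alpha) * ya + alpha * yb  # output interpolation
    return X, y
```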

[366] RAPID: Risk of Attribute Prediction-Induced Disclosure in Synthetic Microdata

Matthias Templ, Oscar Thees, Roman Müller

Main category: cs.LG

TL;DR: RAPID is a new disclosure risk measure for synthetic microdata that quantifies inferential vulnerability by simulating an adversary’s ability to predict sensitive attributes from released synthetic data.

DetailsMotivation: Traditional identity disclosure measures are less informative for fully synthetic microdata, where the real risk is an adversary's ability to infer sensitive attributes from the released data. Current methods don't adequately measure this attribute-inference risk.

Method: RAPID simulates an adversary who trains a predictive model solely on released synthetic data and applies it to real individuals’ quasi-identifiers. For continuous attributes, it measures the proportion of records with predictions within a relative error tolerance. For categorical attributes, it uses a baseline-normalized confidence score comparing attacker confidence to expected class prevalence.

Result: RAPID provides an interpretable, bounded risk metric robust to class imbalance, independent of specific synthesizers, and applicable with arbitrary learning algorithms. It offers practical upper bounds on attribute-inference disclosure risk.

Conclusion: RAPID complements existing utility diagnostics and disclosure control frameworks by providing an attacker-realistic measure of attribute-inference risk for synthetic data, addressing a gap in current statistical disclosure control methods.

Abstract: Statistical data anonymization increasingly relies on fully synthetic microdata, for which classical identity disclosure measures are less informative than an adversary’s ability to infer sensitive attributes from released data. We introduce RAPID (Risk of Attribute Prediction–Induced Disclosure), a disclosure risk measure that directly quantifies inferential vulnerability under a realistic attack model. An adversary trains a predictive model solely on the released synthetic data and applies it to real individuals’ quasi-identifiers. For continuous sensitive attributes, RAPID reports the proportion of records whose predicted values fall within a specified relative error tolerance. For categorical attributes, we propose a baseline-normalized confidence score that measures how much more confident the attacker is about the true class than would be expected from class prevalence alone, and we summarize risk as the fraction of records exceeding a policy-defined threshold. This construction yields an interpretable, bounded risk metric that is robust to class imbalance, independent of any specific synthesizer, and applicable with arbitrary learning algorithms. We illustrate threshold calibration, uncertainty quantification, and comparative evaluation of synthetic data generators using simulations and real data. Our results show that RAPID provides a practical, attacker-realistic upper bound on attribute-inference disclosure risk that complements existing utility diagnostics and disclosure control frameworks.
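
A minimal sketch of the continuous-attribute score, with a random forest standing in for the attacker (RAPID itself is agnostic to the learning algorithm) and pandas DataFrames assumed:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rapid_continuous(synthetic_df, real_df, quasi_ids, sensitive, tol=0.10):
    # Train the attacker only on released synthetic data, then score how many
    # real records it predicts within a relative error tolerance.
    attacker = RandomForestRegressor(n_estimators=200, random_state=0)
    attacker.fit(synthetic_df[quasi_ids], synthetic_df[sensitive])
    pred = attacker.predict(real_df[quasi_ids])
    true = real_df[sensitive].to_numpy(dtype=float)
    rel_err = np.abs(pred - true) / np.maximum(np.abs(true), 1e-8)
    return float((rel_err <= tol).mean())  # fraction of records at risk
```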

[367] Feature salience – not task-informativeness – drives machine learning model explanations

Benedict Clark, Marta Oliveira, Rick Wilming, Stefan Haufe

Main category: cs.LG

TL;DR: XAI methods attribute importance based on visual salience rather than learned statistical associations, as shown through watermark experiments in image classification.

DetailsMotivation: To investigate whether XAI methods truly identify informative features or are driven by other factors like feature salience, statistical suppression, or novelty at test-time.

Method: Trained deep learning models on three variants of binary image classification with translucent watermarks (absent, class-dependent confounds, class-independent noise). Evaluated five popular attribution methods and compared to model-agnostic edge detection filters.

Result: XAI methods showed substantially elevated importance in watermarked areas regardless of training setting (R² ≥ .45). Class-dependence of watermarks had minimal effect (R² ≤ .03) despite impacting model performance. Importance attribution resembled edge detection and was sensitive to feature value encoding.

Conclusion: Importance attribution is primarily driven by test-time feature salience rather than learned statistical associations, suggesting XAI methods may produce misleading results when salience and informativeness coincide spuriously.

Abstract: Explainable AI (XAI) promises to provide insight into machine learning models’ decision processes, where one goal is to identify failures such as shortcut learning. This promise relies on the field’s assumption that input features marked as important by an XAI must contain information about the target variable. However, it is unclear whether informativeness is indeed the main driver of importance attribution in practice, or if other data properties such as statistical suppression, novelty at test-time, or high feature salience substantially contribute. To clarify this, we trained deep learning models on three variants of a binary image classification task, in which translucent watermarks are either absent, act as class-dependent confounds, or represent class-independent noise. Results for five popular attribution methods show substantially elevated relative importance in watermarked areas (RIW) for all models regardless of the training setting ($R^2 \geq .45$). By contrast, whether the presence of watermarks is class-dependent or not only has a marginal effect on RIW ($R^2 \leq .03$), despite a clear impact on model performance and generalisation ability. XAI methods show similar behaviour to model-agnostic edge detection filters and attribute substantially less importance to watermarks when bright image intensities are encoded by smaller instead of larger feature values. These results indicate that importance attribution is most strongly driven by the salience of image structures at test time rather than statistical associations learned by machine learning models. Previous studies demonstrating successful XAI application should be reevaluated with respect to a possibly spurious concurrency of feature salience and informativeness, and workflows using feature attribution methods as building blocks should be scrutinised.
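
A minimal sketch of the two quantities the study turns on: the model-agnostic edge-detection baseline, and one plausible operationalization of relative importance in watermarked areas (RIW); the paper's exact RIW definition may differ:

```python
import numpy as np
from scipy.ndimage import sobel

def edge_baseline(image):
    # Model-agnostic edge-detector "attribution" the XAI heatmaps are compared to.
    return np.hypot(sobel(image, axis=0), sobel(image, axis=1))

def relative_importance_in_watermark(attribution, watermark_mask):
    # Share of total absolute attribution falling inside the watermark region.
    a = np.abs(attribution)
    return float(a[watermark_mask].sum() / a.sum())
```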

[368] Generalizing GNNs with Tokenized Mixture of Experts

Xiaoguang Guo, Zehong Wang, Jiazheng Li, Shawn Spitzel, Qi Yang, Kaize Ding, Jundong Li, Chuxu Zhang

Main category: cs.LG

TL;DR: STEM-GNN is a pretrain-then-finetune framework for graph neural networks that addresses the tradeoff between stability and generalization under distribution shifts through instance-conditional routing, mixture-of-experts encoding, and stabilization techniques.

DetailsMotivation: Deployed GNNs face challenges in fitting clean data, generalizing under distribution shifts, and remaining stable to perturbations. Static inference induces a fundamental tradeoff where improving stability reduces reliance on shift-sensitive features, creating an irreducible worst-case generalization floor.

Method: STEM-GNN uses a pretrain-then-finetune framework with: 1) mixture-of-experts encoder for diverse computation paths, 2) vector-quantized token interface to stabilize encoder-to-head signals, and 3) Lipschitz-regularized head to bound output amplification.

Result: Across nine node, link, and graph benchmarks, STEM-GNN achieves a stronger three-way balance, improving robustness to degree/homophily shifts and to feature/edge corruptions while remaining competitive on clean graphs.

Conclusion: Instance-conditional routing can break the generalization ceiling of static GNNs, but requires careful design to handle routing fragility under shifts and perturbations. The proposed STEM-GNN framework successfully addresses these challenges through architectural innovations.

Abstract: Deployed graph neural networks (GNNs) are frozen at deployment yet must fit clean data, generalize under distribution shifts, and remain stable to perturbations. We show that static inference induces a fundamental tradeoff: improving stability requires reducing reliance on shift-sensitive features, leaving an irreducible worst-case generalization floor. Instance-conditional routing can break this ceiling, but is fragile because shifts can mislead routing and perturbations can make routing fluctuate. We capture these effects via two decompositions separating coverage vs selection, and base sensitivity vs fluctuation amplification. Based on these insights, we propose STEM-GNN, a pretrain-then-finetune framework with a mixture-of-experts encoder for diverse computation paths, a vector-quantized token interface to stabilize encoder-to-head signals, and a Lipschitz-regularized head to bound output amplification. Across nine node, link, and graph benchmarks, STEM-GNN achieves a stronger three-way balance, improving robustness to degree/homophily shifts and to feature/edge corruptions while remaining competitive on clean graphs.

[369] The effect of whitening on explanation performance

Benedict Clark, Stoyan Karastoyanov, Rick Wilming, Stefan Haufe

Main category: cs.LG

TL;DR: XAI feature attribution methods often misattribute importance to non-informative suppressor variables; data whitening can partially mitigate these errors but effectiveness varies across methods and architectures.

DetailsMotivation: Feature attribution methods in XAI frequently produce unreliable explanations by assigning importance to non-informative suppressor variables due to feature dependencies, raising concerns about their interpretability and trustworthiness.

Method: Evaluated 16 popular feature attribution methods combined with 5 whitening transforms using XAI-TRIS benchmark with synthetic ground-truth data; also analyzed minimal linear 2D classification problem to theoretically assess whitening’s impact on Bayes-optimal models.

Result: Specific whitening techniques can improve explanation performance, but improvement varies substantially across XAI methods and model architectures; whitening’s effectiveness depends on data non-linearities and preprocessing quality.

Conclusion: Preprocessing techniques like data whitening play a vital role in enhancing model interpretability, but their effectiveness is complex and method-dependent, highlighting the need for careful consideration of data transformations in XAI.

Abstract: Explainable Artificial Intelligence (XAI) aims to provide transparent insights into machine learning models, yet the reliability of many feature attribution methods remains a critical challenge. Prior research (Haufe et al., 2014; Wilming et al., 2022, 2023) has demonstrated that these methods often erroneously assign significant importance to non-informative variables, such as suppressor variables, leading to fundamental misinterpretations. Since statistical suppression is induced by feature dependencies, this study investigates whether data whitening, a common preprocessing technique for decorrelation, can mitigate such errors. Using the established XAI-TRIS benchmark (Clark et al., 2024b), which offers synthetic ground-truth data and quantitative measures of explanation correctness, we empirically evaluate 16 popular feature attribution methods applied in combination with 5 distinct whitening transforms. Additionally, we analyze a minimal linear two-dimensional classification problem (Wilming et al., 2023) to theoretically assess whether whitening can remove the impact of suppressor features from Bayes-optimal models. Our results indicate that, while specific whitening techniques can improve explanation performance, the degree of improvement varies substantially across XAI methods and model architectures. These findings highlight the complex relationship between data non-linearities, preprocessing quality, and attribution fidelity, underscoring the vital role of pre-processing techniques in enhancing model interpretability.
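
For illustration, ZCA whitening is one standard decorrelating transform of the kind evaluated here (the five specific transforms are not listed in this summary); a minimal sketch:

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    # Linearly decorrelate features so their covariance becomes (approximately)
    # the identity, removing the dependencies that induce statistical suppression.
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(Xc) - 1)
    vals, vecs = np.linalg.eigh(cov)                        # symmetric eigendecomposition
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T  # ZCA transform
    return Xc @ W
```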

[370] Measuring Privacy Risks and Tradeoffs in Financial Synthetic Data Generation

Michael Zuo, Inwon Kang, Stacy Patterson, Oshani Seneviratne

Main category: cs.LG

TL;DR: The paper explores privacy-utility tradeoffs in synthetic data generation for tabular financial datasets, focusing on class imbalance and mixed-type attributes, comparing various generative models including autoencoders, GANs, diffusion models, and copula synthesizers.

DetailsMotivation: Financial datasets present unique challenges for synthetic data generation due to high regulatory risk, severe class imbalance, and mixed-type attributes. There's a need to understand how different generative models perform in this domain while balancing privacy, data quality, and downstream utility.

Method: The authors evaluate representative tabular data generators including autoencoders, GANs, diffusion models, and copula synthesizers. They provide novel privacy-preserving implementations for GAN and autoencoder synthesizers. The evaluation compares performance across balanced and imbalanced input datasets, assessing data quality, downstream utility, and privacy preservation.

Result: The study provides insights into the distinct challenges of generating synthetic data from datasets with severe class imbalance and mixed-type attributes. Results show how different generators perform in balancing privacy, data quality, and utility in the financial domain.

Conclusion: The research offers valuable insights for synthetic data generation in high-risk financial applications, highlighting the tradeoffs between privacy preservation and data utility when dealing with imbalanced, mixed-type tabular data.

Abstract: We explore the privacy-utility tradeoff of synthetic data generation schemes on tabular financial datasets, a domain characterized by high regulatory risk and severe class imbalance. We consider representative tabular data generators, including autoencoders, generative adversarial networks, diffusion, and copula synthesizers. To address the challenges of the financial domain, we provide novel privacy-preserving implementations of GAN and autoencoder synthesizers. We evaluate whether and how well the generators simultaneously achieve data quality, downstream utility, and privacy, with comparison across balanced and imbalanced input datasets. Our results offer insight into the distinct challenges of generating synthetic data from datasets that exhibit severe class imbalance and mixed-type attributes.

[371] The Laplacian Mechanism Improves Transformers by Reshaping Token Geometry

Yuchong Zhang, Vardan Papyan

Main category: cs.LG

TL;DR: Laplacian attention mechanism improves transformer performance by better controlling token variance and achieving ideal token geometry with maximal separability.

DetailsMotivation: Transformers use attention, residual connections, and layer normalization to control token variance, but the authors propose a Laplacian mechanism to give models more direct control over token variance and achieve better token geometry.

Method: Modify attention into a Laplacian mechanism that provides more direct control over token variance. Analyze its impact using PCA, cosine similarity, variance analysis, and Neural Collapse metrics.

Result: Laplacian mechanism improves transformer performance across computer vision and language benchmarks. Analysis shows it reshapes token embeddings toward maximal separability: tokens collapse according to classes and class means exhibit Neural Collapse.

Conclusion: Laplacian attention mechanism helps transformers achieve ideal token geometry with better class separability, leading to consistent performance improvements across vision and language tasks.

Abstract: Transformers leverage attention, the residual connection, and layer normalization to control the variance of token representations. We propose to modify attention into a Laplacian mechanism that gives the model more direct control over token variance. We conjecture that this helps transformers achieve the ideal token geometry. To investigate our conjecture, we first show that incorporating the Laplacian mechanism into transformers induces consistent improvements across benchmarks in computer vision and language. Next, we study how the Laplacian mechanism impacts the geometry of token representations using various tools: 1) principal component analysis, 2) cosine similarity metric, 3) analysis of variance, and 4) Neural Collapse metrics. Our investigation shows that the Laplacian mechanism reshapes token embeddings toward a geometry of maximal separability: tokens collapse according to their classes, and the class means exhibit Neural Collapse.
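
The summary does not spell the operator out, but one natural reading is that the row-stochastic attention matrix A defines a graph whose Laplacian is L = I - A, and a learnable coefficient on L gives the model direct control over token variance. A hedged sketch under that assumption, not the paper's exact mechanism:

```python
import torch
import torch.nn.functional as F

def laplacian_attention(Q, K, V, alpha):
    # A is row-stochastic, so its graph Laplacian is L = I - A; alpha scales how
    # strongly tokens are pulled toward (or pushed away from) their
    # attention-weighted mean, directly modulating token variance.
    A = F.softmax(Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5, dim=-1)
    LV = V - A @ V         # L @ V
    return V - alpha * LV  # (I - alpha * L) @ V; alpha = 1 recovers A @ V
```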

[372] Risk-sensitive reinforcement learning using expectiles, shortfall risk and optimized certainty equivalent risk

Sumedh Gupte, Shrey Rakeshkumar Patel, Soumen Pachal, Prashanth L. A., Sanjay P. Bhat

Main category: cs.LG

TL;DR: Risk-sensitive RL algorithms for three risk measures (expectiles, utility-based shortfall risk, optimized certainty equivalent) with policy gradient theorems, estimators, convergence bounds, and empirical validation.

DetailsMotivation: Standard RL focuses on expected cumulative reward, but many real-world applications require risk-sensitive decision-making to handle uncertainty and avoid catastrophic outcomes. Existing risk-sensitive RL methods are limited in scope and theoretical guarantees.

Method: Derived policy gradient theorems for three risk measures in finite-horizon MDPs, proposed gradient estimators with O(1/m) MSE bounds, established smoothness properties, and developed overall risk-sensitive policy gradient algorithms with convergence guarantees.

Result: Theoretical analysis shows O(1/m) mean-squared error bounds for gradient estimators and stationary convergence rate bounds for the overall algorithm. Numerical experiments on RL benchmarks validate theoretical findings.

Conclusion: The paper provides a comprehensive framework for risk-sensitive RL with three important risk measures, offering both theoretical guarantees and practical algorithms that can handle various risk preferences in decision-making under uncertainty.

Abstract: We propose risk-sensitive reinforcement learning algorithms catering to three families of risk measures, namely expectiles, utility-based shortfall risk and optimized certainty equivalent risk. For each risk measure, in the context of a finite horizon Markov decision process, we first derive a policy gradient theorem. Second, we propose estimators of the risk-sensitive policy gradient for each of the aforementioned risk measures, and establish $\mathcal{O}\left(1/m\right)$ mean-squared error bounds for our estimators, where $m$ is the number of trajectories. Further, under standard assumptions for policy gradient-type algorithms, we establish smoothness of the risk-sensitive objective, in turn leading to stationary convergence rate bounds for the overall risk-sensitive policy gradient algorithm that we propose. Finally, we conduct numerical experiments to validate the theoretical findings on popular RL benchmarks.
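
Expectiles, the first of the three risk measure families, are easy to estimate from sampled returns; a minimal sketch (tau = 0.5 recovers the mean):

```python
import numpy as np

def expectile(returns, tau=0.9, n_iters=60):
    # The tau-expectile is the unique m solving
    #   tau * E[(R - m)_+] = (1 - tau) * E[(m - R)_+];
    # the balance function below is decreasing in m, so bisection suffices.
    returns = np.asarray(returns, dtype=float)
    lo, hi = returns.min(), returns.max()
    for _ in range(n_iters):
        m = 0.5 * (lo + hi)
        balance = tau * np.maximum(returns - m, 0).mean() \
            - (1 - tau) * np.maximum(m - returns, 0).mean()
        lo, hi = (m, hi) if balance > 0 else (lo, m)
    return 0.5 * (lo + hi)
```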

[373] Stabilizing Physics-Informed Consistency Models via Structure-Preserving Training

Che-Chia Chang, Chen-Yang Dai, Te-Sheng Lin, Ming-Chih Lai, Chieh-Hsin Lai

Main category: cs.LG

TL;DR: Physics-informed consistency modeling for fast PDE solving using generative inference with stability improvements

DetailsMotivation: Current physics-informed generative models for PDE solving face stability issues where PDE residuals can drive models toward trivial or degenerate solutions, degrading learned data distributions. Need for stable, fast inference methods for both unconditional generation and forward problems.

Method: Two-stage training strategy decoupling distribution learning from physics enforcement by freezing coefficient decoder during physics-informed fine-tuning. Two-step residual objective enforces physical consistency on refined generative trajectories rather than noisy single-step predictions.

Result: Achieves accuracy consistent with diffusion baselines at orders-of-magnitude lower computational cost. Enables stable, high-fidelity inference for both unconditional generation and forward problems via projection-based zero-shot inpainting.

Conclusion: Proposed framework provides stable physics-informed consistency modeling for fast PDE solving, addressing key stability challenges in physics-constrained consistency training while maintaining accuracy with dramatically reduced computational cost.

Abstract: We propose a physics-informed consistency modeling framework for solving partial differential equations (PDEs) via fast, few-step generative inference. We identify a key stability challenge in physics-constrained consistency training, where PDE residuals can drive the model toward trivial or degenerate solutions, degrading the learned data distribution. To address this, we introduce a structure-preserving two-stage training strategy that decouples distribution learning from physics enforcement by freezing the coefficient decoder during physics-informed fine-tuning. We further propose a two-step residual objective that enforces physical consistency on refined, structurally valid generative trajectories rather than noisy single-step predictions. The resulting framework enables stable, high-fidelity inference for both unconditional generation and forward problems. We demonstrate that forward solutions can be obtained via a projection-based zero-shot inpainting procedure, achieving accuracy consistent with diffusion baselines at orders-of-magnitude lower computational cost.

[374] Statistical Roughness-Informed Machine Unlearning

Mohammad Partohaghighi, Roummel Marcia, Bruce J. West, YangQuan Chen

Main category: cs.LG

TL;DR: SRAGU is a machine unlearning method that uses layer-wise statistical roughness analysis to adaptively reweight gradient updates, improving stability when removing data from trained models.

DetailsMotivation: Current approximate unlearning methods fail under large or adversarial deletions due to layer-wise heterogeneity in deep networks - some layers are stable while others are brittle or overfit, causing catastrophic forgetting or unstable dynamics when applying naive update allocation.

Method: SRAGU (Statistical-Roughness Adaptive Gradient Unlearning) uses WeightWatcher-style heavy-tailed spectral diagnostics of layer weight matrices to estimate layer stability. It maps heavy-tailed exponents to spectral stability weights, then uses these to reweight AGU (Adaptive Gradient Unlearning) sensitivity signals before applying minibatch updates, concentrating updates in stable layers while damping unstable ones.

Result: The method is evaluated using behavioral alignment to a gold retrained reference model, measuring prediction-divergence and KL-to-gold proxies on forget-focused query sets, plus membership inference auditing to detect leakage of forgotten data.

Conclusion: SRAGU improves unlearning stability under hard deletions by adaptively reallocating updates based on layer-wise statistical roughness, addressing the heterogeneity problem in deep network unlearning.

Abstract: Machine unlearning aims to remove the influence of a designated forget set from a trained model while preserving utility on the retained data. In modern deep networks, approximate unlearning frequently fails under large or adversarial deletions due to pronounced layer-wise heterogeneity: some layers exhibit stable, well-regularized representations while others are brittle, undertrained, or overfit, so naive update allocation can trigger catastrophic forgetting or unstable dynamics. We propose Statistical-Roughness Adaptive Gradient Unlearning (SRAGU), a mechanism-first unlearning algorithm that reallocates unlearning updates using layer-wise statistical roughness operationalized via heavy-tailed spectral diagnostics of layer weight matrices. Starting from an Adaptive Gradient Unlearning (AGU) sensitivity signal computed on the forget set, SRAGU estimates a WeightWatcher-style heavy-tailed exponent for each layer, maps it to a bounded spectral stability weight, and uses this stability signal to spectrally reweight the AGU sensitivities before applying the same minibatch update form. This concentrates unlearning motion in spectrally stable layers while damping updates in unstable or overfit layers, improving stability under hard deletions. We evaluate unlearning via behavioral alignment to a gold retrained reference model trained from scratch on the retained data, using empirical prediction-divergence and KL-to-gold proxies on a forget-focused query set; we additionally report membership inference auditing as a complementary leakage signal, treating forget-set points as should-be-forgotten members during evaluation.
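
A minimal sketch of the layer diagnostic: a maximum-likelihood power-law exponent fit to the top of a layer's eigenvalue spectrum, plus a hypothetical bounded map to a stability weight (the paper's exact estimator and mapping are not reproduced here):

```python
import numpy as np

def heavy_tail_alpha(W, k_frac=0.1):
    # WeightWatcher-style diagnostic: power-law exponent of the largest
    # eigenvalues of W^T W, via the continuous maximum-likelihood estimator.
    eigs = np.sort(np.linalg.svd(np.asarray(W), compute_uv=False) ** 2)[::-1]
    k = max(2, int(k_frac * len(eigs)))
    tail = eigs[:k]
    xmin = max(tail[-1], 1e-12)
    return 1.0 + k / max(np.log(tail / xmin).sum(), 1e-12)

def stability_weight(alpha, center=4.0, scale=1.0):
    # Hypothetical bounded (0, 1) map used to rescale per-layer unlearning
    # sensitivities; the direction and shape of SRAGU's map may differ.
    return 1.0 / (1.0 + np.exp(-(alpha - center) / scale))
```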

[375] Reward Modeling for Reinforcement Learning-Based LLM Reasoning: Design, Challenges, and Evaluation

Pei-Chi Pan, Yingbin Liang, Sen Lin

Main category: cs.LG

TL;DR: RARL framework systematizes reward design for LLM reasoning alignment, analyzing reward mechanisms, hacking vulnerabilities, and unifying challenges like hallucination mitigation.

DetailsMotivation: Current RL-based fine-tuning for LLMs suffers from poorly understood relationships between reward modeling and core challenges like evaluation bias, hallucination, and distribution shift. Reward design is crucial for reasoning alignment but remains fragmented in research.

Method: Introduces Reasoning-Aligned Reinforcement Learning (RARL) framework with taxonomy of reward mechanisms for multi-step reasoning, analyzes reward hacking as failure mode, and examines how reward signals unify various challenges.

Result: Provides systematic analysis of reward paradigms, identifies vulnerabilities in existing benchmarks (data contamination, reward misalignment), and offers roadmap for building robust, verifiable reasoning models.

Conclusion: Reward modeling is central to reasoning alignment, not just implementation detail. RARL framework integrates fragmented research and clarifies interplay between reward design and fundamental reasoning capabilities for trustworthy models.

Abstract: Large Language Models (LLMs) demonstrate transformative potential, yet their reasoning remains inconsistent and unreliable. Reinforcement learning (RL)-based fine-tuning is a key mechanism for improvement, but its effectiveness is fundamentally governed by reward design. Despite its importance, the relationship between reward modeling and core LLM challenges–such as evaluation bias, hallucination, distribution shift, and efficient learning–remains poorly understood. This work argues that reward modeling is not merely an implementation detail but a central architect of reasoning alignment, shaping what models learn, how they generalize, and whether their outputs can be trusted. We introduce Reasoning-Aligned Reinforcement Learning (RARL), a unifying framework that systematizes diverse reward paradigms for multi-step reasoning. Within this framework, we present a taxonomy of reward mechanisms, analyze reward hacking as a pervasive failure mode, and examine how reward signals unify challenges ranging from inference-time scaling to hallucination mitigation. We further critically evaluate existing benchmarks, highlighting vulnerabilities such as data contamination and reward misalignment, and outline directions for more robust evaluation. By integrating fragmented research threads and clarifying the interplay between reward design and fundamental reasoning capabilities, this work provides a foundational roadmap for building reasoning models that are robust, verifiable, and trustworthy.

[376] Empowering Contrastive Federated Sequential Recommendation with LLMs

Thi Minh Chau Nguyen, Minh Hieu Nguyen, Duc Anh Nguyen, Xuan Huong Tran, Thanh Trung Huynh, Quoc Viet Hung Nguyen

Main category: cs.LG

TL;DR: LUMOS: A federated sequential recommendation system using on-device LLMs to generate semantic sequence variants for improved privacy-preserving recommendation without data sharing.

DetailsMotivation: Federated sequential recommendation suffers from fragmented, noisy, and homogeneous interaction data on individual devices. Existing approaches have limited semantic diversity or high system overhead.

Method: Parameter-isolated FedSeqRec architecture with on-device LLMs as local semantic generators. LLMs create three sequence variants: future-oriented trajectories, semantically equivalent rephrasings, and preference-inconsistent counterfactuals. Uses tri-view contrastive optimization.

Result: Achieves consistent gains over centralized and federated baselines on HR@20 and NDCG@20 across three public benchmarks. Improves robustness under noisy and adversarial environments without dedicated server-side protection.

Conclusion: Demonstrates LLM-driven semantic generation as a new paradigm for advancing privacy-preserving federated recommendation, showing potential for richer representation learning without exposing sensitive information.

Abstract: Federated sequential recommendation (FedSeqRec) aims to perform next-item prediction while keeping user data decentralised, yet model quality is frequently constrained by fragmented, noisy, and homogeneous interaction logs stored on individual devices. Many existing approaches attempt to compensate through manual data augmentation or additional server-side constraints, but these strategies either introduce limited semantic diversity or increase system overhead. To overcome these challenges, we propose \textbf{LUMOS}, a parameter-isolated FedSeqRec architecture that integrates large language models (LLMs) as \emph{local semantic generators}. Instead of sharing gradients or auxiliary parameters, LUMOS privately invokes an on-device LLM to construct three complementary sequence variants from each user history: (i) \emph{future-oriented} trajectories that infer plausible behavioural continuations, (ii) \emph{semantically equivalent rephrasings} that retain user intent while diversifying interaction patterns, and (iii) \emph{preference-inconsistent counterfactuals} that serve as informative negatives. These synthesized sequences are jointly encoded within the federated backbone through a tri-view contrastive optimisation scheme, enabling richer representation learning without exposing sensitive information. Experimental results across three public benchmarks show that LUMOS achieves consistent gains over competitive centralised and federated baselines on HR@20 and NDCG@20. In addition, the use of semantically grounded positive signals and counterfactual negatives improves robustness under noisy and adversarial environments, even without dedicated server-side protection modules. Overall, this work demonstrates the potential of LLM-driven semantic generation as a new paradigm for advancing privacy-preserving federated recommendation.
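
A hedged sketch of how a tri-view contrastive objective over the three sequence variants might look, written as a multi-positive InfoNCE; LUMOS's exact loss is not reproduced here:

```python
import torch
import torch.nn.functional as F

def tri_view_contrastive(anchor, positives, counterfactual, tau=0.1):
    # anchor:         (B, d)        original user-sequence embeddings
    # positives:      (B, n_pos, d) continuations and rephrasings
    # counterfactual: (B, d)        preference-inconsistent hard negatives
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positives, dim=-1)
    n = F.normalize(counterfactual, dim=-1)
    pos = torch.einsum('bd,bkd->bk', a, p) / tau
    neg = (a * n).sum(dim=-1, keepdim=True) / tau
    logits = torch.cat([pos, neg], dim=1)  # per-anchor candidate scores
    return -(pos - torch.logsumexp(logits, dim=1, keepdim=True)).mean()
```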

[377] Clarifying Shampoo: Adapting Spectral Descent to Stochasticity and the Parameter Trajectory

Runa Eschenhagen, Anna Cai, Tsung-Hsien Lee, Hao-Jun Michael Shi

Main category: cs.LG

TL;DR: Shampoo optimizer achieves higher token efficiency than Muon on language models, similar to Adam’s advantage over Signum, with benefits coming from its application to weight matrices rather than variance adaptation or whitening.

DetailsMotivation: To understand the relationship between matrix-structured optimizers (Shampoo, Muon) and element-wise optimizers (Adam, Signum), and determine their relative data efficiency and underlying mechanisms in controlled settings.

Method: Extensive experiments on language models comparing Shampoo and Muon, with analysis showing Shampoo’s update can be decomposed into an adapted Muon update, focusing on weight matrix applications.

Result: Shampoo achieves higher token efficiency than Muon, mirroring Adam’s advantage over Signum. Shampoo’s benefits are exclusively attributed to its application to weight matrices, not variance adaptation or whitening.

Conclusion: Shampoo’s updates are time-averaged semi-orthogonal in expectation rather than enforcing semi-orthogonality, providing a new perspective that avoids shortcomings of previous interpretations based on variance adaptation and whitening.

Abstract: Optimizers leveraging the matrix structure in neural networks, such as Shampoo and Muon, are more data-efficient than element-wise algorithms like Adam and Signum. While in specific settings, Shampoo and Muon reduce to spectral descent analogous to how Adam and Signum reduce to sign descent, their general relationship and relative data efficiency under controlled settings remain unclear. Through extensive experiments on language models, we demonstrate that Shampoo achieves higher token efficiency than Muon, mirroring Adam’s advantage over Signum. We show that Shampoo’s update applied to weight matrices can be decomposed into an adapted Muon update. Consistent with this, Shampoo’s benefits can be exclusively attributed to its application to weight matrices, challenging interpretations agnostic to parameter shapes. This admits a new perspective that also avoids shortcomings of related interpretations based on variance adaptation and whitening: rather than enforcing semi-orthogonality as in spectral descent, Shampoo’s updates are time-averaged semi-orthogonal in expectation.
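
For contrast with Shampoo's time-averaged preconditioners, Muon's spectral-descent step can be sketched with the commonly used Newton-Schulz quintic that approximates U @ V.T from the SVD of the gradient (coefficients follow the widely circulated implementation; a sketch, not a reference implementation):

```python
import torch

def newton_schulz_orthogonalize(G, steps=5):
    # Approximately replace G by the semi-orthogonal factor U @ V.T of its SVD,
    # without computing an SVD; the core of a Muon-style update.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)
    transpose = X.shape[0] > X.shape[1]
    if transpose:
        X = X.T  # iterate on the short side for stability
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transpose else X
```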

[378] Effective MoE-based LLM Compression by Exploiting Heterogeneous Inter-Group Experts Routing Frequency and Information Density

Zhendong Mi, Yixiao Chen, Pu Zhao, Xiaodong Yu, Hao Wang, Yanzhi Wang, Shaoyi Huang

Main category: cs.LG

TL;DR: RFID-MoE: A compression framework for Mixture-of-Experts LLMs that uses heterogeneous routing frequency and information density to allocate compression ranks adaptively, with sparse projection to recover lost information.

DetailsMotivation: MoE-based LLMs have superior performance but massive memory overhead from storing multiple expert networks hinders practical deployment. Existing SVD-based compression methods use uniform rank allocation or rely only on static weight properties, overlooking heterogeneity in expert utilization patterns and information density.

Method: Proposes RFID-MoE framework with: 1) Fused metric combining expert activation frequency with effective rank to measure expert importance, adaptively allocating higher ranks to critical expert groups under fixed budget; 2) Parameter-efficient sparse projection mechanism to reconstruct compression residuals and recover lost information with minimal parameter overhead.

Result: Extensive experiments on MoE LLMs (Qwen3, DeepSeekMoE) show RFID-MoE consistently outperforms state-of-the-art methods like MoBE and D2-MoE. Achieves perplexity of 16.92 on PTB with Qwen3-30B at 60% compression ratio (8.0+ perplexity reduction vs baselines), improves zero-shot accuracy on HellaSwag by ~8%.

Conclusion: RFID-MoE effectively compresses MoE LLMs by exploiting heterogeneous routing patterns and information density, achieving better performance than existing methods while maintaining practical deployment feasibility through adaptive rank allocation and residual reconstruction.

Abstract: Mixture-of-Experts (MoE) based Large Language Models (LLMs) have achieved superior performance, yet the massive memory overhead caused by storing multiple expert networks severely hinders their practical deployment. Singular Value Decomposition (SVD)-based compression has emerged as a promising post-training technique; however, most existing methods apply uniform rank allocation or rely solely on static weight properties. This overlooks the substantial heterogeneity in expert utilization observed in MoE models, where frequent routing patterns and intrinsic information density vary significantly across experts. In this work, we propose RFID-MoE, an effective framework for MoE compression by exploiting heterogeneous Routing Frequency and Information Density. We first introduce a fused metric that combines expert activation frequency with effective rank to measure expert importance, adaptively allocating higher ranks to critical expert groups under a fixed budget. Moreover, instead of discarding compression residuals, we reconstruct them via a parameter-efficient sparse projection mechanism to recover lost information with minimal parameter overhead. Extensive experiments on representative MoE LLMs (e.g., Qwen3, DeepSeekMoE) across multiple compression ratios demonstrate that RFID-MoE consistently outperforms state-of-the-art methods like MoBE and D2-MoE. Notably, RFID-MoE achieves a perplexity of 16.92 on PTB with the Qwen3-30B model at a 60% compression ratio, reducing perplexity by over 8.0 compared to baselines, and improves zero-shot accuracy on HellaSwag by approximately 8%.
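
A minimal sketch of the two ingredients of the fused importance metric; `fused_importance` is a hypothetical combination rule, not the paper's exact formula:

```python
import numpy as np

def effective_rank(W):
    # Entropy-based effective rank (exp of the Shannon entropy of the
    # normalized singular values), a proxy for an expert's information density.
    s = np.linalg.svd(np.asarray(W), compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

def fused_importance(routing_freq, eff_rank, beta=0.5):
    # Hypothetical fusion of how often an expert is routed to and how much
    # information its weights carry; higher scores earn higher SVD ranks.
    return beta * routing_freq + (1.0 - beta) * eff_rank
```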

[379] SnareNet: Flexible Repair Layers for Neural Networks with Hard Constraints

Ya-Chi Chu, Alkiviades Boukas, Madeleine Udell

Main category: cs.LG

TL;DR: SnareNet is a neural network architecture with a differentiable repair layer that ensures outputs satisfy input-dependent nonlinear constraints through iterative feasibility steering and adaptive relaxation during training.

DetailsMotivation: Neural networks used as surrogate solvers and control policies often produce unconstrained predictions that can violate physical, operational, or safety requirements. There's a need for architectures that can learn mappings whose outputs must satisfy input-dependent nonlinear constraints while maintaining good objective quality.

Method: SnareNet appends a differentiable repair layer that navigates in the constraint map’s range space, steering iterates toward feasibility. It uses adaptive relaxation to design a relaxed feasible set that snares the neural network at initialization and shrinks into the feasible set, enabling early exploration and strict feasibility later in training.

Result: On optimization-learning and trajectory planning benchmarks, SnareNet consistently attains improved objective quality while satisfying constraints more reliably than prior work.

Conclusion: SnareNet provides an effective architecture for learning constraint-satisfying mappings with better objective quality and reliability than existing methods, addressing the critical need for neural networks that respect physical, operational, and safety constraints.

Abstract: Neural networks are increasingly used as surrogate solvers and control policies, but unconstrained predictions can violate physical, operational, or safety requirements. We propose SnareNet, a feasibility-controlled architecture for learning mappings whose outputs must satisfy input-dependent nonlinear constraints. SnareNet appends a differentiable repair layer that navigates in the constraint map’s range space, steering iterates toward feasibility and producing a repaired output that satisfies constraints to a user-specified tolerance. To stabilize end-to-end training, we introduce adaptive relaxation, which designs a relaxed feasible set that snares the neural network at initialization and shrinks it into the feasible set, enabling early exploration and strict feasibility later in training. On optimization-learning and trajectory planning benchmarks, SnareNet consistently attains improved objective quality while satisfying constraints more reliably than prior work.
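
A minimal sketch of a feasibility repair loop of this general kind, without the adaptive relaxation schedule and not SnareNet's exact layer; `g` maps an output to constraint values with the convention g(y) <= 0:

```python
import torch

def repair(y, g, n_steps=25, tol=1e-6, damping=0.5):
    # Damped Gauss-Newton steps in the range space of the constraint Jacobian
    # that drive the violations of g(y) <= 0 toward zero.
    for _ in range(n_steps):
        violation = torch.relu(g(y))  # elementwise max(g(y), 0)
        if violation.max() <= tol:
            break
        J = torch.autograd.functional.jacobian(g, y)  # (n_constraints, dim)
        step = torch.linalg.lstsq(J, violation.unsqueeze(-1)).solution.squeeze(-1)
        y = y - damping * step
    return y
```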

[380] Priority-Aware Shapley Value

Kiljae Lee, Ziqi Liu, Weijing Tang, Yuan Zhang

Main category: cs.LG

TL;DR: PASV extends Shapley values to handle precedence constraints and priority weights for more structure-faithful data valuation and feature attribution.

DetailsMotivation: Standard Shapley values assume contributors are interchangeable, which is problematic when there are dependencies (reused/augmented data, causal feature orderings) or when contributions should be adjusted by factors like trust or risk.

Method: Proposes Priority-Aware Shapley Value (PASV) that incorporates both hard precedence constraints and soft, contributor-specific priority weights. Develops an efficient adjacent-swap Metropolis-Hastings sampler for scalable Monte Carlo estimation.

Result: Experiments on data valuation (MNIST/CIFAR10) and feature attribution (Census Income) demonstrate more structure-faithful allocations and practical sensitivity analysis via “priority sweeping”.

Conclusion: PASV provides a principled extension of Shapley values that handles precedence constraints and priority weights, enabling more accurate data valuation and feature attribution in structured settings.

Abstract: Shapley values are widely used for model-agnostic data valuation and feature attribution, yet they implicitly assume contributors are interchangeable. This can be problematic when contributors are dependent (e.g., reused/augmented data or causal feature orderings) or when contributions should be adjusted by factors such as trust or risk. We propose Priority-Aware Shapley Value (PASV), which incorporates both hard precedence constraints and soft, contributor-specific priority weights. PASV is applicable to general precedence structures, recovers precedence-only and weight-only Shapley variants as special cases, and is uniquely characterized by natural axioms. We develop an efficient adjacent-swap Metropolis-Hastings sampler for scalable Monte Carlo estimation and analyze limiting regimes induced by extreme priority weights. Experiments on data valuation (MNIST/CIFAR10) and feature attribution (Census Income) demonstrate more structure-faithful allocations and a practical sensitivity analysis via our proposed “priority sweeping”.
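
A Monte Carlo sketch of the target quantity, sampling only precedence-admissible permutations and biasing the draw by priority weights; note the paper's actual estimator is an adjacent-swap Metropolis-Hastings sampler, and the sequential-sampling bias used here is illustrative:

```python
import numpy as np

def pasv_monte_carlo(value_fn, n, prereq, weight, n_perms=2000, seed=0):
    # prereq[i] is the set of contributors that must precede i; weight[i] is
    # a soft priority biasing how early i tends to appear.
    rng = np.random.default_rng(seed)
    phi = np.zeros(n)
    for _ in range(n_perms):
        remaining, order = set(range(n)), []
        while remaining:  # sample one admissible permutation
            ready = [i for i in remaining if prereq.get(i, set()) <= set(order)]
            w = np.array([weight[i] for i in ready], dtype=float)
            pick = ready[rng.choice(len(ready), p=w / w.sum())]
            order.append(pick)
            remaining.remove(pick)
        prev, coalition = value_fn(frozenset()), set()
        for i in order:  # accumulate marginal contributions along the order
            coalition.add(i)
            cur = value_fn(frozenset(coalition))
            phi[i] += cur - prev
            prev = cur
    return phi / n_perms
```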

[381] MacrOData: New Benchmarks of Thousands of Datasets for Tabular Outlier Detection

Xueying Ding, Simon Klüttermann, Haomin Wen, Yilong Chen, Leman Akoglu

Main category: cs.LG

TL;DR: MacrOData introduces a large-scale benchmark suite for tabular outlier detection with 2,446 datasets across three components: OddBench (real semantic anomalies), OvrBench (real statistical outliers), and SynBench (synthetic datasets).

DetailsMotivation: Existing outlier detection benchmarks like AdBench are limited to only 57 datasets, restricting diversity and statistical power. There's a need for comprehensive, large-scale benchmarks to fairly evaluate tabular OD methods.

Method: Created three benchmark components: OddBench (790 datasets with real semantic anomalies), OvrBench (856 datasets with real statistical outliers), and SynBench (800 synthetic datasets). Provided standardized train/test splits, public/private partitions, semantic metadata annotations, and an online leaderboard.

Result: MacrOData enables comprehensive evaluation of classical, deep, and foundation model OD methods across diverse configurations. The benchmark suite provides detailed empirical findings and practical guidelines for future research.

Conclusion: MacrOData addresses limitations of existing OD benchmarks through scale and diversity, offering statistically robust evaluation capabilities and serving as a valuable resource for the outlier detection community.

Abstract: Quality benchmarks are essential for fairly and accurately tracking scientific progress and enabling practitioners to make informed methodological choices. Outlier detection (OD) on tabular data underpins numerous real-world applications, yet existing OD benchmarks remain limited. The prominent OD benchmark AdBench is the de facto standard in the literature, yet comprises only 57 datasets. In addition to other shortcomings discussed in this work, its small scale severely restricts diversity and statistical power. We introduce MacrOData, a large-scale benchmark suite for tabular OD comprising three carefully curated components: OddBench, with 790 datasets containing real-world semantic anomalies; OvrBench, with 856 datasets featuring real-world statistical outliers; and SynBench, with 800 synthetically generated datasets spanning diverse data priors and outlier archetypes. Owing to its scale and diversity, MacrOData enables comprehensive and statistically robust evaluation of tabular OD methods. Our benchmarks further satisfy several key desiderata: We provide standardized train/test splits for all datasets, public/private benchmark partitions with held-out test labels for the latter reserved toward an online leaderboard, and annotate our datasets with semantic metadata. We conduct extensive experiments across all benchmarks, evaluating a broad range of OD methods comprising classical, deep, and foundation models, over diverse hyperparameter configurations. We report detailed empirical findings, practical guidelines, as well as individual performances as references for future research. All benchmarks containing 2,446 datasets combined are open-sourced, along with a publicly accessible leaderboard hosted at https://huggingface.co/MacrOData-CMU.

[382] Large Language Models for Designing Participatory Budgeting Rules

Nguyen Thach, Xingchen Sha, Hau Chan

Main category: cs.LG

TL;DR: LLMRule: A framework using large language models in evolutionary search to automatically design participatory budgeting rules that balance utility and fairness.

DetailsMotivation: Participatory budgeting rules traditionally require extensive domain knowledge and face trade-offs between utility and fairness. Recent advances in LLMs for algorithmic design offer potential to automate and improve PB rule design.

Method: LLMRule incorporates large language models into an evolutionary search procedure to automatically design PB rules. It leverages the resemblance between PB rules and knapsack problem algorithms, using LLMs to generate and refine rule candidates.

Result: Evaluated on 600+ real-world PB instances from multiple countries, LLM-generated rules generally outperform existing handcrafted rules in overall utility while maintaining similar fairness levels.

Conclusion: LLMs can effectively automate the design of participatory budgeting rules, achieving better utility-fairness trade-offs than manually crafted rules.

Abstract: Participatory budgeting (PB) is a democratic paradigm for deciding the funding of public projects given the residents’ preferences, which has been adopted in numerous cities across the world. The main focus of PB is designing rules, functions that return feasible budget allocations for a set of projects subject to some budget constraint. Designing PB rules that optimize both utility and fairness objectives based on agent preferences had been challenging due to the extensive domain knowledge required and the proven trade-off between the two notions. Recently, large language models (LLMs) have been increasingly employed for automated algorithmic design. Given the resemblance of PB rules to algorithms for classical knapsack problems, in this paper, we introduce a novel framework, named LLMRule, that addresses the limitations of existing works by incorporating LLMs into an evolutionary search procedure for automating the design of PB rules. Our experimental results, evaluated on more than 600 real-world PB instances obtained from the U.S., Canada, Poland, and the Netherlands with different representations of agent preferences, demonstrate that the LLM-generated rules generally outperform existing handcrafted rules in terms of overall utility while still maintaining a similar degree of fairness.

[383] Latent Poincaré Shaping for Agentic Reinforcement Learning

Hanchen Xia, Baoyou Chen, Zelin Zang, Yutang Ge, Guojiang Zhao, Siyu Zhu

Main category: cs.LG

TL;DR: LaPha trains AlphaZero-like LLM agents in Poincaré hyperbolic space for mathematical reasoning, using hyperbolic geometry to structure search trees and provide dense rewards, achieving strong performance on math benchmarks.

DetailsMotivation: The paper addresses the challenge of improving LLM reasoning capabilities for complex mathematical problems. Traditional methods lack efficient search structures and reward mechanisms for mathematical reasoning tasks.

Method: LaPha trains LLM agents in a Poincaré latent space where search trees grow outward from the origin toward the boundary. It uses hyperbolic geodesic distance to define node potentials and assign dense process rewards. A lightweight value head is attached to the shared latent space for self-guided test-time scaling.

Result: On MATH-500, LaPha improves Qwen2.5-Math-1.5B from 66.0% to 88.2%. With value-head-guided search, LaPha-1.5B reaches 56.7% accuracy on AIME'24, and LaPha-7B achieves 60.0% on AIME'24 and 53.3% on AIME'25.

Conclusion: LaPha demonstrates that training LLM agents in hyperbolic latent spaces with structured search and dense reward mechanisms significantly improves mathematical reasoning capabilities, enabling strong performance on challenging math competitions.

Abstract: We propose LaPha, a method for training AlphaZero-like LLM agents in a Poincaré latent space. Under LaPha, the search process can be visualized as a tree rooted at the prompt and growing outward from the origin toward the boundary of the Poincaré ball, where negative curvature provides exponentially increasing capacity with radius. Using the hyperbolic geodesic distance to rule-verified correct solutions, we define a node potential and assign dense process rewards via potential differences. We further attach a lightweight value head on the same shared latent space, enabling self-guided test-time scaling with almost no additional overhead. On MATH-500, LaPha improves Qwen2.5-Math-1.5B from 66.0% to 88.2%. With value-head-guided search, LaPha-1.5B reaches 56.7% accuracy on AIME'24, and LaPha-7B further achieves 60.0% on AIME'24 and 53.3% on AIME'25.
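
The geodesic distance on the Poincaré ball is standard and easy to state; the reward sketch below wires it into dense process rewards under the assumption that the node potential is the negated distance to a verified-correct embedding (`goal_z`), which may differ from LaPha's exact construction.

```python
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray) -> float:
    # Geodesic distance in the Poincare ball (all points must have norm < 1).
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u**2)) * (1.0 - np.sum(v**2))
    return float(np.arccosh(1.0 + 2.0 * sq / denom))

def process_reward(parent_z, child_z, goal_z):
    # Dense reward as a potential difference: a tree-expansion step that
    # moves geodesically closer to a rule-verified correct embedding earns
    # positive reward (assumed potential; the paper's may differ).
    return poincare_distance(parent_z, goal_z) - poincare_distance(child_z, goal_z)
```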

[384] Sparse Layer Sharpness-Aware Minimization for Efficient Fine-Tuning

Yifei Cheng, Xianglin Yang, Guoxia Wang, Chao Huang, Fei Ma, Dianhai Yu, Xiaochun Cao, Li Shen

Main category: cs.LG

TL;DR: SL-SAM introduces sparse layer selection to reduce computation cost in Sharpness-Aware Minimization by treating layer selection as a multi-armed bandit problem, achieving comparable performance with significantly fewer active parameters.

DetailsMotivation: SAM improves generalization by finding flat minima but doubles computation cost due to extra parameter perturbation steps, creating a practical bottleneck for implementation, especially in fine-tuning scenarios.

Method: Proposes SL-SAM that dynamically selects layers for gradient ascent (perturbation) and descent (update) steps using a multi-armed bandit approach based on gradient norms, reducing backpropagation computation by activating only a subset of layers.

Result: SL-SAM achieves comparable performance to state-of-the-art baselines (including #1 rank on LLM fine-tuning) while significantly reducing active parameters (47% on vision models, 22% on moderate models, 21% on large language models vs. 100% for vanilla SAM).

Conclusion: SL-SAM effectively breaks the computation bottleneck of SAM through sparse layer selection, maintaining performance while dramatically improving efficiency, making SAM more practical for real-world applications.

Abstract: Sharpness-aware minimization (SAM) seeks minima with a flat loss landscape to improve generalization performance in machine learning tasks, including fine-tuning. However, its extra parameter perturbation step doubles the computation cost, which becomes the bottleneck of SAM in practical implementations. In this work, we propose SL-SAM, an approach that breaks this bottleneck by introducing sparsity at the layer level. Our key innovation is to frame the dynamic selection of layers for both the gradient ascent (perturbation) and descent (update) steps as a multi-armed bandit problem. At the beginning of each iteration, SL-SAM samples a subset of the model's layers according to their gradient norms to participate in the backpropagation of the subsequent parameter perturbation and update steps, thereby reducing the computational complexity. We then provide an analysis guaranteeing the convergence of SL-SAM. In fine-tuning experiments across several tasks, SL-SAM achieves performance comparable to state-of-the-art baselines, including a #1 rank on LLM fine-tuning. Meanwhile, SL-SAM significantly reduces the ratio of active parameters in backpropagation compared to vanilla SAM (SL-SAM activates 47%, 22%, and 21% of parameters on the vision, moderate-size, and large language models, respectively, while vanilla SAM always activates 100%), verifying the efficiency of our proposed algorithm.
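
A sketch of how layer-sparse SAM could look in PyTorch; the bandit sampler and the per-layer gradient normalization are simplifications of the paper's scheme, and all interfaces here are assumptions.

```python
import torch

def sample_layers(grad_norm_ema: dict, k: int):
    # Bandit-style sampler: layers with larger recent gradient norms are
    # proportionally more likely to be activated this iteration.
    names, weights = zip(*grad_norm_ema.items())
    probs = torch.tensor(weights, dtype=torch.float)
    idx = torch.multinomial(probs / probs.sum(), k, replacement=False)
    return [names[i] for i in idx]

def sl_sam_step(model, loss_fn, batch, optimizer, active_layers, rho=0.05):
    # Only parameters under a sampled layer join the two backward passes.
    for name, p in model.named_parameters():
        p.requires_grad_(any(name.startswith(a) for a in active_layers))
    optimizer.zero_grad()
    loss_fn(model, batch).backward()      # ascent gradients, active layers only
    eps = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.requires_grad and p.grad is not None:
                e = rho * p.grad / (p.grad.norm() + 1e-12)  # per-layer normalization
                p.add_(e)
                eps[name] = e
    optimizer.zero_grad()
    loss_fn(model, batch).backward()      # descent gradients at the perturbed point
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in eps:
                p.sub_(eps[name])         # restore the original weights
    optimizer.step()
```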

[385] Squeezing More from the Stream: Learning Representation Online for Streaming Reinforcement Learning

Nilaksh, Antoine Clavaud, Mathieu Reymond, François Rivest, Sarath Chandar

Main category: cs.LG

TL;DR: Streaming RL with Self-Predictive Representations (SPR) to improve sample efficiency without replay buffers, using orthogonal gradient updates to handle correlated samples in streaming regime.

DetailsMotivation: Streaming RL is resource-efficient but sample-inefficient due to immediate discarding of transitions after single updates. Value-based losses alone struggle to extract meaningful representations from transient data, creating a performance gap compared to methods with replay buffers.

Method: Extends Self-Predictive Representations (SPR) to streaming pipeline to maximize utility of every observed frame. Introduces orthogonal gradient updates relative to momentum target to resolve training instabilities from correlated samples in streaming regime. Addresses gradient conflicts from streaming-specific optimizers.

Result: Systematically outperforms existing streaming baselines across Atari, MinAtar, and Octax suites. Latent-space analysis (t-SNE visualizations and effective-rank measurements) confirms learning significantly richer representations. Bridges performance gap caused by absence of replay buffer while remaining efficient enough to train on just a few CPU cores.

Conclusion: Proposed method successfully extends SPR to streaming RL, overcoming training instabilities from correlated samples through orthogonal gradient updates. Achieves state-of-the-art performance in streaming RL while maintaining computational efficiency, making it practical for on-device applications.

Abstract: In streaming Reinforcement Learning (RL), transitions are observed and discarded immediately after a single update. While this minimizes resource usage for on-device applications, it makes agents notoriously sample-inefficient, since value-based losses alone struggle to extract meaningful representations from transient data. We propose extending Self-Predictive Representations (SPR) to the streaming pipeline to maximize the utility of every observed frame. However, due to the highly correlated samples induced by the streaming regime, naively applying this auxiliary loss results in training instabilities. Thus, we introduce orthogonal gradient updates relative to the momentum target and resolve gradient conflicts arising from streaming-specific optimizers. Validated across the Atari, MinAtar, and Octax suites, our approach systematically outperforms existing streaming baselines. Latent-space analysis, including t-SNE visualizations and effective-rank measurements, confirms that our method learns significantly richer representations, bridging the performance gap caused by the absence of a replay buffer, while remaining efficient enough to train on just a few CPU cores.
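
The abstract's orthogonal gradient updates can be pictured with a PCGrad-style projection: the auxiliary SPR gradient is stripped of any component that opposes the main value-learning gradient. The paper defines orthogonality relative to the momentum target, so this per-parameter sketch is an approximation.

```python
import torch

def project_orthogonal(aux_grads, main_grads, eps=1e-12):
    # When the auxiliary (SPR) gradient conflicts with the main (TD)
    # gradient, remove its component along the main direction so the
    # auxiliary loss cannot undo the value update.
    projected = []
    for g_aux, g_main in zip(aux_grads, main_grads):
        dot = torch.sum(g_aux * g_main)
        if dot < 0:  # only project under conflict
            g_aux = g_aux - (dot / (g_main.norm() ** 2 + eps)) * g_main
        projected.append(g_aux)
    return projected
```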

[386] Learning with Multiple Correct Answers – A Trichotomy of Regret Bounds under Different Feedback Models

Alireza F. Pour, Farnam Mansouri, Shai Ben-David

Main category: cs.LG

TL;DR: Online learning with multiple correct answers where each instance has a set of valid labels, motivated by language generation tasks where prompts admit many acceptable completions.

DetailsMotivation: Motivated by language generation tasks where a prompt may have many acceptable completions but not every completion is acceptable, addressing the need for learning frameworks that handle multiple valid answers.

Method: Studies the problem under three feedback models, characterizes optimal mistake bounds in realizable setting using combinatorial dimensions, establishes trichotomy of regret bounds across models in agnostic setting.

Result: Characterizes optimal mistake bounds using combinatorial dimensions for each feedback model, establishes a trichotomy of regret bounds across models, and implies sample complexity bounds for batch setup.

Conclusion: Provides theoretical framework for online learning with multiple correct answers relevant to language generation tasks, with combinatorial dimension-based analysis of mistake and regret bounds.

Abstract: We study an online learning problem with multiple correct answers, where each instance admits a set of valid labels, and in each round the learner must output a valid label for the queried example. This setting is motivated by language generation tasks, in which a prompt may admit many acceptable completions, but not every completion is acceptable. We study this problem under three feedback models. For each model, we characterize the optimal mistake bound in the realizable setting using an appropriate combinatorial dimension. We then establish a trichotomy of regret bounds across the three models in the agnostic setting. Our results also imply sample complexity bounds for the batch setup that depend on the respective combinatorial dimensions.

[387] Reward-Guided Discrete Diffusion via Clean-Sample Markov Chain for Molecule and Biological Sequence Design

Prin Phunyaphibarn, Minhyuk Sung

Main category: cs.LG

TL;DR: CSMC Sampler: A test-time reward-guided sampling method for discrete diffusion models that avoids noisy intermediate rewards by constructing a Markov chain of clean samples using Metropolis-Hastings.

DetailsMotivation: Existing reward-guided discrete diffusion models underperform because they rely on noisy intermediate rewards from non-smooth reward functions common in scientific domains like chemistry and biology.

Method: Proposes Clean-Sample Markov Chain (CSMC) Sampler that constructs a Markov chain of clean samples using Metropolis-Hastings algorithm. Uses sequential forward and backward diffusion processes to design tractable proposal distribution, enabling local search without intermediate rewards.

Result: Experiments on molecule and biological sequence generation with various reward functions show CSMC consistently outperforms prior approaches that rely on intermediate rewards.

Conclusion: CSMC provides an effective test-time reward-guided sampling method for discrete diffusion models that avoids the limitations of noisy intermediate rewards, improving performance in scientific domains.

Abstract: Discrete diffusion models have recently emerged as a powerful class of generative models for chemistry and biology data. In these fields, the goal is to generate various samples with high rewards (e.g., drug-likeness in molecules), making reward-based guidance crucial. Most existing methods are based on guiding the diffusion model using intermediate rewards but tend to underperform since intermediate rewards are noisy due to the non-smooth nature of reward functions used in scientific domains. To address this, we propose Clean-Sample Markov Chain (CSMC) Sampler, a method that performs effective test-time reward-guided sampling for discrete diffusion models, enabling local search without relying on intermediate rewards. CSMC constructs a Markov chain of clean samples using the Metropolis-Hastings algorithm such that its stationary distribution is the target distribution. We design a proposal distribution by sequentially applying the forward and backward diffusion processes, making the acceptance probability tractable. Experiments on molecule and biological sequence generation with various reward functions demonstrate that our method consistently outperforms prior approaches that rely on intermediate rewards.
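
The chain itself is easy to sketch. Assuming the re-noise/denoise proposal approximately preserves the base model distribution, the Metropolis-Hastings ratio reduces to a reward ratio; the paper's forward/backward construction makes the exact acceptance tractable, so treat this as a simplified stand-in.

```python
import numpy as np

def csmc_sample(x0, reward, renoise_denoise, beta=1.0, steps=200, rng=None):
    # Clean-sample Markov chain: propose x' by partially re-noising x with
    # the forward diffusion and denoising it back, then accept/reject so the
    # stationary distribution tilts toward high reward.
    rng = rng or np.random.default_rng()
    x, r = x0, reward(x0)
    for _ in range(steps):
        x_prop = renoise_denoise(x)      # forward then backward diffusion
        r_prop = reward(x_prop)
        if np.log(rng.random()) < beta * (r_prop - r):  # simplified MH test
            x, r = x_prop, r_prop
    return x
```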

[388] Diffusion-Guided Pretraining for Brain Graph Foundation Models

Xinxu Wei, Rong Zhou, Lifang He, Yu Zhang

Main category: cs.LG

TL;DR: A diffusion-based pretraining framework for brain graph/hypergraph foundation models that uses structure-aware dropping/masking and topology-aware readout/reconstruction to learn better representations from connectome data.

DetailsMotivation: Existing graph-based pretraining methods for brain signals use naive random dropping/masking that disrupts semantically meaningful connectivity patterns in brain graphs/hypergraphs, and fail to capture global structural information through graph-level readout and reconstruction schemes.

Method: Proposes a unified diffusion-based pretraining framework that: 1) uses diffusion to guide structure-aware dropping and masking strategies to preserve brain graph semantics while maintaining pretraining diversity, and 2) enables topology-aware graph-level readout and node-level global reconstruction by allowing graph embeddings and masked nodes to aggregate information from globally related regions.

Result: Extensive experiments across multiple neuroimaging datasets with over 25,000 subjects and 60,000 scans involving various mental disorders and brain atlases demonstrate consistent performance improvements.

Conclusion: The diffusion-based framework effectively addresses limitations of existing contrastive and masked autoencoder methods for brain graph pretraining, providing better representations for foundation models of brain signals.

Abstract: With the growing interest in foundation models for brain signals, graph-based pretraining has emerged as a promising paradigm for learning transferable representations from connectome data. However, existing contrastive and masked autoencoder methods typically rely on naive random dropping or masking for augmentation, which is ill-suited for brain graphs and hypergraphs as it disrupts semantically meaningful connectivity patterns. Moreover, commonly used graph-level readout and reconstruction schemes fail to capture global structural information, limiting the robustness of learned representations. In this work, we propose a unified diffusion-based pretraining framework that addresses both limitations. First, diffusion is designed to guide structure-aware dropping and masking strategies, preserving brain graph semantics while maintaining effective pretraining diversity. Second, diffusion enables topology-aware graph-level readout and node-level global reconstruction by allowing graph embeddings and masked nodes to aggregate information from globally related regions. Extensive experiments across multiple neuroimaging datasets with over 25,000 subjects and 60,000 scans involving various mental disorders and brain atlases demonstrate consistent performance improvements.

[389] Taming the Monster Every Context: Complexity Measure and Unified Framework for Offline-Oracle Efficient Contextual Bandits

Hao Qin, Chicheng Zhang

Main category: cs.LG

TL;DR: OE2D reduces contextual bandit learning to offline regression with logarithmic oracle calls, using exploitative F-design action distributions and introducing Decision-Offline Estimation Coefficient (DOEC) complexity measure.

DetailsMotivation: The paper aims to bridge the gap between offline regression and online contextual bandit learning, seeking to reduce the computational burden of online bandit algorithms by leveraging offline regression oracles while maintaining near-optimal regret guarantees.

Method: Proposes OE2D framework that reduces contextual bandit learning to offline regression using an “exploitative F-design” action distribution that balances exploration and exploitation. Introduces Decision-Offline Estimation Coefficient (DOEC) as a new complexity measure and shows its relationship with Decision Estimation Coefficient (DEC).

Result: Achieves near-optimal regret with only O(log(T)) calls to offline regression oracle over T rounds (O(loglog(T)) when T is known). Shows DOEC is bounded in bounded Eluder dimension per-context and smoothed regret settings.

Conclusion: OE2D successfully bridges offline regression and online contextual bandit learning, providing an efficient algorithmic framework that reduces computational complexity while maintaining theoretical guarantees.

Abstract: We propose an algorithmic framework, Offline Estimation to Decisions (OE2D), that reduces contextual bandit learning with general reward function approximation to offline regression. The framework allows near-optimal regret for contextual bandits with large action spaces with $O(\log T)$ calls to an offline regression oracle over $T$ rounds, and makes $O(\log\log T)$ calls when $T$ is known. The design of the OE2D algorithm generalizes Falcon (Simchi-Levi and Xu, 2022) and its linear-reward version (Xu and Zeevi, 2020, Section 4) in that it chooses an action distribution, which we term the "exploitative F-design", that simultaneously guarantees low regret and good coverage, trading off exploration and exploitation. Central to our regret analysis is a new complexity measure, the Decision-Offline Estimation Coefficient (DOEC), which we show is bounded in the bounded-Eluder-dimension-per-context and smoothed-regret settings. We also establish a relationship between the DOEC and the Decision Estimation Coefficient (DEC) (Foster et al., 2021), bridging the design principles of offline- and online-oracle efficient contextual bandit algorithms for the first time.
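
As a reference point, the Falcon-style inverse-gap-weighting distribution that OE2D's "exploitative F-design" generalizes can be written in a few lines; the generalized design in the paper goes beyond this base case.

```python
import numpy as np

def inverse_gap_weighting(y_hat: np.ndarray, gamma: float) -> np.ndarray:
    # y_hat: offline-regression reward estimates for each of A actions.
    # Suboptimal actions get probability shrinking with their estimated gap;
    # the leftover mass goes to the empirically best (exploitative) action.
    A = len(y_hat)
    best = int(np.argmax(y_hat))
    p = np.zeros(A)
    for a in range(A):
        if a != best:
            p[a] = 1.0 / (A + gamma * (y_hat[best] - y_hat[a]))
    p[best] = 1.0 - p.sum()
    return p
```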

[390] Scalable and Reliable State-Aware Inference of High-Impact N-k Contingencies

Lihao Mai, Chenhan Xiao, Yang Weng

Main category: cs.LG

TL;DR: A scalable contingency inference framework using conditional diffusion models and graph neural networks to generate high-impact N-k outage scenarios without exhaustive combinatorial evaluation.

DetailsMotivation: Traditional exhaustive evaluation of all N-k contingency combinations is computationally prohibitive for power systems with increasing inverter-based resources and flexible loads, forcing operators to rely on heuristic methods without formal guarantees for identifying critical contingencies.

Method: Proposes a state-aware contingency inference framework with: 1) conditional diffusion model to generate candidate contingencies tailored to current operating state, 2) topology-aware graph neural network trained only on base and N-1 cases to construct high-risk training samples offline, and 3) controllable coverage guarantees for severe contingencies.

Result: Experiments on IEEE benchmark systems show the approach consistently evaluates higher-severity contingencies than uniform sampling for a given evaluation budget, allowing critical outages to be identified more reliably with reduced computational effort.

Conclusion: The framework provides a scalable solution for N-k contingency assessment with formal risk management capabilities, enabling operators to identify critical outages more reliably while managing computational budgets.

Abstract: Increasing penetration of inverter-based resources, flexible loads, and rapidly changing operating conditions make higher-order $N-k$ contingency assessment increasingly important but computationally prohibitive. Exhaustive evaluation of all outage combinations using AC power-flow or ACOPF is infeasible in routine operation. This fact forces operators to rely on heuristic screening methods whose ability to consistently retain all critical contingencies is not formally established. This paper proposes a scalable, state-aware contingency inference framework designed to directly generate high-impact $N-k$ outage scenarios without enumerating the combinatorial contingency space. The framework employs a conditional diffusion model to produce candidate contingencies tailored to the current operating state, while a topology-aware graph neural network trained only on base and $N-1$ cases efficiently constructs high-risk training samples offline. Finally, the framework is developed to provide controllable coverage guarantees for severe contingencies, allowing operators to explicitly manage the risk of missing critical events under limited AC power-flow evaluation budgets. Experiments on IEEE benchmark systems show that, for a given evaluation budget, the proposed approach consistently evaluates higher-severity contingencies than uniform sampling. This allows critical outages to be identified more reliably with reduced computational effort.

[391] Online Learning in MDPs with Partially Adversarial Transitions and Losses

Ofir Schlisselberg, Tal Lancewicki, Yishay Mansour

Main category: cs.LG

TL;DR: The paper studies reinforcement learning in MDPs with mostly stochastic transitions but Λ adversarial steps per episode, introducing conditioned occupancy measures and algorithms with improved regret bounds.

DetailsMotivation: To develop RL algorithms for environments that are mostly stable but have a few vulnerable points where transitions can be adversarial, capturing real-world scenarios where systems are generally predictable but have occasional adversarial disruptions.

Method: Introduces conditioned occupancy measures that remain stable across episodes despite adversarial transitions, and designs two algorithms: one for arbitrary adversarial steps and another for consecutive adversarial steps with improved dependence on state space size.

Result: Achieves regret bounds of Õ(H S^Λ√(K S A^{Λ+1})) for arbitrary adversarial steps and Õ(H√(K S^3 A^{Λ+1})) for consecutive adversarial steps, with a K^{2/3}-regret reduction that eliminates need to know adversarial step locations.

Conclusion: The paper provides theoretical foundations and algorithms for RL in mixed stochastic-adversarial environments, with improved regret bounds and characterization of fully adversarial settings.

Abstract: We study reinforcement learning in MDPs whose transition function is stochastic at most steps but may behave adversarially at a fixed subset of $\Lambda$ steps per episode. This model captures environments that are stable except at a few vulnerable points. We introduce \emph{conditioned occupancy measures}, which remain stable across episodes even with adversarial transitions, and use them to design two algorithms. The first handles arbitrary adversarial steps and achieves regret $\tilde{O}(H S^\Lambda \sqrt{K S A^{\Lambda+1}})$, where $K$ is the number of episodes, $S$ is the number of states, $A$ is the number of actions, and $H$ is the episode's horizon. The second, assuming the adversarial steps are consecutive, improves the dependence on $S$ to $\tilde{O}(H \sqrt{K S^{3} A^{\Lambda+1}})$. We further give a $K^{2/3}$-regret reduction that removes the need to know which steps are the $\Lambda$ adversarial steps. We also characterize the regret of adversarial MDPs in the \emph{fully adversarial} setting ($\Lambda = H-1$) for both full-information and bandit feedback, and provide almost matching upper and lower bounds (slightly strengthening existing lower bounds and clarifying how different feedback structures affect the hardness of learning).

[392] Adaptive recurrent flow map operator learning for reaction diffusion dynamics

Huseyin Tunc

Main category: cs.LG

TL;DR: DDOL-ART is a data-driven neural operator with adaptive recurrent training that learns stable reaction-diffusion dynamics from limited data, achieving zero-shot generalization to out-of-distribution patterns with reduced training costs.

DetailsMotivation: Learning stable operators for reaction-diffusion equations from data is challenging due to error accumulation in autoregressive rollouts and degradation with out-of-distribution initial conditions. Physics-based regularization adds assumptions and computational cost.

Method: DDOL-ART uses adaptive recurrent training with lightweight validation milestones that early-exit unproductive rollout segments and redirect optimization. Trained only on short-horizon in-distribution data, it learns one-step operators that remain stable in long rollouts.

Result: The method achieves zero-shot generalization to strong morphology shifts across FitzHugh-Nagumo, Gray-Scott, and Lambda-Omega systems, with several-fold faster training than physics-based approaches while maintaining competitive accuracy and stability.

Conclusion: Feedback-controlled recurrent training generates robust flow-map surrogates without PDE residuals, maintaining competitiveness with physics-based methods at significantly reduced training costs.

Abstract: Reaction-diffusion (RD) equations underpin pattern formation across chemistry, biology, and physics, yet learning stable operators that forecast their long-term dynamics from data remains challenging. Neural-operator surrogates provide resolution-robust prediction, but autoregressive rollouts can drift due to the accumulation of error, and out-of-distribution (OOD) initial conditions often degrade accuracy. Physics-based numerical residual objectives can regularize operator learning, although they introduce additional assumptions, sensitivity to discretization and loss design, and higher training cost. Here we develop a purely data-driven operator learner with adaptive recurrent training (DDOL-ART) using a robust recurrent strategy with lightweight validation milestones that early-exit unproductive rollout segments and redirect optimization. Trained only on a single in-distribution toroidal Gaussian family over short horizons, DDOL-ART learns one-step operators that remain stable under long rollouts and generalize zero-shot to strong morphology shifts across FitzHugh-Nagumo (FN), Gray-Scott (GS), and Lambda-Omega (LO) systems. Across these benchmarks, DDOL-ART delivers a strong accuracy and cost trade-off. It is several-fold faster than a physics-based numerical-loss operator learner (NLOL) under matched settings, and it remains competitive on both in-distribution stability and OOD robustness. Training-dynamics diagnostics show that adaptivity strengthens the correlation between validation error and OOD test error performance, acting as a feedback controller that limits optimization drift. Our results indicate that feedback-controlled recurrent training of DDOL-ART generates robust flow-map surrogates without PDE residuals, while simultaneously maintaining competitiveness with NLOL at significantly reduced training costs.
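
The adaptive recurrent training can be sketched as a rollout loss with milestone checks; `milestone_fn` is a hypothetical callback standing in for the paper's lightweight validation milestones.

```python
def recurrent_rollout_loss(model, u0, targets, milestone_fn, segment=5):
    # Unroll the learned one-step operator along the target trajectory, but
    # consult a cheap validation milestone every `segment` steps and
    # early-exit the rollout segment when it is no longer productive.
    # milestone_fn(step, running_loss) returns False to cut the rollout
    # short and redirect optimization (assumed interface).
    loss, u, k = 0.0, u0, 0
    for target in targets:
        k += 1
        u = model(u)
        loss = loss + ((u - target) ** 2).mean()
        if k % segment == 0 and not milestone_fn(k, loss / k):
            break
    return loss / max(k, 1)
```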

[393] Beware of the Batch Size: Hyperparameter Bias in Evaluating LoRA

Sangyoon Lee, Jaeho Lee

Main category: cs.LG

TL;DR: Batch size is the overlooked factor explaining conflicting performance of LoRA variants; proper batch size tuning makes vanilla LoRA competitive with complex variants, and batch size should be treated as a first-order design parameter.

DetailsMotivation: Many LoRA variants report conflicting empirical gains on the same benchmarks, creating confusion about which variants are actually better. The authors aim to identify the root cause of these contradictions and provide guidance for more reliable evaluations.

Method: The paper shows that batch size is the overlooked factor causing performance contradictions. They propose a proxy-based, cost-efficient strategy for batch size tuning and systematically study how rank, dataset size, and model capacity affect optimal batch size.

Result: When properly tuned for batch size, vanilla LoRA often matches the performance of more complex variants. The optimal batch size depends on rank, dataset size, and model capacity. Batch size tuning reconciles prior inconsistencies in LoRA variant evaluations.

Conclusion: Batch size should be elevated from a minor implementation detail to a first-order design parameter in LoRA fine-tuning. Proper batch size tuning enables more reliable evaluations of LoRA variants and can make simpler approaches competitive with complex ones.

Abstract: Low-rank adaptation (LoRA) is a standard approach for fine-tuning large language models, yet its many variants report conflicting empirical gains, often on the same benchmarks. We show that these contradictions arise from a single overlooked factor: the batch size. When properly tuned, vanilla LoRA often matches the performance of more complex variants. We further propose a proxy-based, cost-efficient strategy for batch size tuning, revealing the impact of rank, dataset size, and model capacity on the optimal batch size. Our findings elevate batch size from a minor implementation detail to a first-order design parameter, reconciling prior inconsistencies and enabling more reliable evaluations of LoRA variants.
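
A minimal version of proxy-based batch-size tuning: run a short fine-tuning probe at each candidate batch size and keep the winner. The paper's proxy is more refined than this; `train_fn` is a hypothetical interface.

```python
def tune_batch_size(train_fn, candidates=(16, 32, 64, 128, 256), proxy_steps=200):
    # train_fn(batch_size=..., max_steps=...) is assumed to run a short LoRA
    # fine-tuning probe and return a validation loss (hypothetical interface).
    scores = {bs: train_fn(batch_size=bs, max_steps=proxy_steps)
              for bs in candidates}
    return min(scores, key=scores.get)  # batch size with the best proxy loss
```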

[394] Computationally Efficient Replicable Learning of Parities

Moshe Noivirt, Jessica Sorrell, Eliad Tsfadia

Main category: cs.LG

TL;DR: Replicable PAC learning is shown to be computationally more powerful than statistical query learning, with an efficient replicable algorithm for learning parities over arbitrary distributions, bridging the gap between replicability and differential privacy.

DetailsMotivation: The paper investigates the computational relationship between replicability (a stability notion in learning) and other stability concepts like differential privacy and statistical query learning. While statistically, replicable learning and differentially private learning are equivalent and more powerful than SQ-learning, computationally, efficient replicable algorithms were previously limited to SQ-learnable tasks or restricted distributions, unlike differentially private learning which has broader computational capabilities.

Method: The main contribution is a new computationally efficient replicable algorithm for realizable learning of parities over arbitrary distributions. The key building block is an efficient replicable algorithm that, given a set of vectors, outputs a subspace of their linear span that covers most of them. This enables learning parities, a task known to be hard in the SQ-model but possible under differential privacy.

Result: The paper provides the first evidence that efficient replicable learning over general distributions strictly extends efficient SQ-learning and is closer in power to efficient differentially private learning, despite known computational separations between replicability and privacy.

Conclusion: This work establishes that replicable learning has greater computational power than previously known, bridging the gap between replicability and differential privacy in terms of computational capabilities, while maintaining the statistical equivalence between these stability notions.

Abstract: We study the computational relationship between replicability (Impagliazzo et al. [STOC 22], Ghazi et al. [NeurIPS 21]) and other stability notions. Specifically, we focus on replicable PAC learning and its connections to differential privacy (Dwork et al. [TCC 2006]) and to the statistical query (SQ) model (Kearns [JACM '98]). Statistically, it was known that differentially private learning and replicable learning are equivalent and strictly more powerful than SQ-learning. Yet, computationally, all previously known efficient (i.e., polynomial-time) replicable learning algorithms were confined to SQ-learnable tasks or restricted distributions, in contrast to differentially private learning. Our main contribution is the first computationally efficient replicable algorithm for realizable learning of parities over arbitrary distributions, a task that is known to be hard in the SQ-model, but possible under differential privacy. This result provides the first evidence that efficient replicable learning over general distributions strictly extends efficient SQ-learning, and is closer in power to efficient differentially private learning, despite computational separations between replicability and privacy. Our main building block is a new, efficient, and replicable algorithm that, given a set of vectors, outputs a subspace of their linear span that covers most of them.

[395] Improved Approximate Regret for Decentralized Online Continuous Submodular Maximization via Reductions

Yuanyu Wan, Yu Shen, Dingzhi Yu, Bo Xue, Mingli Song

Main category: cs.LG

TL;DR: Proposes reduction techniques from decentralized online continuous submodular maximization to decentralized online convex optimization to improve regret bounds and enable projection-free algorithms.

DetailsMotivation: Existing algorithms for decentralized online continuous submodular maximization have large gaps in regret bounds compared to convex settings, and projection-free algorithms can't match centralized performance. Need to address these limitations.

Method: Two reduction techniques from D-OCSM to D-OCO that exploit D-OCO algorithms to improve approximate regret bounds for general convex decision sets and downward-closed decision sets.

Result: For general convex decision sets, both issues (regret gap and projection-free limitations) can be addressed simultaneously. For downward-closed sets, the second issue can be addressed while significantly alleviating the first.

Conclusion: The proposed reductions successfully bridge the gap between D-OCSM and D-OCO, enabling better regret bounds and practical projection-free algorithms for decentralized online continuous submodular maximization.

Abstract: To expand the applicability of decentralized online learning, previous studies have proposed several algorithms for decentralized online continuous submodular maximization (D-OCSM) – a non-convex/non-concave setting with continuous DR-submodular reward functions. However, there exist large gaps between their approximate regret bounds and the regret bounds achieved in the convex setting. Moreover, if focusing on projection-free algorithms, which can efficiently handle complex decision sets, they cannot even recover the approximate regret bounds achieved in the centralized setting. In this paper, we first demonstrate that for D-OCSM over general convex decision sets, these two issues can be addressed simultaneously. Furthermore, for D-OCSM over downward-closed decision sets, we show that the second issue can be addressed while significantly alleviating the first issue. Our key techniques are two reductions from D-OCSM to decentralized online convex optimization (D-OCO), which can exploit D-OCO algorithms to improve the approximate regret of D-OCSM in these two cases, respectively.

[396] Towards Uniformity and Alignment for Multimodal Representation Learning

Wenzhe Yin, Pan Zhou, Zehao Xiao, Jie Liu, Shujian Yu, Jan-Jakob Sonke, Efstratios Gavves

Main category: cs.LG

TL;DR: Proposes decoupling alignment and uniformity in multimodal representation learning to address conflicts that cause distribution gaps across modalities, enabling both discriminative and generative tasks without task-specific modules.

DetailsMotivation: InfoNCE-based objectives in multimodal learning introduce inherent conflicts that create distribution gaps across modalities, especially as the number of modalities increases. Two key conflicts identified: (1) alignment-uniformity conflict where repulsion undermines pairwise alignment, and (2) intra-alignment conflict where aligning multiple modalities creates competing alignment directions.

Method: Proposes a principled decoupling of alignment and uniformity for multimodal representations. Provides a conflict-free recipe for multimodal learning that simultaneously supports both discriminative and generative use cases without requiring task-specific modules. The method acts as an efficient proxy for a global Hölder divergence over multiple modality distributions.

Result: Extensive experiments on retrieval and UnCLIP-style generation demonstrate consistent gains. The theoretical analysis shows the method reduces distribution gaps among modalities by serving as an efficient proxy for global Hölder divergence.

Conclusion: Decoupling alignment and uniformity addresses fundamental conflicts in multimodal representation learning, enabling better semantic alignment across modalities while supporting both discriminative and generative applications without task-specific modifications.

Abstract: Multimodal representation learning aims to construct a shared embedding space in which heterogeneous modalities are semantically aligned. Despite strong empirical results, InfoNCE-based objectives introduce inherent conflicts that yield distribution gaps across modalities. In this work, we identify two conflicts in the multimodal regime, both exacerbated as the number of modalities increases: (i) an alignment-uniformity conflict, whereby the repulsion of uniformity undermines pairwise alignment, and (ii) an intra-alignment conflict, where aligning multiple modalities induces competing alignment directions. To address these issues, we propose a principled decoupling of alignment and uniformity for multimodal representations, providing a conflict-free recipe for multimodal learning that simultaneously supports discriminative and generative use cases without task-specific modules. We then provide a theoretical guarantee that our method acts as an efficient proxy for a global Hölder divergence over multiple modality distributions, and thus reduces the distribution gap among modalities. Extensive experiments on retrieval and UnCLIP-style generation demonstrate consistent gains.
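
For orientation, these are the standard alignment and uniformity terms (Wang and Isola, 2020) that the paper decouples; its conflict-free multimodal combination of them goes beyond this two-modality sketch.

```python
import torch

def alignment_loss(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Pull matched pairs together; x, y are L2-normalized embeddings of
    # paired samples from two modalities, shape (N, D).
    return (x - y).pow(2).sum(dim=1).mean()

def uniformity_loss(x: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    # Spread embeddings over the unit hypersphere via the Gaussian potential.
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```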

[397] Beyond Student: An Asymmetric Network for Neural Network Inheritance

Yiyun Zhou, Jingwei Shi, Mingjing Xu, Zhonghua Jiang, Jingyuan Chen

Main category: cs.LG

TL;DR: InherNet: A neural network inheritance method using asymmetric low-rank decomposition of teacher weights to create lightweight expressive networks without architectural disruption, outperforming traditional KD students of similar size.

DetailsMotivation: Traditional Knowledge Distillation (KD) has capacity gap limitations between teacher and student networks. The authors explore whether a network can inherit both teacher structure and knowledge more effectively than standard KD students.

Method: Proposes InherNet which performs asymmetric low-rank decomposition on teacher weights using Singular Value Decomposition (SVD) for initialization, reconstructing a lightweight network while preserving principal knowledge. Balances depth, width, and compression efficiency.

Result: Experimental results across unimodal and multimodal tasks show InherNet achieves higher performance compared to student networks of similar parameter sizes.

Conclusion: InherNet reveals a promising direction for efficient model compression beyond traditional distillation, enabling better knowledge inheritance from teacher networks.

Abstract: Knowledge Distillation (KD) has emerged as a powerful technique for model compression, enabling lightweight student networks to benefit from the performance of redundant teacher networks. However, the inherent capacity gap often limits the performance of student networks. Inspired by the expressiveness of pretrained teacher networks, a compelling research question arises: is there a type of network that can not only inherit the teacher’s structure but also maximize the inheritance of its knowledge? Furthermore, how does the performance of such an inheriting network compare to that of student networks, all benefiting from the same teacher network? To further explore this question, we propose InherNet, a neural network inheritance method that performs asymmetric low-rank decomposition on the teacher’s weights and reconstructs a lightweight yet expressive network without significant architectural disruption. By leveraging Singular Value Decomposition (SVD) for initialization to ensure the inheritance of principal knowledge, InherNet effectively balances depth, width, and compression efficiency. Experimental results across unimodal and multimodal tasks demonstrate that InherNet achieves higher performance compared to student networks of similar parameter sizes. Our findings reveal a promising direction for future research in efficient model compression beyond traditional distillation.
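
The SVD-based initialization is straightforward to sketch: keep the top singular directions of each teacher weight and split them into two thin factors. The exact asymmetric split used by InherNet may differ.

```python
import torch

def svd_inherit(weight: torch.Tensor, rank: int):
    # Factor a teacher weight W (out_dim x in_dim) into A @ B, keeping the
    # top-`rank` singular directions so the principal knowledge is inherited.
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    sqrt_s = S[:rank].sqrt()
    A = U[:, :rank] * sqrt_s            # (out_dim, rank): scales columns
    B = sqrt_s[:, None] * Vh[:rank]     # (rank, in_dim): scales rows
    return A, B                         # A @ B approximates W
```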

[398] Rashomon Sets and Model Multiplicity in Federated Learning

Xenia Heilmann, Luca Corbucci, Mattia Cerrato

Main category: cs.LG

TL;DR: Formalizing Rashomon sets for Federated Learning to understand model multiplicity in decentralized settings where multiple clients train models without sharing data.

DetailsMotivation: Existing Rashomon set definitions assume centralized learning and don't extend to Federated Learning (FL), where multiple clients train models collaboratively without sharing raw data. In FL, choosing a single best model may homogenize predictive behavior, amplify biases, or undermine fairness across diverse clients with heterogeneous data distributions.

Method: 1) Adapt Rashomon set definitions to FL with three perspectives: global Rashomon set (aggregated statistics), t-agreement Rashomon set (intersection of local sets across fraction t of clients), and individual Rashomon sets (client-specific). 2) Show how to estimate multiplicity metrics under FL’s privacy constraints. 3) Introduce multiplicity-aware FL pipeline and conduct empirical study on standard FL benchmark datasets.

Result: All three proposed federated Rashomon set definitions offer valuable insights, enabling clients to deploy models that better align with their local data, fairness considerations, and practical requirements.

Conclusion: Formalizing Rashomon sets for FL provides crucial tools for understanding model multiplicity in decentralized settings, helping address fairness, transparency, and robustness challenges in federated learning environments.

Abstract: The Rashomon set captures the collection of models that achieve near-identical empirical performance yet may differ substantially in their decision boundaries. Understanding the differences among these models, i.e., their multiplicity, is recognized as a crucial step toward model transparency, fairness, and robustness, as it reveals decision boundary instabilities that standard metrics obscure. However, the existing definitions of the Rashomon set and multiplicity metrics assume centralized learning and do not extend naturally to decentralized, multi-party settings like Federated Learning (FL). In FL, multiple clients collaboratively train models under a central server’s coordination without sharing raw data, which preserves privacy but introduces challenges from heterogeneous client data distributions and communication constraints. In this setting, the choice of a single best model may homogenize predictive behavior across diverse clients, amplify biases, or undermine fairness guarantees. In this work, we provide the first formalization of Rashomon sets in FL. First, we adapt the Rashomon set definition to FL, distinguishing among three perspectives: (I) a global Rashomon set defined over aggregated statistics across all clients, (II) a t-agreement Rashomon set representing the intersection of local Rashomon sets across a fraction t of clients, and (III) individual Rashomon sets specific to each client’s local distribution. Second, we show how standard multiplicity metrics can be estimated under FL’s privacy constraints. Finally, we introduce a multiplicity-aware FL pipeline and conduct an empirical study on standard FL benchmark datasets. Our results demonstrate that all three proposed federated Rashomon set definitions offer valuable insights, enabling clients to deploy models that better align with their local data, fairness considerations, and practical requirements.
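
Definition (II) in particular is compact enough to sketch directly; the loss-matrix interface below is an illustrative assumption.

```python
import numpy as np

def t_agreement_rashomon(local_losses: np.ndarray, eps: float, t: float):
    # local_losses[c, m]: loss of candidate model m on client c's data.
    # A model is in client c's local Rashomon set if it is within eps of
    # c's best model; it is in the t-agreement set if that holds for at
    # least a fraction t of clients.
    best = local_losses.min(axis=1, keepdims=True)
    in_local = local_losses <= best + eps           # (clients, models)
    return np.where(in_local.mean(axis=0) >= t)[0]  # surviving model indices
```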

[399] Learning to Discover Iterative Spectral Algorithms

Zihang Liu, Oleg Balabanov, Yaoqing Yang, Michael W. Mahoney

Main category: cs.LG

TL;DR: AutoSpec is a neural framework that discovers iterative spectral algorithms for numerical linear algebra tasks by learning recurrence coefficients from spectral information, achieving significant performance improvements on real-world matrices.

DetailsMotivation: The paper aims to automate the discovery of efficient iterative algorithms for large-scale numerical linear algebra problems, moving beyond hand-designed methods to leverage machine learning for algorithm design.

Method: AutoSpec uses self-supervised neural networks that take coarse spectral information (eigenvalue estimates, residual norms) as input and predict recurrence coefficients for matrix polynomials. The framework features: 1) an architecture implementing executable numerical linear algebra recurrences, 2) efficient training on small synthetic problems with transfer to large-scale operators, and 3) task-defined objectives enforcing desired approximation/preconditioning behavior.

Result: On real-world matrices, AutoSpec delivers orders-of-magnitude improvements in accuracy and/or reductions in iteration count compared to basic baselines. The learned polynomials exhibit near-equiripple, near-minimax behavior similar to classical Chebyshev polynomials.

Conclusion: AutoSpec successfully demonstrates that neural networks can discover effective iterative spectral algorithms for numerical linear algebra tasks, bridging classical theory with modern machine learning approaches.

Abstract: We introduce AutoSpec, a neural network framework for discovering iterative spectral algorithms for large-scale numerical linear algebra and numerical optimization. Our self-supervised models adapt to input operators using coarse spectral information (e.g., eigenvalue estimates and residual norms), and they predict recurrence coefficients for computing or applying a matrix polynomial tailored to a downstream task. The effectiveness of AutoSpec relies on three ingredients: an architecture whose inference pass implements short, executable numerical linear algebra recurrences; efficient training on small synthetic problems with transfer to large-scale real-world operators; and task-defined objectives that enforce the desired approximation or preconditioning behavior across the range of spectral profiles represented in the training set. We apply AutoSpec to discovering algorithms for representative numerical linear algebra tasks: accelerating matrix-function approximation; accelerating sparse linear solvers; and spectral filtering/preconditioning for eigenvalue computations. On real-world matrices, the learned procedures deliver orders-of-magnitude improvements in accuracy and/or reductions in iteration count, relative to basic baselines. We also find clear connections to classical theory: the induced polynomials often exhibit near-equiripple, near-minimax behavior characteristic of Chebyshev polynomials.
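
The executable-recurrence idea can be illustrated with a short momentum-style iteration whose step coefficients would, in AutoSpec, be predicted by the network from coarse spectral information; here they are simply passed in as arrays.

```python
import numpy as np

def apply_learned_polynomial(matvec, b, alphas, betas):
    # Each step applies x_{k+1} = x_k + a_k * A x_k + b_k * (x_k - x_{k-1}),
    # so the final iterate is a polynomial in A applied to b, in the spirit
    # of Chebyshev/heavy-ball recurrences (coefficient meaning is an assumption).
    x_prev = np.zeros_like(b)
    x = b.copy()
    for a_k, b_k in zip(alphas, betas):
        x_next = x + a_k * matvec(x) + b_k * (x - x_prev)
        x_prev, x = x, x_next
    return x
```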

[400] ECG-IMN: Interpretable Mesomorphic Neural Networks for 12-Lead Electrocardiogram Interpretation

Vajira Thambawita, Jonas L. Isaksen, Jørgen K. Kanters, Hugo L. Hammer, Pål Halvorsen

Main category: cs.LG

TL;DR: ECG-IMN: An interpretable neural network for ECG diagnosis that generates sample-specific linear models with transparent feature attribution maps

DetailsMotivation: Deep learning achieves expert-level ECG diagnosis but lacks interpretability, hindering clinical deployment. Existing explainability methods are unstable, computationally expensive, and unfaithful to actual model decisions.

Method: ECG-IMN uses a hypernetwork architecture where a deep convolutional backbone generates parameters for a strictly linear model specific to each input sample. A transition decoder maps latent features to sample-wise weights, enabling precise localization of pathological evidence in time and lead dimensions.

Result: On PTB-XL dataset, ECG-IMN achieves competitive predictive performance (AUROC comparable to black-box baselines) while providing faithful, instance-specific explanations with exact, high-resolution feature attribution maps.

Conclusion: The framework bridges deep learning capability with clinical trustworthiness by enforcing intrinsic interpretability through mathematically transparent decision logic, offering a principled path toward “white-box” cardiac diagnostics.

Abstract: Deep learning has achieved expert-level performance in automated electrocardiogram (ECG) diagnosis, yet the “black-box” nature of these models hinders their clinical deployment. Trust in medical AI requires not just high accuracy but also transparency regarding the specific physiological features driving predictions. Existing explainability methods for ECGs typically rely on post-hoc approximations (e.g., Grad-CAM and SHAP), which can be unstable, computationally expensive, and unfaithful to the model’s actual decision-making process. In this work, we propose the ECG-IMN, an Interpretable Mesomorphic Neural Network tailored for high-resolution 12-lead ECG classification. Unlike standard classifiers, the ECG-IMN functions as a hypernetwork: a deep convolutional backbone generates the parameters of a strictly linear model specific to each input sample. This architecture enforces intrinsic interpretability, as the decision logic is mathematically transparent and the generated weights (W) serve as exact, high-resolution feature attribution maps. We introduce a transition decoder that effectively maps latent features to sample-wise weights, enabling precise localization of pathological evidence (e.g., ST-elevation, T-wave inversion) in both time and lead dimensions. We evaluate our approach on the PTB-XL dataset for classification tasks, demonstrating that the ECG-IMN achieves competitive predictive performance (AUROC comparable to black-box baselines) while providing faithful, instance-specific explanations. By explicitly decoupling parameter generation from prediction execution, our framework bridges the gap between deep learning capability and clinical trustworthiness, offering a principled path toward “white-box” cardiac diagnostics.
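
The hypernetwork pattern the abstract describes is a backbone that emits, for each ECG, the weights of a strictly linear classifier over the raw input; those weights are then exact attribution maps. Layer sizes and the downsampled signal length below are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class MesomorphicECG(nn.Module):
    def __init__(self, leads=12, length=1000, n_classes=5, hidden=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv1d(leads, 64, kernel_size=15, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, hidden), nn.ReLU(),
        )
        # "Transition decoder": latent -> per-sample linear weights and bias.
        self.decoder = nn.Linear(hidden, n_classes * leads * length + n_classes)
        self.n_classes, self.flat = n_classes, leads * length

    def forward(self, x):                        # x: (B, leads, length)
        theta = self.decoder(self.backbone(x))
        W = theta[:, :-self.n_classes].view(-1, self.n_classes, self.flat)
        b = theta[:, -self.n_classes:]
        logits = torch.bmm(W, x.flatten(1).unsqueeze(2)).squeeze(2) + b
        return logits, W                         # W doubles as an attribution map
```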

[401] Training deep physical neural networks with local physical information bottleneck

Hao Wang, Ziao Wang, Xiangpeng Liang, Han Zhao, Jianqi Hu, Junjie Jiang, Xing Fu, Jianshi Tang, Huaqiang Wu, Sylvain Gigan, Qiang Liu

Main category: cs.LG

TL;DR: Physical Information Bottleneck (PIB) is a universal training framework for deep physical neural networks that integrates information theory and local learning to enable efficient AI execution on analog physical substrates.

DetailsMotivation: Deep learning faces growing energy and latency constraints, while physical neural networks (PNNs) offer energy-efficient, ultrafast AI execution by exploiting analog dynamics. However, realizing this potential requires universal training methods tailored to physical intricacies.

Method: PIB integrates information theory and local learning, allocating matrix-based information bottlenecks to each unit. It bypasses auxiliary digital models and contrastive measurements, recasting PNN training as an intrinsic, scalable information-theoretic process.

Result: Demonstrated supervised, unsupervised, and reinforcement learning across electronic memristive chips and optical computing platforms. PIB adapts to severe hardware faults and allows parallel training via geographically distributed resources.

Conclusion: PIB provides a general and efficient framework for training deep physical neural networks under arbitrary physical dynamics, enabling energy-efficient, ultrafast AI execution compatible with diverse physical substrates.

Abstract: Deep learning has revolutionized modern society but faces growing energy and latency constraints. Deep physical neural networks (PNNs) are interconnected computing systems that directly exploit analog dynamics for energy-efficient, ultrafast AI execution. Realizing this potential, however, requires universal training methods tailored to physical intricacies. Here, we present the Physical Information Bottleneck (PIB), a general and efficient framework that integrates information theory and local learning, enabling deep PNNs to learn under arbitrary physical dynamics. By allocating matrix-based information bottlenecks to each unit, we demonstrate supervised, unsupervised, and reinforcement learning across electronic memristive chips and optical computing platforms. PIB also adapts to severe hardware faults and allows for parallel training via geographically distributed resources. Bypassing auxiliary digital models and contrastive measurements, PIB recasts PNN training as an intrinsic, scalable information-theoretic process compatible with diverse physical substrates.
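
One standard way to build a matrix-based information bottleneck is from the matrix-based Rényi entropy of a normalized Gram matrix; whether PIB uses exactly this estimator is an assumption.

```python
import numpy as np

def matrix_renyi_entropy(K: np.ndarray, alpha: float = 2.0) -> float:
    # K: a positive semi-definite Gram matrix over a batch of measurements.
    # Normalizing to unit trace and summing eigenvalue powers gives the
    # matrix-based Renyi entropy S_alpha = log2(sum_i lambda_i^alpha) / (1 - alpha).
    A = K / np.trace(K)
    eig = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    return float(np.log2(np.sum(eig ** alpha)) / (1.0 - alpha))
```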

[402] Mitigating the Likelihood Paradox in Flow-based OOD Detection via Entropy Manipulation

Donghwan Kim, Hyunsoo Yoon

Main category: cs.LG

TL;DR: A method to improve OOD detection in normalizing flows by manipulating input entropy based on semantic similarity to in-distribution data, without retraining the model.

DetailsMotivation: Normalizing flows and other likelihood-based generative models often assign high likelihoods to out-of-distribution inputs, creating a "likelihood paradox" where OOD samples appear more likely than in-distribution ones.

Method: Manipulate input entropy by applying stronger perturbations to inputs that are less semantically similar to an in-distribution memory bank. The approach controls entropy without retraining the density model.

Result: Consistent AUROC improvements over baseline likelihood-based OOD detectors on standard benchmarks, supporting the theoretical analysis.

Conclusion: Entropy control based on semantic similarity effectively mitigates the likelihood paradox in normalizing flows, improving OOD detection without model retraining.

Abstract: Deep generative models that can tractably compute input likelihoods, including normalizing flows, often assign unexpectedly high likelihoods to out-of-distribution (OOD) inputs. We mitigate this likelihood paradox by manipulating input entropy based on semantic similarity, applying stronger perturbations to inputs that are less similar to an in-distribution memory bank. We provide a theoretical analysis showing that entropy control increases the expected log-likelihood gap between in-distribution and OOD samples in favor of the in-distribution, and we explain why the procedure works without any additional training of the density model. We then evaluate our method against likelihood-based OOD detectors on standard benchmarks and find consistent AUROC improvements over baselines, supporting our explanation.
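
A sketch of the entropy manipulation step: noise strength grows as the input moves away from the in-distribution memory bank, and the perturbed input is then scored by the flow. The similarity measure and noise schedule here are illustrative assumptions.

```python
import torch

def entropy_scaled_perturb(x, memory_bank, sigma_max=0.2):
    # Cosine similarity of each input to its closest memory-bank entry.
    sims = torch.nn.functional.cosine_similarity(
        x.flatten(1).unsqueeze(1), memory_bank.flatten(1).unsqueeze(0), dim=2)
    s = sims.max(dim=1).values.clamp(0, 1)       # in [0, 1]; 1 = very ID-like
    sigma = sigma_max * (1.0 - s)                # dissimilar -> stronger noise
    return x + sigma.view(-1, *([1] * (x.dim() - 1))) * torch.randn_like(x)

# OOD score, assuming `flow` exposes log_prob:
# score = -flow.log_prob(entropy_scaled_perturb(x, bank))
```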

[403] Rollout-Training Co-Design for Efficient LLM-Based Multi-Agent Reinforcement Learning

Zhida Jiang, Zhaolong Xing, Jiawei Lu, Yipei Niu, Qingyuan Sang, Liangxu Zhang, Wenquan Dai, Junhua Shu, Jiaxing Wang, Qiangyu Pei, Qiong Chen, Xinyu Liu, Fangming Liu, Ai Han, Zhen Chen, Ke Zhang

Main category: cs.LG

TL;DR: FlexMARL is an end-to-end training framework for large-scale LLM-based multi-agent reinforcement learning that optimizes rollout, training, and their orchestration to address system-level challenges like synchronization barriers and resource underutilization.

DetailsMotivation: Existing training frameworks are optimized for single-agent scenarios and fail to address unique system-level challenges in MARL, including rollout-training synchronization barriers, rollout load imbalance, and training resource underutilization.

Method: FlexMARL introduces a joint orchestrator for data flow management under rollout-training disaggregated architecture, uses micro-batch driven asynchronous pipeline with experience store to eliminate synchronization barriers, implements parallel sampling with hierarchical load balancing in rollout engine, and achieves on-demand hardware binding through agent-centric resource allocation in training engine.

Result: Empirical results on a large-scale production cluster demonstrate that FlexMARL achieves up to 7.3x speedup and improves hardware utilization by up to 5.6x compared to existing frameworks.

Conclusion: FlexMARL successfully addresses system-level challenges in large-scale MARL training through holistic optimization of rollout, training, and their orchestration, significantly improving performance and resource utilization.

Abstract: Despite algorithm-level innovations for multi-agent reinforcement learning (MARL), the underlying networked infrastructure for large-scale MARL training remains underexplored. Existing training frameworks primarily optimize for single-agent scenarios and fail to address the unique system-level challenges of MARL, including rollout-training synchronization barriers, rollout load imbalance, and training resource underutilization. To bridge this gap, we propose FlexMARL, the first end-to-end training framework that holistically optimizes rollout, training, and their orchestration for large-scale LLM-based MARL. Specifically, FlexMARL introduces the joint orchestrator to manage data flow under the rollout-training disaggregated architecture. Building upon the experience store, a novel micro-batch driven asynchronous pipeline eliminates the synchronization barriers while providing strong consistency guarantees. Rollout engine adopts a parallel sampling scheme combined with hierarchical load balancing, which adapts to skewed inter/intra-agent request patterns. Training engine achieves on-demand hardware binding through agent-centric resource allocation. The training states of different agents are swapped via unified and location-agnostic communication. Empirical results on a large-scale production cluster demonstrate that FlexMARL achieves up to 7.3x speedup and improves hardware utilization by up to 5.6x compared to existing frameworks.
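
The micro-batch asynchronous pipeline can be pictured as a bounded experience store between rollout producers and the trainer; the thread-based sketch below stands in for FlexMARL's distributed, disaggregated implementation, and all interfaces are hypothetical.

```python
import queue
import threading

def run_async_pipeline(rollout_fn, train_fn, micro_batches=64, capacity=8):
    # Rollout pushes finished micro-batches into a bounded experience store
    # while the trainer consumes them, removing the per-step rollout-training
    # synchronization barrier.
    store = queue.Queue(maxsize=capacity)

    def producer():
        for i in range(micro_batches):
            store.put(rollout_fn(i))   # blocks only when the store is full
        store.put(None)                # end-of-stream marker

    threading.Thread(target=producer, daemon=True).start()
    while (mb := store.get()) is not None:
        train_fn(mb)                   # trains while rollout keeps producing
```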

[404] Why the Counterintuitive Phenomenon of Likelihood Rarely Appears in Tabular Anomaly Detection with Deep Generative Models?

Donghwan Kim, Junghun Phee, Hyunsoo Yoon

Main category: cs.LG

TL;DR: Normalizing flows for anomaly detection in tabular data show consistent performance without the counterintuitive likelihood behavior seen in image domains, making them practical for tabular anomaly detection.

DetailsMotivation: Address the counterintuitive phenomenon where deep generative models sometimes assign higher likelihoods to anomalous data in image domains, and investigate whether this occurs in tabular settings.

Method: Introduce domain-agnostic formulation to detect and evaluate the counterintuitive phenomenon, conduct extensive experiments on 47 tabular datasets and 10 CV/NLP embedding datasets using ADBench benchmark with 13 baseline models, and analyze theoretical and empirical perspectives focusing on data dimensionality and feature correlation differences.

Result: The counterintuitive phenomenon is consistently rare in tabular data, with normalizing flows showing reliable performance for anomaly detection in tabular domains using likelihood-only scoring.

Conclusion: Likelihood-only detection with normalizing flows offers a practical and reliable approach for anomaly detection in tabular domains, unlike in image domains where the counterintuitive behavior is more common.

Abstract: Deep generative models with tractable and analytically computable likelihoods, exemplified by normalizing flows, offer an effective basis for anomaly detection through likelihood-based scoring. We demonstrate that, unlike in the image domain where deep generative models frequently assign higher likelihoods to anomalous data, such counterintuitive behavior occurs far less often in tabular settings. We first introduce a domain-agnostic formulation that enables consistent detection and evaluation of the counterintuitive phenomenon, addressing the absence of a precise definition. Through extensive experiments on 47 tabular datasets and 10 CV/NLP embedding datasets in ADBench, benchmarked against 13 baseline models, we demonstrate that the phenomenon, as defined, is consistently rare in general tabular data. We further investigate this phenomenon from both theoretical and empirical perspectives, focusing on the roles of data dimensionality and differences in feature correlation. Our results suggest that likelihood-only detection with normalizing flows offers a practical and reliable approach for anomaly detection in tabular domains.
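
As a minimal illustration of likelihood-only scoring, with one simple way to flag the counterintuitive phenomenon (our operationalization, not the paper's domain-agnostic formulation):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def likelihood_ood_report(ll_normal, ll_anomalous):
    """Score anomalies by negative log-likelihood and flag the
    counterintuitive case where anomalies receive higher average likelihood
    than in-distribution data (a crude check, not the paper's definition)."""
    scores = np.concatenate([-ll_normal, -ll_anomalous])
    labels = np.concatenate([np.zeros(len(ll_normal)), np.ones(len(ll_anomalous))])
    return {
        "auroc": roc_auc_score(labels, scores),
        "counterintuitive": float(ll_anomalous.mean()) > float(ll_normal.mean()),
    }
```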

[405] Flexible Entropy Control in RLVR with Gradient-Preserving Perspective

Kun Chen, Peng Shi, Fanfan Liu, Haibo Qiu, Zhixiong Zeng, Siqi Yang, Wenji Mao

Main category: cs.LG

TL;DR: The paper proposes dynamic clipping threshold strategies for RL with verifiable rewards to prevent policy entropy collapse in LLMs, enabling better entropy control and improved performance.

DetailsMotivation: Continuous RL training for LLMs leads to policy entropy collapse - rapid entropy decay causing premature overconfidence, reduced output diversity, and vanishing gradients that inhibit learning. Existing clipping strategies are static and lack precise entropy control mechanisms.

Method: 1) Theoretically and empirically analyze importance sampling ratio regions contributing to entropy growth/reduction; 2) Introduce dynamic clipping threshold regulation for precise entropy management; 3) Design and evaluate three dynamic entropy control strategies: increase-then-decrease, decrease-increase-decrease, and oscillatory decay.

Result: Experimental results show the proposed strategies effectively mitigate entropy collapse and achieve superior performance across multiple benchmarks compared to static approaches.

Conclusion: Dynamic clipping threshold regulation provides an effective framework for precise entropy control in RL for LLMs, addressing the fundamental problem of policy entropy collapse and enabling better training dynamics.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a critical method for enhancing the reasoning capabilities of Large Language Models (LLMs). However, continuous training often leads to policy entropy collapse, characterized by a rapid decay in entropy that results in premature overconfidence, reduced output diversity, and vanishing gradient norms that inhibit learning. Gradient-Preserving Clipping is a primary factor influencing these dynamics, but existing mitigation strategies are largely static and lack a framework connecting clipping mechanisms to precise entropy control. This paper proposes reshaping entropy control in RL from the perspective of Gradient-Preserving Clipping. We first theoretically and empirically verify the contributions of specific importance sampling ratio regions to entropy growth and reduction. Leveraging these findings, we introduce a novel regulation mechanism using dynamic clipping threshold to precisely manage entropy. Furthermore, we design and evaluate dynamic entropy control strategies, including increase-then-decrease, decrease-increase-decrease, and oscillatory decay. Experimental results demonstrate that these strategies effectively mitigate entropy collapse, and achieve superior performance across multiple benchmarks.
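
The three schedule names come from the paper, but their functional forms are not given in the abstract; the shapes below are our own illustrative choices:

```python
import math

def clip_eps(step, total, schedule="increase_then_decrease",
             eps_min=0.1, eps_max=0.3):
    """Illustrative dynamic clipping-threshold schedules; the shapes and
    constants are our assumptions, not the paper's exact formulas."""
    t = step / max(total - 1, 1)
    if schedule == "increase_then_decrease":
        w = 1.0 - abs(2.0 * t - 1.0)                    # triangle: up, then down
    elif schedule == "decrease_increase_decrease":
        w = 0.5 * (1.0 + math.cos(3.0 * math.pi * t))   # one full dip and rise
    else:  # "oscillatory_decay"
        w = (1.0 - t) * 0.5 * (1.0 + math.cos(8.0 * math.pi * t))
    return eps_min + (eps_max - eps_min) * w
```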

[406] LLM-FS: Zero-Shot Feature Selection for Effective and Interpretable Malware Detection

Naveen Gill, Ajvad Haneef K, Madhu Kumar S D

Main category: cs.LG

TL;DR: LLMs can perform zero-shot feature selection for malware detection using only feature names and task descriptions, achieving competitive performance with traditional methods while offering better interpretability and stability.

DetailsMotivation: Traditional feature selection methods for malware detection rely on statistical heuristics or model-driven importance scores but often overlook semantic context of features. The paper investigates whether LLMs can guide feature selection in a zero-shot setting as an alternative to conventional approaches.

Method: Evaluated multiple LLMs (GPT-5.0, GPT-4.0, Gemini-2.5) on EMBOD dataset (fusion of EMBER and BODMAS benchmarks) for zero-shot feature selection using only feature names and task descriptions. Compared against traditional FS methods (Extra Trees, Variance Threshold, Tree-based models, etc.) across several classifiers including Random Forest, Extra Trees, MLP, and KNN.

Result: LLM-guided zero-shot feature selection achieves competitive performance with traditional FS methods across multiple metrics (accuracy, precision, recall, F1, AUC, MCC) while offering advantages in interpretability, stability, and reduced dependence on labeled data.

Conclusion: Zero-shot LLM-based feature selection is a promising alternative for effective and interpretable malware detection, paving the way for knowledge-guided feature selection in security-critical applications.

Abstract: Feature selection (FS) remains essential for building accurate and interpretable detection models, particularly in high-dimensional malware datasets. Conventional FS methods such as Extra Trees, Variance Threshold, Tree-based models, Chi-Squared tests, ANOVA, Random Selection, and Sequential Attention rely primarily on statistical heuristics or model-driven importance scores, often overlooking the semantic context of features. Motivated by recent progress in LLM-driven FS, we investigate whether large language models (LLMs) can guide feature selection in a zero-shot setting, using only feature names and task descriptions, as a viable alternative to traditional approaches. We evaluate multiple LLMs (GPT-5.0, GPT-4.0, Gemini-2.5, etc.) on the EMBOD dataset (a fusion of the EMBER and BODMAS benchmark datasets), comparing them against established FS methods across several classifiers, including Random Forest, Extra Trees, MLP, and KNN. Performance is assessed using accuracy, precision, recall, F1, AUC, MCC, and runtime. Our results demonstrate that LLM-guided zero-shot feature selection achieves competitive performance with traditional FS methods while offering additional advantages in interpretability, stability, and reduced dependence on labeled data. These findings position zero-shot LLM-based FS as a promising alternative strategy for effective and interpretable malware detection, paving the way for knowledge-guided feature selection in security-critical applications.
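
A zero-shot FS query of this kind reduces to prompt construction; a hypothetical example follows (the wording and format are ours, not the paper's prompt):

```python
def zero_shot_fs_prompt(feature_names, k=20):
    """Build a hypothetical zero-shot feature-selection prompt: only feature
    names and a task description are provided, no labeled data."""
    return (
        "Task: static malware detection from PE-file features.\n"
        f"From the feature names below, select the {k} most informative for "
        "separating malware from benign software. Return a comma-separated list.\n"
        "Features: " + ", ".join(feature_names)
    )

# Usage: send zero_shot_fs_prompt(names) to any chat LLM and parse the reply.
```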

[407] Why Linear Interpretability Works: Invariant Subspaces as a Result of Architectural Constraints

Andres Saurez, Yousung Lee, Dongsoo Har

Main category: cs.LG

TL;DR: Transformers’ linear communication interfaces (attention OV circuits, unembedding matrices) force semantic features to occupy context-invariant linear subspaces, explaining why simple linear methods like probes and sparse autoencoders succeed in deep nonlinear systems.

DetailsMotivation: To provide a principled architectural explanation for why simple linear interpretability methods (linear probes and sparse autoencoders) consistently succeed in recovering meaningful structure from deep, nonlinear transformer representations, rather than treating this as merely an empirical observation.

Method: Proves the Invariant Subspace Necessity theorem showing that transformers communicate information through linear interfaces, forcing semantic features to occupy context-invariant linear subspaces. Derives the Self-Reference Property where tokens directly provide geometric directions for their associated features, enabling zero-shot identification of semantic structure without labeled data or learned probes. Empirically validates in eight classification tasks across four model families.

Result: Empirical validation confirms alignment between class tokens and semantically related instances. The framework provides a principled architectural explanation unifying linear probes and sparse autoencoders, showing why linear interpretability methods work in transformers.

Conclusion: The success of linear interpretability methods in transformers is not accidental but architecturally necessary due to linear communication interfaces, enabling zero-shot identification of semantic structure and providing a unified theoretical foundation for interpretability research.

Abstract: Linear probes and sparse autoencoders consistently recover meaningful structure from transformer representations – yet why should such simple methods succeed in deep, nonlinear systems? We show this is not merely an empirical regularity but a consequence of architectural necessity: transformers communicate information through linear interfaces (attention OV circuits, unembedding matrices), and any semantic feature decoded through such an interface must occupy a context-invariant linear subspace. We formalize this as the Invariant Subspace Necessity theorem and derive the Self-Reference Property: tokens directly provide the geometric direction for their associated features, enabling zero-shot identification of semantic structure without labeled data or learned probes. Empirical validation in eight classification tasks and four model families confirms the alignment between class tokens and semantically related instances. Our framework provides a principled architectural explanation for why linear interpretability methods work, unifying linear probes and sparse autoencoders.
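
A sketch of how the Self-Reference Property can be operationalized, assuming a (vocab × d) unembedding matrix and precomputed hidden states; details such as the normalization are our choices:

```python
import torch

def self_reference_scores(hidden, unembed, class_token_ids):
    """Zero-shot probing via the Self-Reference Property: use each class
    token's unembedding row as the probe direction and score hidden states
    by cosine alignment, with no labels and no trained probe."""
    dirs = unembed[class_token_ids]                  # (C, d) class directions
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    h = hidden / hidden.norm(dim=-1, keepdim=True)   # (N, d) hidden states
    return h @ dirs.T                                # (N, C) alignment scores

# Usage: preds = self_reference_scores(h, model.lm_head.weight, ids).argmax(-1)
```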

[408] Blind denoising diffusion models and the blessings of dimensionality

Zahra Kadkhodaie, Aram-Alexandre Pooladian, Sinho Chewi, Eero Simoncelli

Main category: cs.LG

TL;DR: Blind Denoising Diffusion Models (BDDMs) automatically learn noise schedules without explicit noise amplitude information, achieving better sample quality than non-blind models by correcting noise mismatch.

DetailsMotivation: Current diffusion models require explicit noise schedules and noise amplitude information during training and sampling. The authors investigate whether diffusion models can work effectively without this explicit noise information, using "blind denoisers" that don't receive noise amplitude.

Method: Theoretical analysis of BDDMs assuming low intrinsic dimensionality of data distribution. Empirical validation on synthetic and image data. Comparison with non-blind diffusion models that use explicit noise schedules.

Result: BDDMs automatically track an implicit noise schedule and can accurately sample from data distribution in polynomially many steps. They produce higher quality samples than non-blind counterparts by correcting noise mismatch between true residual noise and assumed schedule noise.

Conclusion: Blind denoising diffusion models are theoretically sound and empirically effective, offering improved sample quality by eliminating noise schedule mismatch issues present in traditional diffusion models.

Abstract: We analyze, theoretically and empirically, the performance of generative diffusion models based on blind denoisers, in which the denoiser is not given the noise amplitude in either the training or sampling processes. Assuming that the data distribution has low intrinsic dimensionality, we prove that blind denoising diffusion models (BDDMs), despite not having access to the noise amplitude, automatically track a particular implicit noise schedule along the reverse process. Our analysis shows that BDDMs can accurately sample from the data distribution in polynomially many steps as a function of the intrinsic dimension. Empirical results corroborate these mathematical findings on both synthetic and image data, demonstrating that the noise variance is accurately estimated from the noisy image. Remarkably, we observe that schedule-free BDDMs produce samples of higher quality compared to their non-blind counterparts. We provide evidence that this performance gain arises because BDDMs correct the mismatch between the true residual noise (of the image) and the noise assumed by the schedule used in non-blind diffusion models.
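
A skeletal sampling loop for the blind setting, assuming only a trained blind-denoiser callable; the fixed-step update rule below is our simplification of the reverse process, not the paper's derivation:

```python
import torch

@torch.no_grad()
def blind_sample(denoiser, shape, steps=200, step_size=0.1):
    """Sample with a *blind* denoiser: the network receives only the noisy
    input (no noise level) and iterates toward the data manifold; the
    effective noise schedule is implicit rather than user-specified."""
    x = torch.randn(shape)
    for _ in range(steps):
        x_hat = denoiser(x)              # blind estimate of the clean signal
        x = x + step_size * (x_hat - x)  # move part-way toward the estimate
    return x
```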

[409] Circuit Fingerprints: How Answer Tokens Encode Their Geometrical Path

Andres Saurez, Neha Sengar, Dongsoo Har

Main category: cs.LG

TL;DR: Circuit discovery and activation steering in transformers are unified by a geometric principle: answer tokens encode directions that would produce them, enabling circuit discovery without gradients and controlled steering.

DetailsMotivation: To unify circuit discovery and activation steering in transformers by showing they follow a single geometric principle, moving beyond treating them as separate research threads.

Method: Proposes the Circuit Fingerprint hypothesis: answer tokens processed in isolation encode directions that would produce them. Uses geometric alignment for circuit discovery without gradients or causal intervention, validated on standard benchmarks across four model families.

Result: Achieves circuit discovery performance comparable to gradient-based methods on IOI, SVA, MCQA benchmarks. Enables controlled steering with 69.8% emotion classification accuracy vs 53.1% for instruction prompting while preserving factual accuracy.

Conclusion: Transformer circuits are fundamentally geometric structures where interpretability and controllability are two facets of the same object, revealing a read-write duality in transformer representations.

Abstract: Circuit discovery and activation steering in transformers have developed as separate research threads, yet both operate on the same representational space. Are they two views of the same underlying structure? We show they follow a single geometric principle: answer tokens, processed in isolation, encode the directions that would produce them. This Circuit Fingerprint hypothesis enables circuit discovery without gradients or causal intervention – recovering comparable structure to gradient-based methods through geometric alignment alone. We validate this on standard benchmarks (IOI, SVA, MCQA) across four model families, achieving circuit discovery performance comparable to gradient-based methods. The same directions that identify circuit components also enable controlled steering – achieving 69.8% emotion classification accuracy versus 53.1% for instruction prompting while preserving factual accuracy. Beyond method development, this read-write duality reveals that transformer circuits are fundamentally geometric structures: interpretability and controllability are two facets of the same object.

[410] Differentiable Modeling for Low-Inertia Grids: Benchmarking PINNs, NODEs, and DP for Identification and Control of SMIB System

Shinhoo Kang, Sangwook Kim, Sehyun Yun

Main category: cs.LG

TL;DR: Comparative study of Physics-Informed Neural Networks (PINNs), Neural Ordinary Differential Equations (NODEs), and Differentiable Programming (DP) for power system modeling, identification, and control, showing trade-offs between data-driven flexibility and physical structure.

DetailsMotivation: The transition to low-inertia power systems requires accurate state predictions with physically consistent sensitivities for control. Scientific machine learning offers tools, but control-oriented implications of different differentiable paradigms remain insufficiently understood.

Method: Comparative study of PINNs, NODEs, and DP for modeling, identification, and control of power system dynamics using Single Machine Infinite Bus (SMIB) system as benchmark. Evaluated performance in trajectory extrapolation, parameter estimation, and Linear Quadratic Regulator (LQR) synthesis.

Result: NODE shows superior extrapolation by capturing the underlying vector field, while PINN shows limited generalization. Both DP and PINN recover unknown parameters in inverse problems, but DP converges faster by enforcing hard constraints. For control synthesis, DP yields closed-loop stability comparable to the theoretical optimum. NODE serves as a viable data-driven surrogate when governing equations are unavailable.

Conclusion: There is a fundamental trade-off between data-driven flexibility and physical structure among differentiable paradigms for power system control. DP provides the best control-synthesis results, while NODE offers a viable data-driven surrogate when the governing physics is unknown.

Abstract: The transition toward low-inertia power systems demands modeling frameworks that provide not only accurate state predictions but also physically consistent sensitivities for control. While scientific machine learning offers powerful nonlinear modeling tools, the control-oriented implications of different differentiable paradigms remain insufficiently understood. This paper presents a comparative study of Physics-Informed Neural Networks (PINNs), Neural Ordinary Differential Equations (NODEs), and Differentiable Programming (DP) for modeling, identification, and control of power system dynamics. Using the Single Machine Infinite Bus (SMIB) system as a benchmark, we evaluate their performance in trajectory extrapolation, parameter estimation, and Linear Quadratic Regulator (LQR) synthesis. Our results highlight a fundamental trade-off between data-driven flexibility and physical structure. NODE exhibits superior extrapolation by capturing the underlying vector field, whereas PINN shows limited generalization due to its reliance on a time-dependent solution map. In the inverse problem of parameter identification, while both DP and PINN successfully recover the unknown parameters, DP achieves significantly faster convergence by enforcing governing equations as hard constraints. Most importantly, for control synthesis, the DP framework yields closed-loop stability comparable to the theoretical optimum. Furthermore, we demonstrate that NODE serves as a viable data-driven surrogate when governing equations are unavailable.
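
For context, the SMIB benchmark is governed by the classical swing equation; a minimal simulation sketch with illustrative parameter values (not the paper's) follows.

```python
import numpy as np
from scipy.integrate import solve_ivp

def smib_rhs(t, x, M=0.1, D=0.05, Pm=0.8, Pmax=1.2):
    """Classical SMIB swing equation: rotor angle delta and speed deviation
    omega, with inertia M, damping D, mechanical power Pm, and maximum
    electrical power Pmax (all values here are illustrative)."""
    delta, omega = x
    return [omega, (Pm - Pmax * np.sin(delta) - D * omega) / M]

# Trajectory data of this kind is what the PINN/NODE/DP variants fit or control.
sol = solve_ivp(smib_rhs, (0.0, 10.0), [0.5, 0.0], max_step=0.01)
```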

[411] Resilient Class-Incremental Learning: on the Interplay of Drifting, Unlabelled and Imbalanced Data Streams

Jin Li, Kleanthis Malialis, Marios Polycarpou

Main category: cs.LG

TL;DR: SCIL is a streaming class-incremental learning framework that handles concept drift, class imbalance, label scarcity, and new class emergence in dynamic data streams using autoencoder-based architecture with dual-loss training and class management techniques.

DetailsMotivation: To address challenges in streaming data environments where concept drift, class imbalance, label scarcity, and new class emergence jointly degrade representation stability, bias learning toward outdated distributions, and reduce detection reliability in dynamic environments.

Method: Integrates autoencoder with multi-layer perceptron for multi-class prediction, uses dual-loss strategy (classification + reconstruction) for prediction and new class detection, employs corrected pseudo-labels for online training, manages classes with queues, and applies oversampling to handle imbalance.

Result: SCIL outperforms strong baselines and state-of-the-art methods on both real-world and synthetic datasets featuring class imbalance, incremental classes, and concept drifts.

Conclusion: The proposed SCIL framework effectively addresses multiple challenges in streaming learning environments and demonstrates superior performance compared to existing methods.

Abstract: In today's connected world, the generation of massive streaming data across diverse domains has become commonplace. Concept drift, class imbalance, label scarcity, and new class emergence jointly degrade representation stability, bias learning toward outdated distributions, and reduce the resilience and reliability of detection in dynamic environments. This paper proposes SCIL (Streaming Class-Incremental Learning) to address these challenges. The SCIL framework integrates an autoencoder (AE) with a multi-layer perceptron for multi-class prediction, uses a dual-loss strategy (classification and reconstruction) for prediction and new class detection, employs corrected pseudo-labels for online training, manages classes with queues, and applies oversampling to handle imbalance. The rationale behind the method's structure is elucidated through ablation studies and a comprehensive experimental evaluation is performed using both real-world and synthetic datasets that feature class imbalance, incremental classes, and concept drifts. Our results demonstrate that SCIL outperforms strong baselines and state-of-the-art methods. Based on our commitment to Open Science, we make our code and datasets available to the community.
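
The dual-loss idea is straightforward to sketch; a minimal version assuming user-supplied encoder/decoder/classifier modules (the weighting and any new-class threshold are our placeholders):

```python
import torch.nn.functional as F

def scil_dual_loss(x, labels, encoder, decoder, classifier, alpha=0.5):
    """Joint objective: classify from the AE latent code and reconstruct the
    input; high reconstruction error on incoming samples can then flag
    candidate new classes (the threshold choice is left to the user)."""
    z = encoder(x)
    cls_loss = F.cross_entropy(classifier(z), labels)
    rec_loss = F.mse_loss(decoder(z), x)
    return alpha * cls_loss + (1.0 - alpha) * rec_loss
```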

[412] Model soups need only one ingredient

Alireza Abdollahpoorrostam, Nikolaos Dimitriadis, Adam Hazimeh, Pascal Frossard

Main category: cs.LG

TL;DR: MonoSoup: A single-checkpoint method using SVD decomposition and entropy-based reweighting to balance in-distribution accuracy and out-of-distribution robustness without training multiple models.

DetailsMotivation: Fine-tuning large models improves in-distribution accuracy but harms out-of-distribution robustness. Existing ensemble methods like Model Soups require training/storing multiple models, which is computationally expensive.

Method: Apply SVD to each layer’s weight updates, decompose into high-energy (task-specific) and low-energy (noise/residual) directions. Use entropy-based effective rank to automatically re-weigh components with layer-wise coefficients.

Result: Achieves strong ID-OOD balance on CLIP models fine-tuned on ImageNet and Qwen language models on mathematical reasoning benchmarks. Comparable to multi-checkpoint methods without computational overhead.

Conclusion: MonoSoup provides a practical, data-free, hyperparameter-free alternative to ensemble methods, balancing accuracy and robustness using only a single checkpoint.

Abstract: Fine-tuning large pre-trained models on a target distribution often improves in-distribution (ID) accuracy, but at the cost of out-of-distribution (OOD) robustness as representations specialize to the fine-tuning data. Weight-space ensembling methods, such as Model Soups, mitigate this effect by averaging multiple checkpoints, but they are computationally prohibitive, requiring the training and storage of dozens of fine-tuned models. In this paper, we introduce MonoSoup, a simple, data-free, hyperparameter-free, post-hoc method that achieves a strong ID-OOD balance using only a single checkpoint. Our method applies Singular Value Decomposition (SVD) to each layer's update and decomposes it into high-energy directions that capture task-specific adaptation and low-energy directions that introduce noise but may still encode residual signals useful for robustness. MonoSoup then uses entropy-based effective rank to automatically re-weigh these components with layer-wise coefficients that account for the spectral and geometric structure of the model. Experiments on CLIP models fine-tuned on ImageNet and evaluated under natural distribution shifts, as well as on Qwen language models tested on mathematical reasoning and multiple-choice benchmarks, show that this plug-and-play approach is a practical and effective alternative to multi-checkpoint methods, retaining most of their benefits without their computational overhead.
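
A per-layer sketch of the idea, with the caveat that the damping rule below is a placeholder: the paper derives its layer-wise coefficients from the spectral and geometric structure, which we do not reproduce here.

```python
import torch

def monosoup_layer(w_pre, w_ft, damp=0.5):
    """SVD the fine-tuning update, estimate an entropy-based effective rank,
    and damp the low-energy directions beyond it (the damping factor is an
    illustrative stand-in for the paper's layer-wise coefficients)."""
    delta = w_ft - w_pre
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    p = S / S.sum()
    eff_rank = torch.exp(-(p * torch.log(p + 1e-12)).sum())  # exp of spectral entropy
    k = int(eff_rank.round().clamp(min=1))
    scale = torch.ones_like(S)
    scale[k:] = damp                  # keep high-energy, shrink low-energy
    return w_pre + (U * (S * scale)) @ Vh
```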

[413] Physics-informed diffusion models in spectral space

Davide Gallon, Philippe von Wurstemberger, Patrick Cheridito, Arnulf Jentzen

Main category: cs.LG

TL;DR: Physics-informed spectral diffusion model for solving parametric PDEs with partial observations, combining latent diffusion with spectral representations and physics constraints.

DetailsMotivation: To develop an efficient method for solving forward and inverse PDE problems with sparse observations by combining generative modeling with physics-informed constraints, addressing limitations of existing diffusion-based PDE solvers.

Method: Uses latent diffusion models in spectral space for dimensionality reduction, learns joint distribution of PDE parameters and solutions via diffusion process, enforces physics-informed constraints during inference using Adam-based updates at each diffusion step.

Result: Demonstrates improved accuracy and computational efficiency on Poisson, Helmholtz, and incompressible Navier-Stokes equations compared to state-of-the-art diffusion-based PDE solvers for sparse observations.

Conclusion: The proposed physics-informed spectral diffusion approach effectively solves parametric PDE problems with partial observations, offering better performance than existing methods while maintaining computational efficiency.

Abstract: We propose a methodology that combines generative latent diffusion models with physics-informed machine learning to generate solutions of parametric partial differential equations (PDEs) conditioned on partial observations, which includes, in particular, forward and inverse PDE problems. We learn the joint distribution of PDE parameters and solutions via a diffusion process in a latent space of scaled spectral representations, where Gaussian noise corresponds to functions with controlled regularity. This spectral formulation enables significant dimensionality reduction compared to grid-based diffusion models and ensures that the induced process in function space remains within a class of functions for which the PDE operators are well defined. Building on diffusion posterior sampling, we enforce physics-informed constraints and measurement conditions during inference, applying Adam-based updates at each diffusion step. We evaluate the proposed approach on Poisson, Helmholtz, and incompressible Navier–Stokes equations, demonstrating improved accuracy and computational efficiency compared with existing diffusion-based PDE solvers, which are state of the art for sparse observations. Code is available at https://github.com/deeplearningmethods/PISD.
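
A sketch of the inference-time constraint enforcement, assuming callables for one reverse-diffusion step and for the PDE-residual and measurement losses (all stand-ins; the paper's guidance differs in detail):

```python
import torch

def guided_diffusion_step(z, denoise_step, pde_residual, measurement_err,
                          n_inner=3, lr=1e-2):
    """After the usual reverse-diffusion update, take a few Adam steps on the
    latent to reduce the PDE residual and the mismatch with partial
    observations, in the spirit of diffusion posterior sampling."""
    z = denoise_step(z).detach().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(n_inner):
        opt.zero_grad()
        loss = pde_residual(z) + measurement_err(z)
        loss.backward()
        opt.step()
    return z.detach()
```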

[414] Contextual and Seasonal LSTMs for Time Series Anomaly Detection

Lingpei Zhang, Qingming Li, Yong Yang, Jiahao Chen, Rui Zeng, Chenyang Lyu, Shouling Ji

Main category: cs.LG

TL;DR: CS-LSTMs: A novel prediction-based framework for univariate time series anomaly detection that combines contextual dependencies and seasonal patterns to better detect subtle anomalies like small point anomalies and slowly rising anomalies.

DetailsMotivation: Existing reconstruction-based and prediction-based methods for univariate time series anomaly detection struggle to capture subtle anomalies, particularly small point anomalies and slowly rising anomalies, which are crucial for system reliability management in web systems and cloud servers.

Method: Proposes Contextual and Seasonal LSTMs (CS-LSTMs) built on a noise decomposition strategy that jointly leverages contextual dependencies and seasonal patterns. The framework integrates both time-domain and frequency-domain representations to achieve more accurate modeling of periodic trends and anomaly localization.

Result: Extensive evaluations on public benchmark datasets demonstrate that CS-LSTMs consistently outperform state-of-the-art methods, highlighting their effectiveness and practical value in robust time series anomaly detection.

Conclusion: CS-LSTMs provide an effective solution for detecting subtle anomalies in univariate time series by combining contextual and seasonal modeling, offering practical value for system reliability management in web and cloud environments.

Abstract: Univariate time series (UTS), where each timestamp records a single variable, serve as crucial indicators in web systems and cloud servers. Anomaly detection in UTS plays an essential role in both data mining and system reliability management. However, existing reconstruction-based and prediction-based methods struggle to capture certain subtle anomalies, particularly small point anomalies and slowly rising anomalies. To address these challenges, we propose a novel prediction-based framework named Contextual and Seasonal LSTMs (CS-LSTMs). CS-LSTMs are built upon a noise decomposition strategy and jointly leverage contextual dependencies and seasonal patterns, thereby strengthening the detection of subtle anomalies. By integrating both time-domain and frequency-domain representations, CS-LSTMs achieve more accurate modeling of periodic trends and anomaly localization. Extensive evaluations on public benchmark datasets demonstrate that CS-LSTMs consistently outperform state-of-the-art methods, highlighting their effectiveness and practical value in robust time series anomaly detection.

[415] ExO-PPO: an Extended Off-policy Proximal Policy Optimization Algorithm

Hanyong Wang, Menglong Yang

Main category: cs.LG

TL;DR: ExO-PPO: A new PPO variant combining on-policy stability with off-policy sample efficiency through extended off-policy improvement theory, segmented exponential clipping, and replay buffer organization.

DetailsMotivation: PPO provides stable policy improvement through conservative on-policy updates but sacrifices sample efficiency, while off-policy methods offer better data utilization but suffer from increased variance and bias. The paper aims to leverage advantages of both approaches.

Method: 1) Derive extended off-policy improvement from expectation form of generalized policy improvement lower bound; 2) Extend clipping mechanism with segmented exponential functions for suitable surrogate objective; 3) Organize trajectories from past M policies in replay buffer for off-policy training.

Result: ExO-PPO demonstrates improved performance compared to PPO and other state-of-the-art variants, achieving balanced sample efficiency and stability across varied tasks in empirical experiments.

Conclusion: The proposed ExO-PPO successfully combines the stability of on-policy methods with the sample efficiency of off-policy approaches, offering a practical solution to the trade-off between stability and data utilization in deep reinforcement learning.

Abstract: Deep reinforcement learning has been able to solve various tasks successfully; however, due to the construction of the policy gradient and the training dynamics, tuning deep reinforcement learning models remains challenging. As one of the most successful deep reinforcement learning algorithms, the Proximal Policy Optimization algorithm (PPO) clips the policy gradient within conservative on-policy updates, which ensures reliable and stable policy improvement. However, this training pattern may sacrifice sample efficiency. On the other hand, off-policy methods make more adequate use of data through sample reuse, though at the cost of increased estimation variance and bias. To leverage the advantages of both, in this paper, we propose a new PPO variant based on the stability guarantee from conservative on-policy iteration with more efficient off-policy data utilization. Specifically, we first derive an extended off-policy improvement from an expectation form of the generalized policy improvement lower bound. Then, we extend the clipping mechanism with segmented exponential functions for a suitable surrogate objective function. Third, the trajectories generated by the past M policies are organized in the replay buffer for off-policy training. We refer to this method as Extended Off-policy Proximal Policy Optimization (ExO-PPO). Compared with PPO and some other state-of-the-art variants, we demonstrate improved performance of ExO-PPO, with balanced sample efficiency and stability, on varied tasks in our empirical experiments.
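
The abstract does not spell out the segmented exponential clipping; one plausible, gradient-preserving reading (entirely our construction) replaces the flat clip with exponential saturation outside the trust region, so gradients decay smoothly instead of vanishing:

```python
import torch

def soft_clip_ratio(ratio, eps=0.2, beta=5.0):
    """Segmented-exponential clipping sketch: inside [1-eps, 1+eps] the
    importance ratio passes through unchanged; outside, it saturates
    exponentially toward the boundary instead of going flat, so the
    gradient shrinks smoothly rather than dropping to zero. The exact
    segments and constants are our assumptions."""
    lo, hi = 1.0 - eps, 1.0 + eps
    above = hi + (1.0 - torch.exp(-beta * (ratio - hi))) / beta
    below = lo - (1.0 - torch.exp(-beta * (lo - ratio))) / beta
    return torch.where(ratio > hi, above, torch.where(ratio < lo, below, ratio))
```

At the boundaries the slope is exactly 1, so the function is continuous and differentiable; beyond them, the gradient decays as exp(-beta · distance) rather than vanishing outright.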

[416] BRAVA-GNN: Betweenness Ranking Approximation Via Degree MAss Inspired Graph Neural Network

Justin Dachille, Aurora Rossi, Sunil Kumar Maurya, Frederik Mallmann-Trenn, Xin Liu, Frédéric Giroire, Tsuyoshi Murata, Emanuele Natale

Main category: cs.LG

TL;DR: BRAVA-GNN: A lightweight GNN architecture for predicting node betweenness centrality that generalizes well to high-diameter graphs like road networks using degree-based features and hyperbolic random graph training.

DetailsMotivation: Betweenness centrality is computationally expensive on large networks, and existing GNN-based methods fail to generalize to high-diameter graphs like road networks, creating a need for more robust and efficient solutions.

Method: Leverages correlation between betweenness centrality and multi-hop degree mass, uses degree masses as size-invariant node features, trains on synthetic hyperbolic random graphs that better match real network structures, and employs a lightweight GNN architecture.

Result: Achieves up to 214% improvement in Kendall-Tau correlation and up to 70x speedup in inference time over state-of-the-art GNN approaches, particularly on challenging road networks, while using 54x fewer parameters than existing baselines.

Conclusion: BRAVA-GNN demonstrates that lightweight GNNs with appropriate feature engineering and synthetic training data can effectively predict betweenness centrality across diverse graph types, including previously challenging high-diameter networks.

Abstract: Computing node importance in networks is a long-standing fundamental problem that has driven extensive study of various centrality measures. A particularly well-known centrality measure is betweenness centrality, which becomes computationally prohibitive on large-scale networks. Graph Neural Network (GNN) models have thus been proposed to predict node rankings according to their relative betweenness centrality. However, state-of-the-art methods fail to generalize to high-diameter graphs such as road networks. We propose BRAVA-GNN, a lightweight GNN architecture that leverages the empirically observed correlation linking betweenness centrality to degree-based quantities, in particular multi-hop degree mass. This correlation motivates the use of degree masses as size-invariant node features and synthetic training graphs that closely match the degree distributions of real networks. Furthermore, while previous work relies on scale-free synthetic graphs, we leverage the hyperbolic random graph model, which reproduces power-law exponents outside the scale-free regime, better capturing the structure of real-world graphs like road networks. This design enables BRAVA-GNN to generalize across diverse graph families while using 54x fewer parameters than the most lightweight existing GNN baseline. Extensive experiments on 19 real-world networks, spanning social, web, email, and road graphs, show that BRAVA-GNN achieves up to 214% improvement in Kendall-Tau correlation and up to 70x speedup in inference time over state-of-the-art GNN-based approaches, particularly on challenging road networks.
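
A sketch of the multi-hop degree-mass features, with max-scaling as our own size-invariance normalization:

```python
import numpy as np

def degree_mass_features(adj, hops=3):
    """k-hop degree mass: the total degree of nodes reachable within k hops,
    used as a size-invariant node feature (the max-scaling normalization
    here is our choice, not necessarily the paper's)."""
    a = (adj > 0).astype(int)
    deg = a.sum(axis=1).astype(float)
    reach = np.eye(a.shape[0], dtype=int)
    feats = []
    for _ in range(hops):
        reach = np.clip(reach + reach @ a, 0, 1)   # grow the reachable set by one hop
        mass = reach @ deg                          # summed degree of reachable nodes
        feats.append(mass / (mass.max() + 1e-12))
    return np.stack(feats, axis=1)                  # (n_nodes, hops)
```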

[417] Grounding LTL Tasks in Sub-Symbolic RL Environments for Zero-Shot Generalization

Matteo Pannacci, Andrea Fanti, Elena Umili, Roberto Capobianco

Main category: cs.LG

TL;DR: Training RL agents to follow Linear Temporal Logic instructions in sub-symbolic environments without requiring pre-defined symbol mappings, using joint training of policy and symbol grounder.

DetailsMotivation: Previous multi-task RL approaches require knowledge of mapping between raw observations and symbols in temporal logic formulas, which is unrealistic. Need methods that work in sub-symbolic environments without this assumption.

Method: Jointly train multi-task policy and symbol grounder using same experience. Symbol grounder trained only from raw observations and sparse rewards via Neural Reward Machines in semi-supervised fashion.

Result: Achieves performance comparable to using true symbol grounding and significantly outperforms state-of-the-art methods for sub-symbolic environments in vision-based experiments.

Conclusion: Proposed method enables RL agents to follow temporal logic instructions in sub-symbolic environments without requiring pre-defined symbol mappings, advancing multi-task RL capabilities.

Abstract: In this work we address the problem of training a Reinforcement Learning agent to follow multiple temporally-extended instructions expressed in Linear Temporal Logic in sub-symbolic environments. Previous multi-task work has mostly relied on knowledge of the mapping between raw observations and symbols appearing in the formulae. We drop this unrealistic assumption by jointly training a multi-task policy and a symbol grounder with the same experience. The symbol grounder is trained only from raw observations and sparse rewards via Neural Reward Machines in a semi-supervised fashion. Experiments on vision-based environments show that our method achieves performance comparable to using the true symbol grounding and significantly outperforms state-of-the-art methods for sub-symbolic environments.

[418] Towards Poisoning Robustness Certification for Natural Language Generation

Mihnea Ghitu, Matthew Wicker

Main category: cs.LG

TL;DR: TPA is the first certified defense algorithm for autoregressive language generation that provides provable robustness bounds against poisoning attacks, addressing both stability (general robustness) and validity (targeted harmful changes) properties.

DetailsMotivation: Existing certified poisoning defenses work for classification but fail for autoregressive generation due to sequential predictions and exponentially large output spaces. There's a need for certified security guarantees when deploying language models in security-critical applications.

Method: Introduces Targeted Partition Aggregation (TPA) algorithm that computes minimum poisoning budget needed to induce specific harmful outputs. Extends TPA with mixed integer linear programming (MILP) for tighter guarantees in multi-turn generations.

Result: TPA effectively certifies validity of agent tool-calling with up to 0.5% dataset modification and certifies 8-token stability horizons in preference-based alignment. Demonstrates practical applicability across diverse settings.

Conclusion: TPA enables certified deployment of language models in security-critical applications by providing the first framework for certified natural language generation, though inference-time latency remains a challenge.

Abstract: Understanding the reliability of natural language generation is critical for deploying foundation models in security-sensitive domains. While certified poisoning defenses provide provable robustness bounds for classification tasks, they are fundamentally ill-equipped for autoregressive generation: they cannot handle sequential predictions or the exponentially large output space of language models. To establish a framework for certified natural language generation, we formalize two security properties: stability (robustness to any change in generation) and validity (robustness to targeted, harmful changes in generation). We introduce Targeted Partition Aggregation (TPA), the first algorithm to certify validity/targeted attacks by computing the minimum poisoning budget needed to induce a specific harmful class, token, or phrase. Further, we extend TPA to provide tighter guarantees for multi-turn generations using mixed integer linear programming (MILP). Empirically, we demonstrate TPA’s effectiveness across diverse settings including: certifying validity of agent tool-calling when adversaries modify up to 0.5% of the dataset and certifying 8-token stability horizons in preference-based alignment. Though inference-time latency remains an open challenge, our contributions enable certified deployment of language models in security-critical applications.
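
To give intuition for budget certification, here is the classic partition-aggregation bound for a targeted class, which TPA generalizes to tokens, phrases, and (via MILP) multi-turn generations; tie-breaking details are ignored:

```python
import numpy as np

def min_targeted_budget(votes, target):
    """Partition-based certification sketch: one model per disjoint training
    partition, so each poisoned sample can change at most one vote. The
    minimum budget to make `target` the plurality answer is then roughly
    half the vote gap (a DPA-style bound used as a stand-in for TPA's
    tighter targeted analysis)."""
    counts = np.bincount(votes, minlength=max(int(votes.max()) + 1, target + 1))
    winner = int(counts.argmax())
    if winner == target:
        return 0                       # target already wins; nothing to poison
    gap = int(counts[winner] - counts[target])
    return int(np.ceil(gap / 2))       # each poison closes the gap by at most 2
```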

[419] Explainability in Generative Medical Diffusion Models: A Faithfulness-Based Analysis on MRI Synthesis

Surjo Dey, Pallabi Saikia

Main category: cs.LG

TL;DR: A faithfulness-based explainability framework for diffusion models in medical MRI synthesis, evaluating prototype methods to link generated and training features for better transparency.

DetailsMotivation: Diffusion models show strong performance in medical image generation but remain opaque in their decision-making process, creating a need for explainability to ensure trustworthy AI in healthcare applications.

Method: Proposes a faithfulness-based explainability framework analyzing prototype methods (ProtoPNet, Enhanced ProtoPNet, ProtoPool) to link generated and training features, focusing on understanding image formation through denoising trajectories and prototype explainability with faithfulness analysis.

Result: Enhanced ProtoPNet achieves the highest faithfulness score (0.1534), offering more reliable insights into the generative process, demonstrating that diffusion models can be made more transparent through faithfulness-based explanations.

Conclusion: Faithfulness-based explainability can make diffusion models more transparent and trustworthy, contributing to safer and more interpretable generative AI applications in healthcare.

Abstract: This study investigates the explainability of generative diffusion models in the context of medical imaging, focusing on magnetic resonance imaging (MRI) synthesis. Although diffusion models have shown strong performance in generating realistic medical images, their internal decision-making process remains largely opaque. We present a faithfulness-based explainability framework that analyzes how prototype-based explainability methods such as ProtoPNet (PPNet), Enhanced ProtoPNet (EPPNet), and ProtoPool can link the relationship between generated and training features. Our study focuses on understanding the reasoning behind image formation through the denoising trajectory of the diffusion model, and subsequently on prototype explainability with faithfulness analysis. Experimental analysis shows that EPPNet achieves the highest faithfulness (score 0.1534), offering more reliable insights into, and explainability of, the generative process. The results highlight that diffusion models can be made more transparent and trustworthy through faithfulness-based explanations, contributing to safer and more interpretable applications of generative AI in healthcare.

[420] When Less is More: The LLM Scaling Paradox in Context Compression

Ruishan Guo, Yibing Liu, Guoxin Ma, Yan Wang, Yueyang Zhang, Long Xia, Kecheng Chen, Zhiyuan Sun, Daiting Shi

Main category: cs.LG

TL;DR: Larger models in compressor-decoder setups show a Size-Fidelity Paradox where increased compressor size reduces faithfulness of reconstructed contexts despite lower training loss, due to knowledge overwriting and semantic drift.

DetailsMotivation: The paper challenges the conventional scaling paradigm that larger models always yield superior generation capabilities, particularly examining how model scaling affects faithfulness in context compression tasks where contexts need to be accurately reconstructed.

Method: Conducted extensive experiments across models from 0.6B to 90B parameters in compressor-decoder setups. Analyzed the paradox through two factors: knowledge overwriting (replacing source facts with prior beliefs) and semantic drift (paraphrasing/restructuring content). Examined how increased semantic capacity and generative uncertainty affect faithful preservation.

Result: Found that larger compressor models exhibit decreased faithfulness in reconstructed contexts despite lower training loss. Identified that increased rank of context embeddings facilitates prior knowledge intrusion, while higher entropy over token prediction distributions promotes rewriting behavior.

Conclusion: The Size-Fidelity Paradox reveals that scaling laws break down for faithful preservation in open-ended generation. The issue is not parameter count itself but the excessive semantic capacity and amplified generative uncertainty that accompany scaling, challenging assumptions about model scaling benefits.

Abstract: Scaling up model parameters has long been a prevalent training paradigm, driven by the assumption that larger models yield superior generation capabilities. However, under lossy context compression in a compressor-decoder setup, we observe a Size-Fidelity Paradox: increasing the compressor size can lessen the faithfulness of reconstructed contexts even though training loss decreases. Through extensive experiments across models from 0.6B to 90B, we trace this paradox to two dominant factors: 1) knowledge overwriting: larger models increasingly replace source facts with their own prior beliefs, e.g., "the white strawberry" → "the red strawberry"; and 2) semantic drift: larger models tend to paraphrase or restructure content instead of reproducing it verbatim, e.g., "Alice hit Bob" → "Bob hit Alice". By holding model size fixed, we reflect on the emergent properties of compressed context representations. We show that the culprit is not parameter count itself, but the excessive semantic capacity and amplified generative uncertainty that accompany scaling. Specifically, the increased rank of context embeddings facilitates prior-knowledge intrusion, whereas higher entropy over token prediction distributions promotes rewriting. Our results complement existing evaluations of the context compression paradigm, underpinning a breakdown in scaling laws for faithful preservation in open-ended generation.
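
The two quantities blamed for the paradox are easy to measure; a sketch using standard definitions (exp of spectral entropy for effective rank, Shannon entropy of the next-token distribution), which may differ from the paper's exact estimators:

```python
import torch

def compression_diagnostics(ctx_embeddings, logits):
    """Return the effective rank of compressed-context embeddings (capacity
    for prior-knowledge intrusion) and the mean entropy of next-token
    distributions (tendency to rewrite); common definitions, not verbatim
    from the paper."""
    s = torch.linalg.svdvals(ctx_embeddings)             # singular values, (min(n, d),)
    p = s / s.sum()
    eff_rank = torch.exp(-(p * torch.log(p + 1e-12)).sum())
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(-1).mean()
    return eff_rank.item(), entropy.item()
```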

[421] A Controlled Study of Double DQN and Dueling DQN Under Cross-Environment Transfer

Azka Nasir, Fatima Dossa, Muhammad Ahmed Atif, Mohammad Ahmed Atif

Main category: cs.LG

TL;DR: DDQN shows robust transfer learning while Dueling DQN exhibits negative transfer when transferring from CartPole to LunarLander environments, suggesting architectural inductive biases significantly impact cross-environment transfer robustness in deep RL.

DetailsMotivation: To understand how architectural differences between Double Deep Q-Networks (DDQN) and Dueling DQN influence transfer learning behavior across structurally distinct environments, particularly examining robustness to domain shift.

Method: Controlled empirical study using CartPole as source task and LunarLander as target task with fixed layer-wise representation transfer protocol under identical hyperparameters and training conditions, comparing against baseline agents trained from scratch.

Result: DDQN consistently avoids negative transfer and maintains learning dynamics comparable to baseline performance, while Dueling DQN consistently exhibits negative transfer with degraded rewards and unstable optimization behavior, confirmed by statistical analysis across multiple random seeds.

Conclusion: Architectural inductive bias is strongly associated with robustness to cross-environment transfer in value-based deep reinforcement learning, with DDQN showing superior transfer robustness compared to Dueling DQN under the examined protocol.

Abstract: Transfer learning in deep reinforcement learning is often motivated by improved stability and reduced training cost, but it can also fail under substantial domain shift. This paper presents a controlled empirical study examining how architectural differences between Double Deep Q-Networks (DDQN) and Dueling DQN influence transfer behavior across environments. Using CartPole as a source task and LunarLander as a structurally distinct target task, we evaluate a fixed layer-wise representation transfer protocol under identical hyperparameters and training conditions, with baseline agents trained from scratch used to contextualize transfer effects. Empirical results show that DDQN consistently avoids negative transfer under the examined setup and maintains learning dynamics comparable to baseline performance in the target environment. In contrast, Dueling DQN consistently exhibits negative transfer under identical conditions, characterized by degraded rewards and unstable optimization behavior. Statistical analysis across multiple random seeds confirms a significant performance gap under transfer. These findings suggest that architectural inductive bias is strongly associated with robustness to cross-environment transfer in value-based deep reinforcement learning under the examined transfer protocol.
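
The transfer protocol itself is simple to sketch, assuming PyTorch Q-networks and that transferable layers share a name prefix (both assumptions ours):

```python
def transfer_layers(source_net, target_net, prefixes=("feature",)):
    """Fixed layer-wise transfer sketch: copy shape-compatible parameters
    whose names match the chosen prefixes from the source (CartPole) net
    into the target (LunarLander) net; everything else trains from scratch.
    Both nets are assumed to be torch.nn.Module instances."""
    src = source_net.state_dict()
    dst = target_net.state_dict()
    for name, w in src.items():
        if name.startswith(prefixes) and name in dst and dst[name].shape == w.shape:
            dst[name] = w.clone()
    target_net.load_state_dict(dst)
```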

[422] Fully-automated sleep staging: multicenter validation of a generalizable deep neural network for Parkinson’s disease and isolated REM sleep behavior disorder

Jesper Strøm, Casper Skjærbæk, Natasha Becker Bertelsen, Steffen Torpe Simonsen, Niels Okkels, David Bertram, Sinah Röttgen, Konstantin Kufer, Kaare B. Mikkelsen, Marit Otto, Poul Jørgen Jennum, Per Borghammer, Michael Sommerauer, Preben Kidmose

Main category: cs.LG

TL;DR: Adapted U-Sleep deep neural network for automated sleep staging in Parkinson’s disease and isolated REM sleep behavior disorder, improving accuracy through fine-tuning and confidence-based thresholds.

DetailsMotivation: Manual sleep staging is challenging in neurodegenerative diseases due to EEG abnormalities and fragmented sleep, creating a bottleneck for deploying RBD screening technologies at scale.

Method: Fine-tuned a pretrained U-Sleep model on research datasets (PD, iRBD, controls) and evaluated on independent dataset, with interrater study and confidence-based thresholds for REM sleep staging.

Result: Fine-tuning improved model performance from κ=0.66 to κ=0.74 on training data and from κ=0.60 to κ=0.64 on independent test data; confidence thresholds increased correct REM sleep identification from 85% to 95.5%.

Conclusion: The adapted U-Sleep model provides generalizable automated sleep staging for neurodegenerative diseases, addressing the bottleneck in RBD screening and enabling scalable deployment.

Abstract: Isolated REM sleep behavior disorder (iRBD) is a key prodromal marker of Parkinson's disease (PD), and video-polysomnography (vPSG) remains the diagnostic gold standard. However, manual sleep staging is particularly challenging in neurodegenerative diseases due to EEG abnormalities and fragmented sleep, making PSG assessments a bottleneck for deploying new RBD screening technologies at scale. We adapted U-Sleep, a deep neural network, for generalizable sleep staging in PD and iRBD. A pretrained U-Sleep model, based on a large publicly available, multisite non-neurodegenerative dataset (PUB; 19,236 PSGs across 12 sites), was fine-tuned on research datasets from two centers (Lundbeck Foundation Parkinson's Disease Research Center (PACE) and the Cologne-Bonn Cohort (CBC); 112 PD, 138 iRBD, 89 age-matched controls). The resulting model was evaluated on an independent dataset from the Danish Center for Sleep Medicine (DCSM; 81 PD, 36 iRBD, 87 sleep-clinic controls). A subset of PSGs with low agreement between the human rater and the model (κ < 0.6) was re-scored by a second blinded human rater to identify sources of disagreement. Finally, we applied confidence-based thresholds to optimize REM sleep staging. The pretrained model achieved mean κ = 0.81 in PUB, but κ = 0.66 when applied directly to PACE/CBC. By fine-tuning the model, we developed a generalized model with κ = 0.74 on PACE/CBC (p < 0.001 vs. the pretrained model). In DCSM, mean and median κ increased from 0.60 to 0.64 (p < 0.001) and from 0.64 to 0.69 (p < 0.001), respectively. In the interrater study, PSGs with low agreement between the model and the initial scorer showed similarly low agreement between human scorers. Applying a confidence threshold increased the proportion of correctly identified REM sleep epochs from 85% to 95.5%, while preserving sufficient (> 5 min) REM sleep for 95% of subjects.
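
The confidence-thresholding step can be sketched directly; the stage index and the 0.9 threshold below are placeholders, not the study's operating point:

```python
import numpy as np

def confident_rem_epochs(probs, rem_idx=4, threshold=0.9):
    """Keep only epochs where the model's predicted stage is REM and its
    probability exceeds a threshold, trading recall for precision (the
    stage index and threshold are illustrative assumptions)."""
    pred = probs.argmax(axis=1)
    conf = probs.max(axis=1)
    return (pred == rem_idx) & (conf >= threshold)   # boolean mask over epochs
```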

[423] PlugSI: Plug-and-Play Test-Time Graph Adaptation for Spatial Interpolation

Xuhang Wu, Zhuoxuan Liang, Wei Li, Xiaohua Jia, Sumi Helal

Main category: cs.LG

TL;DR: PlugSI: A plug-and-play framework for spatial interpolation in sensor networks that adapts to unseen graph structures at test-time using dual adapters for topology and temporal stability.

DetailsMotivation: High deployment costs hinder scalability of sensor networks in IoT/edge computing. Existing graph-based spatial interpolation methods rely on pre-trained models, lack adaptation to larger/unseen graphs at test-time, and overlook test data utilization.

Method: Proposes PlugSI with two key components: 1) Unknown Topology Adapter (UTA) that adapts to new graph structures in small batches at test-time, and 2) Temporal Balance Adapter (TBA) that maintains historical consensus to guide UTA and prevent noise-induced drifting.

Result: Extensive experiments show PlugSI can be seamlessly integrated into existing graph-based SI methods and provides significant improvement (e.g., 10.81% reduction in MAE).

Conclusion: PlugSI addresses limitations of current graph-based spatial interpolation methods by enabling test-time adaptation to unseen graph structures while maintaining temporal stability, improving performance on sensor network data.

Abstract: With the rapid advancement of IoT and edge computing, sensor networks have become indispensable, driving the need for large-scale sensor deployment. However, the high deployment cost hinders their scalability. To tackle this issue, Spatial Interpolation (SI) introduces virtual sensors whose readings are inferred from observed sensors, leveraging the graph structure. However, current graph-based SI methods rely on pre-trained models, lack adaptation to larger and unseen graphs at test time, and overlook test data utilization. To address these issues, we propose PlugSI, a plug-and-play framework that refines the test-time graph through two key innovations. First, we design an Unknown Topology Adapter (UTA) that adapts to the new graph structure of each small batch at test time, enhancing the generalization of SI pre-trained models. Second, we introduce a Temporal Balance Adapter (TBA) that maintains a stable historical consensus to guide UTA adaptation and prevent drifting caused by noise in the current batch. Empirically, extensive experiments demonstrate that PlugSI can be seamlessly integrated into existing graph-based SI methods and provides significant improvement (e.g., a 10.81% reduction in MAE).
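
A sketch of the TBA intuition, treating adapter parameters as tensors; the EMA momentum and pull strength are our placeholders:

```python
import torch

def tba_update(adapted_param, consensus, momentum=0.99, pull=0.1):
    """Keep an EMA 'consensus' of adapted parameters across test batches and
    pull the current adaptation toward it, damping noise-induced drift
    (the specific update rule is our illustration of the idea)."""
    consensus.mul_(momentum).add_(adapted_param, alpha=1.0 - momentum)
    adapted_param.add_(consensus - adapted_param, alpha=pull)
    return adapted_param, consensus
```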

[424] CoFEH: LLM-driven Feature Engineering Empowered by Collaborative Bayesian Hyperparameter Optimization

Beicheng Xu, Keyao Ding, Wei Liu, Yupeng Lu, Bin Cui

Main category: cs.LG

TL;DR: CoFEH is a collaborative framework that interleaves LLM-based feature engineering with Bayesian hyperparameter optimization for end-to-end AutoML, using mutual conditioning between LLM and BO modules.

DetailsMotivation: Traditional AutoML treats feature engineering as black-box search with rigid search spaces, while existing LLM-based methods only handle isolated subtasks and fail to jointly optimize FE with hyperparameter optimization, leading to suboptimal greedy workflows.

Method: CoFEH uses: 1) LLM-driven FE optimizer with Tree of Thought for flexible pipeline exploration, 2) Bayesian optimization module for HPO, 3) dynamic optimizer selector for adaptive scheduling, and 4) mutual conditioning mechanism to share context between LLM and BO.

Result: CoFEH outperforms both traditional and LLM-based FE baselines and achieves superior end-to-end performance under joint optimization of feature engineering and hyperparameter optimization.

Conclusion: The collaborative framework demonstrates that interleaving LLM-based FE with Bayesian HPO through mutual conditioning enables more robust end-to-end AutoML compared to traditional greedy approaches.

Abstract: Feature Engineering (FE) is pivotal in automated machine learning (AutoML) but remains a bottleneck for traditional methods, which treat it as a black-box search, operating within rigid, predefined search spaces and lacking domain awareness. While Large Language Models (LLMs) offer a promising alternative by leveraging semantic reasoning to generate unbounded operators, existing methods fail to construct free-form FE pipelines, remaining confined to isolated subtasks such as feature generation. Most importantly, they are rarely optimized jointly with hyperparameter optimization (HPO) of the ML model, leading to greedy “FE-then-HPO” workflows that cannot capture strong FE-HPO interactions. In this paper, we present CoFEH, a collaborative framework that interleaves LLM-based FE and Bayesian HPO for robust end-to-end AutoML. CoFEH uses an LLM-driven FE optimizer powered by Tree of Thought (ToT) to explore flexible FE pipelines, a Bayesian optimization (BO) module to solve HPO, and a dynamic optimizer selector that realizes interleaved optimization by adaptively scheduling FE and HPO steps. Crucially, we introduce a mutual conditioning mechanism that shares context between LLM and BO, enabling mutually informed decisions. Experiments show that CoFEH not only outperforms traditional and LLM-based FE baselines, but also achieves superior end-to-end performance under joint optimization.

[425] Differentiable Tripartite Modularity for Clustering Heterogeneous Graphs

Benoît Hurpeau

Main category: cs.LG

TL;DR: A differentiable tripartite modularity formulation for clustering heterogeneous graphs with three node types, enabling end-to-end community detection with linear complexity.

Motivation: Existing differentiable modularity methods like DMoN work well for homogeneous and bipartite graphs but cannot handle higher-order relational structures with three or more entity types, which is common in real-world heterogeneous data like urban cadastral systems.

Method: Proposes a differentiable formulation of tripartite modularity using weighted co-paths across tripartite graphs with exact factorized computation to avoid dense third-order tensors. Includes structural normalization at pivot nodes to handle degree heterogeneity and ensure stable optimization. Can be optimized jointly with graph neural networks end-to-end while maintaining linear complexity in edges.
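
The factorization is the key trick: co-path counts between the two outer node types reduce to an ordinary matrix product through the pivot layer, so a dense third-order tensor never appears. A NumPy sketch under assumed notation (the exact null model and pivot normalization are guesses, not the paper's formulas):

```python
import numpy as np

def tripartite_modularity(A, B, S1, S3, eps=1e-9):
    """Sketch of co-path modularity for a tripartite graph X -A-> Y -B-> Z.
    A: (n1, n2), B: (n2, n3) biadjacency matrices; S1: (n1, k), S3: (n3, k)
    soft cluster assignments for the two outer node types."""
    d_pivot = A.sum(0) + eps           # degrees of the pivot (middle) nodes
    P = (A / d_pivot) @ B              # weighted co-paths via exact factorization
    m = P.sum() + eps
    dr, dc = P.sum(1, keepdims=True), P.sum(0, keepdims=True)
    Bmod = P - dr @ dc / m             # observed minus expected co-paths
    return np.trace(S1.T @ Bmod @ S3) / m
```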

Result: Validated on large-scale urban cadastral data, showing robust convergence behavior and spatially coherent partitions. The framework demonstrates effectiveness for unsupervised clustering of heterogeneous graphs with three node types.

Conclusion: Differentiable tripartite modularity serves as a generic methodological building block for unsupervised clustering of heterogeneous graphs, extending differentiable community detection to higher-order relational structures beyond bipartite graphs.

Abstract: Clustering heterogeneous relational data remains a central challenge in graph learning, particularly when interactions involve more than two types of entities. While differentiable modularity objectives such as DMoN have enabled end-to-end community detection on homogeneous and bipartite graphs, extending these approaches to higher-order relational structures remains non-trivial. In this work, we introduce a differentiable formulation of tripartite modularity for graphs composed of three node types connected through mediated interactions. Community structure is defined in terms of weighted co-paths across the tripartite graph, together with an exact factorized computation that avoids the explicit construction of dense third-order tensors. A structural normalization at pivot nodes is introduced to control extreme degree heterogeneity and ensure stable optimization. The resulting objective can be optimized jointly with a graph neural network in an end-to-end manner, while retaining linear complexity in the number of edges. We validate the proposed framework on large-scale urban cadastral data, where it exhibits robust convergence behavior and produces spatially coherent partitions. These results highlight differentiable tripartite modularity as a generic methodological building block for unsupervised clustering of heterogeneous graphs.

[426] Statistical benchmarking of transformer models in low signal-to-noise time-series forecasting

Cyril Garcia, Guillaume Remy

Main category: cs.LG

TL;DR: Transformers with two-way attention (temporal + cross-sectional) outperform traditional methods for multivariate time-series forecasting in low-data regimes, especially with dynamic sparsification in noisy environments.

Motivation: The paper addresses the challenge of multivariate time-series forecasting with limited data (only a few years of daily observations), particularly in noisy environments where signal-to-noise ratios are low. Traditional methods may struggle in these conditions, and there's a need to understand how transformer architectures perform and generalize in such challenging scenarios.

Method: The authors use synthetically generated processes with known temporal and cross-sectional dependency structures and varying signal-to-noise ratios. They conduct bootstrapped experiments with out-of-sample correlation evaluation against optimal ground-truth predictors. They introduce a two-way attention transformer that alternates between temporal and cross-sectional self-attention, and propose a dynamic sparsification procedure for attention matrices during training.
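
A minimal PyTorch sketch of one two-way attention block, alternating attention over the time axis and the cross-sectional axis; layer sizes and the residual/norm placement are assumptions.

```python
import torch
import torch.nn as nn

class TwoWayAttentionBlock(nn.Module):
    """One block alternating self-attention over time and over series.
    Input x: (batch, T, N, d) with T time steps and N series."""
    def __init__(self, d, heads=4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        b, t, n, d = x.shape
        # Temporal attention: each series attends over its own history.
        h = x.permute(0, 2, 1, 3).reshape(b * n, t, d)
        h = self.norm1(h + self.time_attn(h, h, h)[0])
        h = h.reshape(b, n, t, d).permute(0, 2, 1, 3)
        # Cross-sectional attention: at each time step, series attend to each other.
        c = h.reshape(b * t, n, d)
        c = self.norm2(c + self.cross_attn(c, c, c)[0])
        return c.reshape(b, t, n, d)
```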

Result: Two-way attention transformers outperform standard baselines (Lasso, boosting methods, and fully connected multilayer perceptrons) across a wide range of settings, including low signal-to-noise regimes. Dynamic sparsification becomes significantly effective in noisy environments where correlation between target and optimal predictor is only a few percent. Analysis of learned attention patterns reveals interpretable structure and connections to sparsity-inducing regularization.

Conclusion: Transformer architectures with specialized attention mechanisms can effectively handle multivariate time-series forecasting in low-data, noisy regimes. The dynamic sparsification technique improves performance in noisy environments, and the learned attention patterns provide interpretability and insights into why these models generalize well under noise, connecting to classical regression regularization techniques.

Abstract: We study the performance of transformer architectures for multivariate time-series forecasting in low-data regimes consisting of only a few years of daily observations. Using synthetically generated processes with known temporal and cross-sectional dependency structures and varying signal-to-noise ratios, we conduct bootstrapped experiments that enable direct evaluation via out-of-sample correlations with the optimal ground-truth predictor. We show that two-way attention transformers, which alternate between temporal and cross-sectional self-attention, can outperform standard baselines (Lasso, boosting methods, and fully connected multilayer perceptrons) across a wide range of settings, including low signal-to-noise regimes. We further introduce a dynamic sparsification procedure for attention matrices applied during training, and demonstrate that it becomes significantly effective in noisy environments, where the correlation between the target variable and the optimal predictor is on the order of a few percent. Analysis of the learned attention patterns reveals interpretable structure and suggests connections to sparsity-inducing regularization in classical regression, providing insight into why these models generalize effectively under noise.

[427] Safeguarding Privacy: Privacy-Preserving Detection of Mind Wandering and Disengagement Using Federated Learning in Online Education

Anna Bodonhelyi, Mengdi Wang, Efe Bozkir, Babette Bühler, Enkelejda Kasneci

Main category: cs.LG

TL;DR: Privacy-preserving federated learning framework for detecting cognitive disengagement (mind wandering, boredom) and behavioral disengagement in online learning using video-based facial and gaze features.

Motivation: Online learning lacks instructor support, leading to mind wandering and disengagement that harm learning outcomes. Automated detection via video analysis is promising but raises privacy concerns when using machine learning. Federated learning offers privacy-preserving decentralized training.

Method: Cross-device federated learning framework using facial expressions and gaze features from video to detect behavioral disengagement, mind wandering, and boredom. Incorporates features to handle eyeglasses challenges. Benchmarks multiple federated learning algorithms on five datasets.
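
For concreteness, here is a minimal FedAvg-style round, the simplest of the federated algorithms a benchmark like this would include; client data (facial and gaze features) never leaves the device, and only model weights are aggregated. The interfaces are assumptions.

```python
import copy
import torch

def fedavg_round(global_model, client_loaders, local_epochs=1, lr=0.01):
    """One FedAvg round, a minimal sketch: each client trains locally on its
    own private data; only weights are sent back and size-weighted averaged."""
    client_states, sizes = [], []
    for loader in client_loaders:
        local = copy.deepcopy(global_model)
        opt = torch.optim.SGD(local.parameters(), lr=lr)
        for _ in range(local_epochs):
            for x, y in loader:  # raw data never leaves the client
                loss = torch.nn.functional.cross_entropy(local(x), y)
                opt.zero_grad(); loss.backward(); opt.step()
        client_states.append(local.state_dict())
        sizes.append(len(loader.dataset))
    total = sum(sizes)
    avg = {k: sum(s[k].float() * (n / total) for s, n in zip(client_states, sizes))
           for k in client_states[0]}
    global_model.load_state_dict(avg)
    return global_model
```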

Result: Extensive experiments show promising results for privacy-preserving educational technologies that promote learner engagement. The approach effectively addresses privacy concerns while maintaining detection performance.

Conclusion: Federated learning provides an effective privacy-preserving solution for real-time learner support in online education by detecting cognitive disengagement through video analysis without sharing sensitive data.

Abstract: Since the COVID-19 pandemic, online courses have expanded access to education, yet the absence of direct instructor support challenges learners’ ability to self-regulate attention and engagement. Mind wandering and disengagement can be detrimental to learning outcomes, making their automated detection via video-based indicators a promising approach for real-time learner support. However, machine learning-based approaches often require sharing sensitive data, raising privacy concerns. Federated learning offers a privacy-preserving alternative by enabling decentralized model training while also distributing computational load. We propose a framework exploiting cross-device federated learning to address different manifestations of behavioral and cognitive disengagement during remote learning, specifically behavioral disengagement, mind wandering, and boredom. We fit video-based cognitive disengagement detection models using facial expressions and gaze features. By adopting federated learning, we safeguard users’ data privacy through privacy-by-design and introduce a novel solution with the potential for real-time learner support. We further address challenges posed by eyeglasses by incorporating related features, enhancing overall model performance. To validate the performance of our approach, we conduct extensive experiments on five datasets and benchmark multiple federated learning algorithms. Our results show great promise for privacy-preserving educational technologies promoting learner engagement.

[428] Drug Release Modeling using Physics-Informed Neural Networks

Daanish Aleem Qureshi, Khemraj Shukla, Vikas Srivastava

Main category: cs.LG

TL;DR: PINNs and BPINNs for drug release prediction outperform classical models by integrating Fick’s diffusion law with limited experimental data, achieving accurate long-term predictions from short-term measurements.

Motivation: Classical drug release models (Fick, Higuchi, Peppas) have limitations due to simplifying assumptions that reduce accuracy in complex geometries and release mechanisms. There's a need for more accurate predictive models that can work with limited experimental data.

Method: Proposed Physics-Informed Neural Networks (PINNs) and Bayesian PINNs (BPINNs) that embed Fick’s second law as loss function with 10,000 Latin-hypercube collocation points. Used experimental datasets to assess performance through MAE and RMSE metrics under noisy conditions and limited-data scenarios.
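
The physics loss is straightforward to sketch: Fick's second law, c_t = D·c_xx, is enforced by automatic differentiation at collocation points. A minimal PyTorch sketch, with net a hypothetical (x, t)-to-concentration network:

```python
import torch
from scipy.stats import qmc

def fick_residual(net, x, t, D=1.0):
    """PINN residual for Fick's second law, c_t = D * c_xx, at collocation
    points (x, t). `net` maps stacked (x, t) pairs to concentration c."""
    x = x.clone().requires_grad_(True)
    t = t.clone().requires_grad_(True)
    c = net(torch.stack([x, t], dim=-1)).squeeze(-1)
    c_t = torch.autograd.grad(c.sum(), t, create_graph=True)[0]
    c_x = torch.autograd.grad(c.sum(), x, create_graph=True)[0]
    c_xx = torch.autograd.grad(c_x.sum(), x, create_graph=True)[0]
    return c_t - D * c_xx  # driven to zero alongside the data-fitting loss

# 10,000 Latin-hypercube collocation points, as in the paper:
pts = torch.tensor(qmc.LatinHypercube(d=2).random(n=10_000), dtype=torch.float32)
x_col, t_col = pts[:, 0], pts[:, 1]
```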

Result: Reduced mean error by up to 40% relative to classical baselines across all film types. PINN achieved RMSE <0.05 using only the first 6% of release time data for planar films (94% reduction in experimental time). For wrinkled/crumpled films, it reached RMSE <0.05 with 33% of release time data. BPINNs provided better uncertainty quantification under noise.

Conclusion: The framework combining physical laws with experimental data enables highly accurate long-term release predictions from short-term measurements, offering accelerated characterization and more efficient early-stage drug release system formulation.

Abstract: Accurate modeling of drug release is essential for designing and developing controlled-release systems. Classical models (Fick, Higuchi, Peppas) rely on simplifying assumptions that limit their accuracy in complex geometries and release mechanisms. Here, we propose a novel approach using Physics-Informed Neural Networks (PINNs) and Bayesian PINNs (BPINNs) for predicting release from planar, 1D-wrinkled, and 2D-crumpled films. This approach uniquely integrates Fick’s diffusion law with limited experimental data to enable accurate long-term predictions from short-term measurements, and is systematically benchmarked against classical drug release models. We embedded Fick’s second law into the PINN as a loss with 10,000 Latin-hypercube collocation points and utilized previously published experimental datasets to assess drug release performance through mean absolute error (MAE) and root mean square error (RMSE), considering noisy conditions and limited-data scenarios. Our approach reduced mean error by up to 40% relative to classical baselines across all film types. The PINN formulation achieved RMSE <0.05 utilizing only the first 6% of the release time data (a 94% reduction in the release time required for the experiments) for the planar film. For wrinkled and crumpled films, the PINN reached RMSE <0.05 with 33% of the release time data. BPINNs provide tighter and more reliable uncertainty quantification under noise. By combining physical laws with experimental data, the proposed framework yields highly accurate long-term release predictions from short-term measurements, offering a practical route for accelerated characterization and more efficient early-stage drug release system formulation.

[429] Causal Identification in Multi-Task Demand Learning with Confounding

Varun Gupta, Vijay Kamble

Main category: cs.LG

TL;DR: A meta-learning framework for causal demand estimation with endogenous prices across many retail contexts with limited price variation per task.

Motivation: Retail pricing requires estimating heterogeneous price-response functions across many contexts, but each context has limited historical price variation and prices are endogenous (chosen by managers/algorithms), making causal identification challenging.

Method: Proposes Decision-Conditioned Masked-Outcome Meta-Learning (DCMOML), which carefully designs the meta-learner’s information set to leverage cross-task heterogeneity while accounting for endogenous decision histories, enabling causal identification under mild restrictions on price adaptivity.

Result: The method identifies the conditional mean of task-specific causal parameters given the designed information set, providing guarantees for large-scale demand estimation with endogenous prices and small per-task samples.

Conclusion: Offers a principled foundation for deploying causal, data-driven pricing models in operational environments by addressing the fundamental challenge of endogeneity in multi-task demand learning.

Abstract: We study a canonical multi-task demand learning problem motivated by retail pricing, in which a firm seeks to estimate heterogeneous linear price-response functions across a large collection of decision contexts. Each context is characterized by rich observable covariates yet typically exhibits only limited historical price variation, motivating the use of multi-task learning to borrow strength across tasks. A central challenge in this setting is endogeneity: historical prices are chosen by managers or algorithms and may be arbitrarily correlated with unobserved, task-level demand determinants. Under such confounding by latent fundamentals, commonly used approaches, such as pooled regression and meta-learning, fail to identify causal price effects. We propose a new estimation framework that achieves causal identification despite arbitrary dependence between prices and latent task structure. Our approach, Decision-Conditioned Masked-Outcome Meta-Learning (DCMOML), involves carefully designing the information set of a meta-learner to leverage cross-task heterogeneity while accounting for endogenous decision histories. Under a mild restriction on price adaptivity in each task, we establish that this method identifies the conditional mean of the task-specific causal parameters given the designed information set. Our results provide guarantees for large-scale demand estimation with endogenous prices and small per-task samples, offering a principled foundation for deploying causal, data-driven pricing models in operational environments.

[430] Supervised Metric Regularization Through Alternating Optimization for Multi-Regime Physics-Informed Neural Networks

Enzo Nicolas Spotorno, Josafat Ribeiro Leal, Antonio Augusto Frohlich

Main category: cs.LG

TL;DR: TAPINN improves PINNs for parameterized dynamical systems with sharp transitions by using supervised metric regularization to structure latent space and alternating optimization to manage gradient conflicts.

Motivation: Standard PINNs struggle with parameterized dynamical systems having sharp regime transitions (like bifurcations) due to spectral bias/mode collapse, where networks average distinct physical behaviors instead of learning proper mappings.

Method: Proposes Topology-Aware PINN (TAPINN) with supervised metric regularization to structure latent space, conditioning solver on latent states optimized for metric-based separation between regimes. Uses phase-based alternating optimization schedule to manage gradient conflicts between metric and physics objectives.
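
The phase-based schedule can be sketched in a few lines; the phase lengths below are made-up values, and the paper's actual schedule may differ.

```python
def alternating_schedule(step, metric_phase=200, physics_phase=800):
    """Phase-based alternating optimization, a sketch: within each cycle,
    first optimize the metric (latent-separation) loss, then the physics
    residual, so their conflicting gradients never mix in a single update."""
    return "metric" if step % (metric_phase + physics_phase) < metric_phase else "physics"

# Inside a training loop (losses assumed defined elsewhere):
# phase = alternating_schedule(step)
# loss = metric_loss if phase == "metric" else physics_loss
# loss.backward(); opt.step()
```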

Result: Achieves ~49% lower physics residual (0.082 vs 0.160), stable convergence with 2.18x lower gradient variance than multi-output Sobolev Error baseline, and 5x fewer parameters than hypernetwork-based alternative on Duffing Oscillator.

Conclusion: TAPINN effectively mitigates spectral bias in parameterized PINNs for systems with sharp transitions through structured latent space and careful optimization scheduling.

Abstract: Standard Physics-Informed Neural Networks (PINNs) often face challenges when modeling parameterized dynamical systems with sharp regime transitions, such as bifurcations. In these scenarios, the continuous mapping from parameters to solutions can result in spectral bias or “mode collapse”, where the network averages distinct physical behaviors. We propose a Topology-Aware PINN (TAPINN) that aims to mitigate this challenge by structuring the latent space via Supervised Metric Regularization. Unlike standard parametric PINNs that map physical parameters directly to solutions, our method conditions the solver on a latent state optimized to reflect the metric-based separation between regimes, showing ~49% lower physics residual (0.082 vs. 0.160). We train this architecture using a phase-based Alternating Optimization (AO) schedule to manage gradient conflicts between the metric and physics objectives. Preliminary experiments on the Duffing Oscillator demonstrate that while standard baselines suffer from spectral bias and high-capacity Hypernetworks overfit (memorizing data while violating physics), our approach achieves stable convergence with 2.18x lower gradient variance than a multi-output Sobolev Error baseline, and 5x fewer parameters than a hypernetwork-based alternative.

[431] Online Monitoring Framework for Automotive Time Series Data using JEPA Embeddings

Alexander Fertig, Karthikeyan Chandra Sekaran, Lakshman Balasubramanian, Michael Botsch

Main category: cs.LG

TL;DR: Self-supervised JEPA-based framework for anomaly detection in autonomous vehicle object state representations without requiring anomaly labels.

Motivation: Need for online monitoring frameworks to ensure safe operation of autonomous vehicles, particularly to detect anomalies in object state representations without requiring anomaly labels which are usually unavailable for unknown anomalies.

Method: Uses self-supervised embedding method with JEPA-based prediction task to translate object data into latent representation space, then applies established anomaly detection methods on these expressive embeddings.
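
Downstream, the pipeline reduces to "embed, then score". A sketch using a frozen JEPA-pretrained encoder (an assumed callable returning 1-D feature arrays) and scikit-learn's IsolationForest standing in for the established detectors the paper mentions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def fit_monitor(encoder, nominal_tracks):
    """Fit a detector on nominal (anomaly-free) object-state embeddings only;
    `encoder` is a frozen JEPA-pretrained model, an assumption here."""
    z = np.stack([encoder(t) for t in nominal_tracks])  # (N, d) embeddings
    return IsolationForest(random_state=0).fit(z)

def anomaly_score(monitor, encoder, track):
    # Higher value = more anomalous object state.
    return -monitor.score_samples(encoder(track)[None])[0]
```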

Result: Framework demonstrated on nuScenes dataset, showing capability to detect anomalies in real-world autonomous driving scenarios without requiring labeled anomaly data.

Conclusion: Proposed framework enables anomaly detection for unknown anomalies in real-world autonomous vehicle operation where anomaly labels are unavailable, using self-supervised JEPA embeddings.

Abstract: As autonomous vehicles are rolled out, measures must be taken to ensure their safe operation. In order to supervise a system that is already in operation, monitoring frameworks are frequently employed. These run continuously online in the background, supervising the system status and recording anomalies. This work proposes an online monitoring framework to detect anomalies in object state representations. Here, a key challenge is creating a framework for anomaly detection without anomaly labels, which are usually unavailable for unknown anomalies. To address this issue, this work applies a self-supervised embedding method to translate object data into a latent representation space. For this, a JEPA-based self-supervised prediction task is constructed, allowing training without anomaly labels and the creation of rich object embeddings. The resulting expressive JEPA embeddings serve as input for established anomaly detection methods, in order to identify anomalies within object state representations. This framework is particularly useful for applications in real-world environments, where new or unknown anomalies may occur during operation for which there are no labels available. Experiments performed on the publicly available, real-world nuScenes dataset illustrate the framework’s capabilities.

[432] Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions

J Rosser, Robert Kirk, Edward Grefenstette, Jakob Foerster, Laura Ruis

Main category: cs.LG

TL;DR: Infusion framework uses influence functions to craft subtle training data perturbations that induce targeted model behavior changes through parameter shifts, evaluated on vision and language data poisoning tasks.

Motivation: To explore the reverse application of influence functions: instead of attributing model behavior to training data, craft training data that induces specific model behavior, addressing training data interpretability for both adversaries and defenders.

Method: Uses scalable influence-function approximations to compute small perturbations to training documents that induce targeted changes in model behavior through parameter shifts. Evaluated on data poisoning tasks across vision (CIFAR-10) and language domains.
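
A single-checkpoint, TracIn-flavored sketch of the core idea: ascend the influence of a training example on a target behavior with respect to the example's input. The paper uses scalable influence-function approximations; this toy first-order version is ours.

```python
import torch

def infusion_edit(model, loss_fn, x_train, y_train, x_tgt, y_tgt, eps=1e-3):
    """Nudge a training input so that a gradient step on it would lower the
    loss on the target behavior (i.e., raise the TracIn-style influence
    dot-product between training and target gradients). A sketch only."""
    x = x_train.clone().requires_grad_(True)
    g_tr = torch.autograd.grad(loss_fn(model(x), y_train),
                               model.parameters(), create_graph=True)
    g_tg = torch.autograd.grad(loss_fn(model(x_tgt), y_tgt), model.parameters())
    influence = sum((a * b).sum() for a, b in zip(g_tr, g_tg))
    grad_x = torch.autograd.grad(influence, x)[0]
    return (x + eps * grad_x.sign()).detach()  # small, subtle edit to the example
```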

Result: On CIFAR-10, subtle edits to just 0.2% of training data can be competitive with inserting explicit behavior examples. The approach transfers across architectures (ResNet ↔ CNN). In language experiments, it’s most effective at amplifying behaviors the model has already learned.

Conclusion: Small, subtle edits to training data can systematically shape model behavior, highlighting the importance of training data interpretability for both security (adversaries) and defense perspectives.

Abstract: Influence functions are commonly used to attribute model behavior to training documents. We explore the reverse: crafting training data that induces model behavior. Our framework, Infusion, uses scalable influence-function approximations to compute small perturbations to training documents that induce targeted changes in model behavior through parameter shifts. We evaluate Infusion on data poisoning tasks across vision and language domains. On CIFAR-10, we show that making subtle edits via Infusion to just 0.2% (100/45,000) of the training documents can be competitive with the baseline of inserting a small number of explicit behavior examples. We also find that Infusion transfers across architectures (ResNet ↔ CNN), suggesting a single poisoned corpus can affect multiple independently trained models. In preliminary language experiments, we characterize when our approach increases the probability of target behaviors and when it fails, finding it most effective at amplifying behaviors the model has already learned. Taken together, these results show that small, subtle edits to training data can systematically shape model behavior, underscoring the importance of training data interpretability for adversaries and defenders alike. We provide the code here: https://github.com/jrosseruk/infusion.

[433] Empirical Stability Analysis of Kolmogorov-Arnold Networks in Hard-Constrained Recurrent Physics-Informed Discovery

Enzo Nicolas Spotorno, Josafat Leal Filho, Antonio Augusto Medeiros Frohlich

Main category: cs.LG

TL;DR: Kolmogorov-Arnold Networks (KANs) integrated into hard-constrained physics-informed architectures show promise for learning residual manifolds in oscillatory systems, but exhibit hyperparameter fragility and instability compared to standard MLPs.

Motivation: Motivated by the Kolmogorov-Arnold representation theorem and preliminary gray-box results, researchers hypothesized that KANs would enable more efficient recovery of unknown terms in oscillatory systems compared to traditional MLPs when integrated into hard-constrained recurrent physics-informed neural networks.

Method: Integration of KANs into hard-constrained recurrent physics-informed neural network (HRPINN) architectures, with sensitivity analysis on configuration sensitivity, parameter scale, and training paradigms. Evaluation on univariate polynomial residuals (Duffing oscillator) and multiplicative terms (Van der Pol oscillator).
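
For readers unfamiliar with KANs: each edge carries a learnable univariate function and node outputs are sums, per the Kolmogorov-Arnold form. The toy layer below substitutes a small Fourier series for the original B-spline parametrization to keep the sketch short; that substitution is ours, not the paper's.

```python
import torch
import torch.nn as nn

class MiniKANLayer(nn.Module):
    """Toy KAN layer: each (input, output) edge gets its own learnable
    univariate function, here a short Fourier series instead of the
    B-splines used in the original KAN formulation."""
    def __init__(self, n_in, n_out, n_freq=5):
        super().__init__()
        self.coef = nn.Parameter(torch.randn(n_in, n_out, n_freq) * 0.1)
        self.register_buffer("freqs", torch.arange(1, n_freq + 1).float())

    def forward(self, x):  # x: (batch, n_in)
        basis = torch.sin(x[..., None] * self.freqs)       # (batch, n_in, n_freq)
        return torch.einsum("bif,iof->bo", basis, self.coef)  # sum over edges
```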

Result: Small KANs are competitive on univariate polynomial residuals (Duffing) but exhibit severe hyperparameter fragility, instability in deeper configurations, and consistent failure on multiplicative terms (Van der Pol), generally being outperformed by standard MLPs.

Conclusion: The additive inductive bias in the original KAN formulation has limitations for state coupling in oscillatory systems, providing preliminary empirical evidence of inductive bias limitations for future hybrid modeling approaches.

Abstract: We investigate the integration of Kolmogorov-Arnold Networks (KANs) into hard-constrained recurrent physics-informed architectures (HRPINN) to evaluate the fidelity of learned residual manifolds in oscillatory systems. Motivated by the Kolmogorov-Arnold representation theorem and preliminary gray-box results, we hypothesized that KANs would enable efficient recovery of unknown terms compared to MLPs. Through an initial sensitivity analysis over configurations, parameter scale, and training paradigm, we found that while small KANs are competitive on univariate polynomial residuals (Duffing), they exhibit severe hyperparameter fragility, instability in deeper configurations, and consistent failure on multiplicative terms (Van der Pol), and are generally outperformed by standard MLPs. These empirical challenges highlight limitations of the additive inductive bias in the original KAN formulation for state coupling and provide preliminary empirical evidence of inductive bias limitations for future hybrid modeling.

[434] Answer First, Reason Later: Aligning Search Relevance via Mode-Balanced Reinforcement Learning

Shijie Zhang, Xiang Guo, Rujun Guo, Shaoyu Liu, Xiaozhao Wang, Guanjun Jiang, Kevin Zhang

Main category: cs.LG

TL;DR: AFRL paradigm for search relevance models that outputs relevance score first, then explanation, enabling low latency while maintaining interpretability through mode-balanced optimization combining SFT and RL.

Motivation: Need to balance millisecond-level response requirements with interpretable reasoning of LLMs in search relevance tasks, avoiding mode collapse in RL training.

Method: Answer-First, Reason Later (AFRL) paradigm with SFT+RL pipeline, mode-balanced optimization using SFT auxiliary loss in Stepwise-GRPO, automated instruction evolution, and multi-stage curriculum.
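
Inference under AFRL is what buys the latency: the first generated token already carries the relevance score, so the explanation can be streamed or skipped. A sketch against the Hugging Face API, with the prompt format and names assumed:

```python
import torch

@torch.no_grad()
def relevance_first(model, tokenizer, query, doc):
    """Answer-first inference sketch (Hugging Face-style API; the prompt
    format is an assumption). The first new token is the relevance score."""
    prompt = f"Query: {query}\nDoc: {doc}\nRelevance:"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=1)
    score = tokenizer.decode(out[0, -1]).strip()  # served to the system immediately
    return score  # the explanation, if wanted, is generated after this token
```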

Result: 32B teacher model achieves SOTA performance, and AFRL enables efficient knowledge distillation to 0.6B model, reconciling reasoning depth with deployment latency.

Conclusion: AFRL paradigm successfully addresses latency-performance tradeoff in search relevance, with mode-balanced optimization preventing RL mode collapse while maintaining interpretability.

Abstract: Building a search relevance model that achieves both low latency and high performance is a long-standing challenge in the search industry. To satisfy the millisecond-level response requirements of online systems while retaining the interpretable reasoning traces of Large Language Models (LLMs), we propose a novel Answer-First, Reason Later (AFRL) paradigm. This paradigm requires the model to output the definitive relevance score in the very first token, followed by a structured logical explanation. Inspired by the success of reasoning models, we adopt a “Supervised Fine-Tuning (SFT) + Reinforcement Learning (RL)” pipeline to achieve AFRL. However, directly applying existing RL training often leads to mode collapse in the search relevance task, where the model forgets complex long-tail rules in pursuit of high rewards. From an information theory perspective: RL inherently minimizes the Reverse KL divergence, which tends to seek probability peaks (mode-seeking) and is prone to “reward hacking.” On the other hand, SFT minimizes the Forward KL divergence, forcing the model to cover the data distribution (mode-covering) and effectively anchoring expert rules. Based on this insight, we propose a Mode-Balanced Optimization strategy, incorporating an SFT auxiliary loss into Stepwise-GRPO training to balance these two properties. Furthermore, we construct an automated instruction evolution system and a multi-stage curriculum to ensure expert-level data quality. Extensive experiments demonstrate that our 32B teacher model achieves state-of-the-art performance. Moreover, the AFRL architecture enables efficient knowledge distillation, successfully transferring expert-level logic to a 0.6B model, thereby reconciling reasoning depth with deployment latency.

[435] A Task-Centric Theory for Iterative Self-Improvement with Easy-to-Hard Curricula

Chenruo Liu, Yijun Dong, Yiqiu Shen, Qi Lei

Main category: cs.LG

TL;DR: Theoretical analysis of iterative self-improvement for LLMs, showing finite-sample guarantees for reward improvement and explaining the feedback loop between model quality and data acceptance.

Motivation: While iterative self-improvement has shown empirical success in fine-tuning LLMs on their own reward-verified outputs, there's limited theoretical foundation for this generative, iterative procedure in practical finite-sample settings.

Method: Model each round of self-improvement as maximum-likelihood fine-tuning on a reward-filtered distribution, derive finite-sample guarantees for expected reward, and analyze the feedback loop between model quality and data acceptance. Also analyze task-centric reasoning with multiple difficulty levels.
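
The modeled procedure per round is simple enough to state as code; sample, reward, and finetune are hypothetical callables. Note how the acceptance count grows with model quality, which is exactly the feedback loop the analysis formalizes.

```python
def self_improvement_round(model, prompts, sample, reward, finetune, k=8, tau=1.0):
    """One round, as modeled in the paper: sample k completions per prompt,
    keep the reward-verified ones, then fine-tune by maximum likelihood on
    the filtered set. All callables are assumed stand-ins."""
    accepted = []
    for p in prompts:
        for y in sample(model, p, k):
            if reward(p, y) >= tau:       # the reward filter; better models
                accepted.append((p, y))   # accept more data per iteration
    return finetune(model, accepted), len(accepted)
```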

Result: Analysis reveals an explicit feedback loop where better models accept more data per iteration, supporting sustained self-improvement while explaining eventual saturation. For reasoning tasks, quantifiable conditions are proven where easy-to-hard curricula achieve better guarantees than fixed task mixtures.

Conclusion: Theoretical foundations for iterative self-improvement provide insights into the dynamics of self-improvement loops and curriculum learning strategies, validated through simulations and controlled experiments on reasoning tasks.

Abstract: Iterative self-improvement fine-tunes an autoregressive large language model (LLM) on reward-verified outputs generated by the LLM itself. In contrast to the empirical success of self-improvement, the theoretical foundation of this generative, iterative procedure in a practical, finite-sample setting remains limited. We make progress toward this goal by modeling each round of self-improvement as maximum-likelihood fine-tuning on a reward-filtered distribution and deriving finite-sample guarantees for the expected reward. Our analysis reveals an explicit feedback loop where better models accept more data per iteration, supporting sustained self-improvement while explaining eventual saturation of such improvement. Adopting a task-centric view by considering reasoning tasks with multiple difficulty levels, we further prove quantifiable conditions on model initialization, task difficulty, and sample budget where easy-to-hard curricula provably achieve better guarantees than training on fixed mixtures of tasks. Our analyses are validated via Monte-Carlo simulations and controlled experiments on graph-based reasoning tasks.

[436] ADORA: Training Reasoning Models with Dynamic Advantage Estimation on Reinforcement Learning

Qingnan Ren, Shiting Huang, Zhen Fang, Zehui Chen, Lin Chen, Lijun Li, Feng Zhao

Main category: cs.LG

TL;DR: ADORA introduces dynamic advantage estimation for reinforcement learning in reasoning tasks by adaptively categorizing training samples based on evolving utility during online rollouts, improving policy optimization efficiency.

Motivation: Current reinforcement learning methods for reasoning models use static advantage estimation, which leads to inefficient credit assignment by ignoring how sample utility changes over time. This results in suboptimal policy updates, slower convergence, and learning instability.

Method: ADORA dynamically adjusts advantage function weighting by adaptively categorizing training data into temporarily advantageous and disadvantageous samples based on their evolving utility during online model rollouts. This can be integrated into existing policy optimization algorithms without major architectural changes.
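
A sketch of the reweighting idea on top of a GRPO-style advantage; the categorization rule (here a running utility estimate) and the weights are stand-ins for the paper's richer scheme.

```python
import torch

def adora_advantages(rewards, utility, beta=0.9, w_up=1.0, w_down=0.5):
    """Dynamic advantage reweighting, a sketch. `rewards`: (G,) rewards for
    one rollout group; `utility`: running utility estimate carried across
    rollouts so sample usefulness can evolve over training."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)  # GRPO baseline
    utility = beta * utility + (1 - beta) * rewards            # evolving utility
    w = torch.full_like(rewards, w_down)
    w[rewards >= utility] = w_up   # "temporarily advantageous" samples upweighted
    return adv * w, utility
```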

Result: Extensive evaluations across diverse model families and data scales show ADORA significantly enhances long reasoning in geometric and mathematical tasks, achieving notable performance gains without requiring sensitive hyperparameter tuning.

Conclusion: ADORA is a robust and efficient framework that improves reinforcement learning for reasoning tasks by enabling more efficient policy updates through dynamic advantage estimation and adaptive sample prioritization.

Abstract: Reinforcement learning has become a cornerstone technique for developing reasoning models in complex tasks, ranging from mathematical problem-solving to imaginary reasoning. The optimization of these models typically relies on policy gradient methods, whose efficacy hinges on the accurate estimation of an advantage function. However, prevailing methods typically employ static advantage estimation, a practice that leads to inefficient credit assignment by neglecting the dynamic utility of training samples over time. This limitation results in suboptimal policy updates, which in turn manifest as slower convergence rates and increased learning instability, as models fail to adapt to evolving sample utilities effectively. To address this problem, we introduce ADORA (Advantage Dynamics via Online Rollout Adaptation), a novel framework for policy optimization. ADORA dynamically adjusts the advantage function’s weighting by adaptively categorizing training data into temporarily advantageous and disadvantageous samples, based on their evolving utility during online model rollouts. This tailored data differentiation strategy allows ADORA to be seamlessly integrated into existing policy optimization algorithms without significant architectural modifications, enabling the policy to prioritize learning from more informative experiences and thereby achieve more efficient policy updates. Extensive evaluations across diverse model families and varying data scales demonstrate that ADORA is a robust and efficient framework. It significantly enhances long reasoning in both geometric and mathematical tasks, consistently achieving notable performance gains without requiring sensitive hyperparameter tuning.

[437] Position: Message-passing and spectral GNNs are two sides of the same coin

Antonis Vasileiou, Juan Cervino, Pascal Frossard, Charilaos I. Kanatsoulis, Christopher Morris, Michael T. Schaub, Pierre Vandergheynst, Zhiyang Wang, Guy Wolf, Ron Levie

Main category: cs.LG

TL;DR: This paper argues that the artificial divide between message-passing neural networks (MPNNs) and spectral graph neural networks hinders progress, proposing they should be understood as different parametrizations of permutation-equivariant operators on graph signals.

Motivation: The paper identifies an artificial divide between two major GNN traditions (MPNNs from ML and spectral GNNs from signal processing) that hinders progress in graph learning. The authors aim to bridge this gap by providing a unified theoretical framework.

Method: The authors propose viewing both MPNNs and spectral GNNs as different parametrizations of permutation-equivariant operators acting on graph signals. They analyze expressive power equivalences and identify where genuine gaps exist, while highlighting complementary strengths of both approaches.
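
The unification is concrete: a degree-K polynomial spectral filter in a normalized adjacency is computable by exactly K rounds of message passing, as in this sketch.

```python
import torch

def poly_spectral_filter(A_hat, x, theta):
    """y = sum_k theta[k] * A_hat^k x: a polynomial spectral filter in the
    normalized adjacency A_hat, computed purely by repeated neighbor
    aggregation, i.e., message passing."""
    y, h = theta[0] * x, x
    for t_k in theta[1:]:
        h = A_hat @ h       # one message-passing round
        y = y + t_k * h     # accumulate the next polynomial term
    return y
```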

Result: The analysis shows that many popular GNN architectures are equivalent in expressive power, with genuine gaps arising only in specific regimes. MPNNs excel at discrete structure analysis using logic/graph isomorphism tools, while spectral GNNs provide principled tools for understanding smoothing, bottlenecks, stability, and community structure.

Conclusion: Progress in graph learning will be accelerated by understanding key similarities/differences between MPNNs and spectral GNNs and working toward unifying them within a common theoretical framework rather than treating them as competing paradigms.

Abstract: Graph neural networks (GNNs) are commonly divided into message-passing neural networks (MPNNs) and spectral graph neural networks, reflecting two largely separate research traditions in machine learning and signal processing. This paper argues that this divide is mostly artificial, hindering progress in the field. We propose a viewpoint in which both MPNNs and spectral GNNs are understood as different parametrizations of permutation-equivariant operators acting on graph signals. From this perspective, many popular architectures are equivalent in expressive power, while genuine gaps arise only in specific regimes. We further argue that MPNNs and spectral GNNs offer complementary strengths. That is, MPNNs provide a natural language for discrete structure and expressivity analysis using tools from logic and graph isomorphism research, while the spectral perspective provides principled tools for understanding smoothing, bottlenecks, stability, and community structure. Overall, we posit that progress in graph learning will be accelerated by clearly understanding the key similarities and differences between these two types of GNNs, and by working towards unifying these perspectives within a common theoretical and conceptual framework rather than treating them as competing paradigms.

[438] Optimistic World Models: Efficient Exploration in Model-Based Deep Reinforcement Learning

Akshay Mete, Shahid Aamir Sheikh, Tzu-Hsiang Lin, Dileep Kalathil, P. R. Kumar

Main category: cs.LG

TL;DR: Optimistic World Models (OWMs) introduce a gradient-based optimistic exploration framework for RL that biases world model learning toward higher-reward outcomes without needing uncertainty estimates or constrained optimization.

Motivation: Efficient exploration in sparse-reward RL environments remains challenging. Current UCB-style methods have limitations, and there's a need for scalable, principled exploration methods that can work with modern world model architectures.

Method: OWMs incorporate optimism directly into model learning through an optimistic dynamics loss that biases imagined transitions toward higher-reward outcomes. This gradient-based approach requires no uncertainty estimates or constrained optimization, making it plug-and-play with existing world model frameworks like DreamerV3 and STORM.
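
The loss modification is a one-liner in spirit: bias the world-model objective by imagined returns. A hedged sketch (the weighting and exact functional form are assumptions):

```python
def optimistic_model_loss(nll, imagined_returns, alpha=0.1):
    """Optimistic dynamics loss, a sketch: the usual world-model negative
    log-likelihood minus a reward-bias term, so learned dynamics tilt
    imagined transitions toward higher-reward outcomes (RBMLE-style).
    `alpha` sets the degree of optimism; both inputs are differentiable
    tensors produced elsewhere in the world-model training loop."""
    return nll - alpha * imagined_returns.mean()
```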

Result: Optimistic DreamerV3 and Optimistic STORM demonstrate significant improvements in sample efficiency and cumulative return compared to baseline counterparts, showing the effectiveness of the optimistic exploration approach.

Conclusion: OWMs provide a scalable, gradient-based framework for optimistic exploration in RL that can be easily integrated with existing world model architectures, offering improved performance in sparse-reward environments.

Abstract: Efficient exploration remains a central challenge in reinforcement learning (RL), particularly in sparse-reward environments. We introduce Optimistic World Models (OWMs), a principled and scalable framework for optimistic exploration that brings classical reward-biased maximum likelihood estimation (RBMLE) from adaptive control into deep RL. In contrast to upper confidence bound (UCB)-style exploration methods, OWMs incorporate optimism directly into model learning by augmentation with an optimistic dynamics loss that biases imagined transitions toward higher-reward outcomes. This fully gradient-based loss requires neither uncertainty estimates nor constrained optimization. Our approach is plug-and-play with existing world model frameworks, preserving scalability while requiring only minimal modifications to standard training procedures. We instantiate OWMs within two state-of-the-art world model architectures, leading to Optimistic DreamerV3 and Optimistic STORM, which demonstrate significant improvements in sample efficiency and cumulative return compared to their baseline counterparts.

[439] Effectiveness of Binary Autoencoders for QUBO-Based Optimization Problems

Tetsuro Abe, Masashi Yamashita, Shu Tanaka

Main category: cs.LG

TL;DR: The paper analyzes how binary autoencoders improve black-box combinatorial optimization by learning latent representations that better preserve neighborhood structures and feasibility constraints compared to manual encodings.

Motivation: In black-box combinatorial optimization with expensive evaluations, factorization machines with quantum annealing (FMQA) require binary decision variables. For non-binary structures like integer permutations, manual binary encodings often fail to preserve original neighborhood structures, leading to inefficient search and infeasible solutions that waste evaluations.

Method: The study combines FMQA with binary autoencoders (bAE) that learn compact binary latent codes from feasible solutions. Using a small traveling salesman problem as an interpretable testbed, the authors analyze how bAE representations compare to manually designed encodings in terms of geometric properties like alignment between tour distances and latent Hamming distances, neighborhood smoothness, and local optima.
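
A minimal binary autoencoder with a straight-through estimator, the standard way to keep a hard binary latent code differentiable; the architecture details below are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class BinaryAE(nn.Module):
    """Binary autoencoder sketch: the straight-through estimator binarizes
    the latent code so it can be handed to an Ising machine / FMQA surrogate
    while gradients still flow during training."""
    def __init__(self, n_in, n_bits):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_in, 64), nn.ReLU(), nn.Linear(64, n_bits))
        self.dec = nn.Sequential(nn.Linear(n_bits, 64), nn.ReLU(), nn.Linear(64, n_in))

    def forward(self, x):
        logits = self.enc(x)
        soft = torch.sigmoid(logits)
        hard = (logits > 0).float()
        z = hard + soft - soft.detach()   # straight-through estimator
        return self.dec(z), hard          # reconstruction + binary code for FMQA
```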

Result: The bAE accurately reconstructs feasible tours and, compared to manual encodings at similar compression, better aligns tour distances with latent Hamming distances, yields smoother neighborhoods under small bit flips, and produces fewer local optima. These geometric properties explain why bAE+FMQA improves approximation ratios faster while maintaining feasibility throughout optimization.

Conclusion: Binary autoencoders learn latent representations that better preserve the geometric structure of combinatorial optimization problems, leading to more efficient black-box optimization. The analysis provides guidance for designing latent representations that maintain feasibility and neighborhood properties for improved search performance.

Abstract: In black-box combinatorial optimization, objective evaluations are often expensive, so high quality solutions must be found under a limited budget. Factorization machine with quantum annealing (FMQA) builds a quadratic surrogate model from evaluated samples and optimizes it on an Ising machine. However, FMQA requires binary decision variables, and for nonbinary structures such as integer permutations, the choice of binary encoding strongly affects search efficiency. If the encoding fails to reflect the original neighborhood structure, small Hamming moves may not correspond to meaningful modifications in the original solution space, and constrained problems can yield many infeasible candidates that waste evaluations. Recent work combines FMQA with a binary autoencoder (bAE) that learns a compact binary latent code from feasible solutions, yet the mechanism behind its performance gains is unclear. Using a small traveling salesman problem as an interpretable testbed, we show that the bAE reconstructs feasible tours accurately and, compared with manually designed encodings at similar compression, better aligns tour distances with latent Hamming distances, yields smoother neighborhoods under small bit flips, and produces fewer local optima. These geometric properties explain why bAE+FMQA improves the approximation ratio faster while maintaining feasibility throughout optimization, and they provide guidance for designing latent representations for black-box optimization.

[440] Long Chain-of-Thought Compression via Fine-Grained Group Policy Optimization

Xinchen Han, Hossam Afifi, Michel Marot, Xilu Wang, Lu Yin

Main category: cs.LG

TL;DR: FGO is a reinforcement learning algorithm that compresses verbose Chain-of-Thought reasoning in LLMs by refining group responses through subdivision and weighted assignment based on length and entropy.

Motivation: LLMs often generate unnecessarily verbose Chain-of-Thought reasoning that increases computational costs and latency without proportional performance gains. There's a need to compress CoT reasoning efficiently while maintaining performance.

Method: FGO (Fine-grained Group policy Optimization) is a Reinforcement Learning algorithm that refines group responses by subdividing them and assigning appropriate weights based on length and entropy. It addresses limitations of Group Relative Policy Optimization (GRPO) including inefficient data utilization and entropy collapse.
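
A sketch of the fine-grained weighting, splitting one response into segments and scoring each by entropy and length; the exact functional form is not given in this summary, so the rule below is illustrative only.

```python
import torch

def fgo_segment_weights(token_entropies, seg_len=64, gamma=0.5):
    """Illustrative fine-grained weighting: split a response's per-token
    entropies into segments, weight each by mean entropy, and penalize
    longer segments so padded-out reasoning loses credit."""
    segs = token_entropies.split(seg_len)
    w = torch.stack([s.mean() for s in segs])          # entropy term
    w = w / (w.sum() + 1e-6)
    lengths = torch.tensor([len(s) for s in segs], dtype=torch.float)
    return w * (lengths / seg_len).pow(-gamma)         # shorter segments favored
```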

Result: FGO achieves efficient CoT compression without degrading performance on multiple reasoning benchmarks (MATH500, AIME24, AMC23, Minerva). It successfully resolves key limitations of GRPO while maintaining reasoning quality.

Conclusion: FGO provides an effective solution for compressing verbose CoT reasoning in LLMs, reducing computational costs and latency while preserving reasoning performance, and addresses fundamental limitations of previous group policy optimization methods.

Abstract: Large Language Models (LLMs) often generate unnecessarily verbose Chain-of-Thought (CoT) reasoning that increases computational costs and latency without proportional performance gains. In this paper, we propose Fine-grained Group policy Optimization (FGO), a Reinforcement Learning (RL) algorithm that refines group responses by subdividing them and assigning appropriate weights based on length and entropy, thereby enabling effective CoT compression. Meanwhile, as an enhanced variant of Group Relative Policy Optimization (GRPO), FGO successfully addresses two major limitations of the GRPO: inefficient data utilization and entropy collapse. We evaluate FGO on multiple reasoning LLMs and benchmarks, including MATH500, AIME24, AMC23, and Minerva. Experimental results show that FGO achieves efficient CoT compression without degrading performance, and simultaneously resolves the key limitations of GRPO.

[441] WildCat: Near-Linear Attention in Theory and Practice

Tobias Schröder, Lester Mackey

Main category: cs.LG

TL;DR: WildCat is a method for compressing attention mechanisms using spectrally-accurate coreset subsampling with random pivoted Cholesky, achieving near-linear runtime with strong error guarantees.

Motivation: Attention mechanisms are computationally expensive with quadratic scaling in sequence length, making them costly to deploy. Existing approximations either lack error guarantees or require quadratic runtime for high fidelity.

Method: WildCat uses random pivoted Cholesky to select a small weighted coreset for attention computation. It optimally weights elements to minimize reconstruction error while avoiding quadratic costs.
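
Randomly pivoted Cholesky itself is short. The sketch below selects a k-element coreset by sampling pivots proportional to the residual diagonal, touching only k rows of the kernel; how the resulting factor is applied inside attention is omitted here.

```python
import numpy as np

def rp_cholesky(kernel_row, diag, k, seed=0):
    """Randomly pivoted Cholesky: choose k coreset indices with probability
    proportional to the residual diagonal of the kernel. `kernel_row(i)`
    returns row i of the n x n kernel and `diag` its diagonal, so the full
    matrix is never materialized."""
    rng = np.random.default_rng(seed)
    d = diag.astype(float).copy()
    n = len(d)
    F = np.zeros((k, n))
    idx = []
    for t in range(k):
        i = rng.choice(n, p=d / d.sum())             # sample a pivot
        row = kernel_row(i) - F[:t].T @ F[:t, i]     # residual kernel row
        F[t] = row / np.sqrt(max(row[i], 1e-12))     # new Cholesky factor row
        d = np.maximum(d - F[t] ** 2, 0.0)           # update residual diagonal
        idx.append(i)
    return idx, F                                    # coreset indices, low-rank factor
```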

Result: WildCat achieves super-polynomial error decay (O(n^{-√log(log(n))})) with near-linear runtime (O(n^{1+o(1)})). GPU-optimized implementation shows benefits for image generation, classification, and language model KV cache compression.

Conclusion: WildCat provides a practical, high-accuracy approach to attention compression with strong theoretical guarantees, enabling efficient deployment of attention-based models across various applications.

Abstract: We introduce WildCat, a high-accuracy, low-cost approach to compressing the attention mechanism in neural networks. While attention is a staple of modern network architectures, it is also notoriously expensive to deploy due to resource requirements that scale quadratically with the input sequence length $n$. WildCat avoids these quadratic costs by only attending over a small weighted coreset. Crucially, we select the coreset using a fast but spectrally-accurate subsampling algorithm – randomly pivoted Cholesky – and weight the elements optimally to minimise reconstruction error. Remarkably, given bounded inputs, WildCat approximates exact attention with super-polynomial $O(n^{-\sqrt{\log(\log(n))}})$ error decay while running in near-linear $O(n^{1+o(1)})$ time. In contrast, prior practical approximations either lack error guarantees or require quadratic runtime to guarantee such high fidelity. We couple this advance with a GPU-optimized PyTorch implementation and a suite of benchmark experiments demonstrating the benefits of WildCat for image generation, image classification, and language model KV cache compression.

[442] Vendi Novelty Scores for Out-of-Distribution Detection

Amey P. Pasarkar, Adji Bousso Dieng

Main category: cs.LG

TL;DR: Vendi Novelty Score (VNS) is a new OOD detection method that measures how much a test sample increases the diversity (Vendi Score) of in-distribution features, achieving state-of-the-art performance without density modeling.

Motivation: Existing OOD detectors rely on model confidence scores or likelihood estimates with restrictive distributional assumptions. The authors propose a third paradigm based on diversity to overcome these limitations.

Method: VNS quantifies how much a test sample increases the Vendi Score (similarity-based diversity metric) of the in-distribution feature set. It’s linear-time, non-parametric, and combines class-conditional and dataset-level novelty signals.
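
The score is easy to reproduce from its definition: the Vendi Score is the exponentiated Shannon entropy of the eigenvalues of the normalized similarity matrix, and VNS is the increase from adding the test sample. A NumPy sketch with a cosine kernel (the kernel choice is an assumption):

```python
import numpy as np

def vendi_score(X):
    """Vendi Score of a feature set: exp(entropy of the eigenvalues of the
    normalized similarity matrix), using cosine similarity here."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    K = Xn @ Xn.T / len(X)
    lam = np.linalg.eigvalsh(K)
    lam = lam[lam > 1e-12]
    return np.exp(-(lam * np.log(lam)).sum())

def vendi_novelty(train_feats, test_feat):
    """How much does one sample increase the diversity of the in-distribution
    set? Larger increase = more novel = more likely OOD."""
    return vendi_score(np.vstack([train_feats, test_feat[None]])) - vendi_score(train_feats)
```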

Result: VNS achieves state-of-the-art OOD detection performance across multiple image classification benchmarks and network architectures. Remarkably, it retains performance with only 1% of training data.

Conclusion: VNS provides a principled, diversity-based approach to OOD detection that doesn’t require density modeling, works well with limited data, and is suitable for memory/access-constrained settings.

Abstract: Out-of-distribution (OOD) detection is critical for the safe deployment of machine learning systems. Existing post-hoc detectors typically rely on model confidence scores or likelihood estimates in feature space, often under restrictive distributional assumptions. In this work, we introduce a third paradigm and formulate OOD detection from a diversity perspective. We propose the Vendi Novelty Score (VNS), an OOD detector based on the Vendi Scores (VS), a family of similarity-based diversity metrics. VNS quantifies how much a test sample increases the VS of the in-distribution feature set, providing a principled notion of novelty that does not require density modeling. VNS is linear-time, non-parametric, and naturally combines class-conditional (local) and dataset-level (global) novelty signals. Across multiple image classification benchmarks and network architectures, VNS achieves state-of-the-art OOD detection performance. Remarkably, VNS retains this performance when computed using only 1% of the training data, enabling deployment in memory- or access-constrained settings.

[443] Step-resolved data attribution for looped transformers

Georgios Kaissis, David Mildenberger, Juan Felipe Gomez, Martin J. Menten, Eleni Triantafillou

Main category: cs.LG

TL;DR: SDI decomposes training data influence across recurrent loop iterations in transformers, providing per-step insights into latent reasoning processes.

Motivation: Existing influence estimators aggregate scores across all loop iterations in recurrent transformers, obscuring when during the recurrent computation a training example matters for latent reasoning.

Method: Step-Decomposed Influence (SDI) decomposes TracIn into length-τ influence trajectories by unrolling recurrent computation graphs and attributing influence to specific loop iterations, with TensorSketch implementation to avoid materializing per-example gradients.
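
Conceptually, SDI replaces one TracIn dot product with a per-iteration trajectory. The naive sketch below (interfaces assumed; no TensorSketch, so it materializes the gradients the paper deliberately avoids) conveys the decomposition.

```python
import torch

def sdi_trajectory(model, loss_fn, z_train, z_test, tau):
    """Naive SDI sketch. `loss_fn(model, z, steps=t)` is an assumed interface
    that runs the shared looped block for t iterations. We take the test-loss
    gradient at each unroll depth, dot it with the training gradient, and
    report per-step increments as the influence of each loop iteration."""
    g_tr = torch.autograd.grad(loss_fn(model, z_train, steps=tau), model.parameters())
    cum, traj = 0.0, []
    for t in range(1, tau + 1):
        g_t = torch.autograd.grad(loss_fn(model, z_test, steps=t), model.parameters())
        dot = sum((a * b).sum() for a, b in zip(g_tr, g_t)).item()
        traj.append(dot - cum)  # influence attributed to loop iteration t
        cum = dot
    return traj
```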

Result: SDI scales excellently on looped GPT-style models and algorithmic reasoning tasks, matches full-gradient baselines with low error, and supports data attribution and interpretability tasks with per-step insights.

Conclusion: SDI provides fine-grained influence analysis across recurrent iterations, enabling better understanding of how training examples shape internal computation in looped transformers during latent reasoning.

Abstract: We study how individual training examples shape the internal computation of looped transformers, where a shared block is applied for τ recurrent iterations to enable latent reasoning. Existing training-data influence estimators such as TracIn yield a single scalar score that aggregates over all loop iterations, obscuring when during the recurrent computation a training example matters. We introduce Step-Decomposed Influence (SDI), which decomposes TracIn into a length-τ influence trajectory by unrolling the recurrent computation graph and attributing influence to specific loop iterations. To make SDI practical at transformer scale, we propose a TensorSketch implementation that never materialises per-example gradients. Experiments on looped GPT-style models and algorithmic reasoning tasks show that SDI scales excellently, matches full-gradient baselines with low error and supports a broad range of data attribution and interpretability tasks with per-step insights into the latent reasoning process.

[444] Features as Rewards: Scalable Supervision for Open-Ended Tasks via Interpretability

Aaditya Vikram Prasad, Connor Watts, Jack Merullo, Dhruvil Gala, Owen Lewis, Thomas McGrath, Ekdeep Singh Lubana

Main category: cs.LG

TL;DR: RLFR uses language model features as reward functions for reinforcement learning to reduce hallucinations, achieving 58% reduction while maintaining benchmark performance.

Motivation: Language models learn abstract features that can be used for more than just monitoring: they can serve as scalable supervision for open-ended tasks like hallucination reduction.

Method: Reinforcement Learning from Feature Rewards (RLFR) pipeline: 1) Novel probing framework identifies candidate hallucinated claims, 2) Uses model features as reward functions for RL, 3) Teaches model to intervene and correct completions when uncertain, 4) Enables scalable test-time compute guided by reward features.
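
The reward itself can be as simple as a probe readout. A sketch in which a linear factuality probe over hidden states (trained separately; an assumption standing in for the paper's probing framework) produces the scalar reward:

```python
import torch

def feature_reward(hidden_states, probe_w, probe_b):
    """Feature-as-reward sketch: a linear probe over the policy model's
    hidden states scores the factuality of a completion, and that score is
    used directly as the RL reward. `probe_w`/`probe_b` come from a
    separately trained probe (hypothetical names)."""
    pooled = hidden_states.mean(dim=0)                 # (d,) pooled over tokens
    return torch.sigmoid(pooled @ probe_w + probe_b)   # reward in [0, 1]
```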

Result: Operationalized on Gemma-3-12B-IT, resulting policy is 58% less likely to hallucinate compared to original model while preserving performance on standard benchmarks.

Conclusion: Features can serve as scalable supervision for open-ended tasks, introducing a novel paradigm for using interpretability in learning.

Abstract: Language models trained on large-scale datasets have been shown to learn features that encode abstract concepts such as factuality or intent. Such features are traditionally used for test-time monitoring or steering. We present an alternative affordance: features as scalable supervision for open-ended tasks. We consider the case of hallucination-reduction as a desirable, yet open-ended behavior and design a reinforcement learning (RL) pipeline, titled RLFR (Reinforcement Learning from Feature Rewards), that uses features as reward functions. Grounded in a novel probing framework that identifies candidate hallucinated claims, our pipeline teaches a model to intervene and correct its completions when it is uncertain of their factuality. Furthermore, the pipeline enables scalable test-time compute, guided once more by our reward features. This end-to-end process operationalized on Gemma-3-12B-IT results in a policy that is 58% less likely to hallucinate compared to the original model, while preserving performance on standard benchmarks. Taken together, by grounding supervision in the language of features, this paper introduces a novel paradigm in the use of interpretability for learning open-ended tasks.

[445] Learning on the Manifold: Unlocking Standard Diffusion Transformers with Representation Encoders

Amandeep Kumar, Vishal M. Patel

Main category: cs.LG

TL;DR: Riemannian Flow Matching with Jacobi Regularization (RJF) enables diffusion transformers to converge on representation encoder features by addressing geometric interference through manifold-constrained geodesics.

DetailsMotivation: Standard diffusion transformers fail to converge on representation encoder features due to geometric interference - Euclidean flow matching forces paths through low-density interior of hyperspherical feature space rather than following the manifold surface.

Method: Proposes Riemannian Flow Matching with Jacobi Regularization (RJF) that constrains generative process to manifold geodesics and corrects for curvature-induced error propagation, enabling standard DiT architectures to converge without width scaling.
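
The manifold-constrained interpolation at the heart of hyperspherical flow matching can be sketched with spherical linear interpolation (slerp), which keeps probability paths on the unit sphere instead of cutting through its interior. This is a minimal sketch of the geodesic path only; the Jacobi regularization term and exact training objective follow the paper.

```python
import torch
import torch.nn.functional as F

def slerp(x0: torch.Tensor, x1: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Geodesic (great-circle) interpolation between unit vectors x0, x1.

    Unlike the straight Euclidean path, the slerp path stays on the
    hypersphere rather than passing through its low-density interior.
    x0, x1: (batch, d) unit-norm endpoints; t: (batch, 1) in [0, 1].
    """
    cos_theta = (x0 * x1).sum(-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7)
    theta = torch.arccos(cos_theta)
    sin_theta = torch.sin(theta)
    return (torch.sin((1 - t) * theta) * x0 + torch.sin(t * theta) * x1) / sin_theta

# Noise and data both live on the unit sphere of encoder-feature space.
x0 = F.normalize(torch.randn(8, 768), dim=-1)   # noise endpoint
x1 = F.normalize(torch.randn(8, 768), dim=-1)   # feature endpoint
t = torch.rand(8, 1)
x_t = slerp(x0, x1, t)
print(x_t.norm(dim=-1))  # ~1.0: the path never leaves the manifold
```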

Result: RJF enables standard DiT-B architecture (131M parameters) to converge effectively, achieving FID of 3.37 where prior methods fail to converge.

Conclusion: Geometric interference is the fundamental cause of convergence failure in diffusion transformers on representation features, and RJF provides an effective solution by working on the manifold geometry.

Abstract: Leveraging representation encoders for generative modeling offers a path for efficient, high-fidelity synthesis. However, standard diffusion transformers fail to converge on these representations directly. While recent work attributes this to a capacity bottleneck, proposing computationally expensive width scaling of diffusion transformers, we demonstrate that the failure is fundamentally geometric. We identify Geometric Interference as the root cause: standard Euclidean flow matching forces probability paths through the low-density interior of the hyperspherical feature space of representation encoders, rather than following the manifold surface. To resolve this, we propose Riemannian Flow Matching with Jacobi Regularization (RJF). By constraining the generative process to the manifold geodesics and correcting for curvature-induced error propagation, RJF enables standard Diffusion Transformer architectures to converge without width scaling. Our method enables the standard DiT-B architecture (131M parameters) to converge effectively, achieving an FID of 3.37 where prior methods fail to converge. Code: https://github.com/amandpkr/RJF

[446] Biases in the Blind Spot: Detecting What LLMs Fail to Mention

Iván Arcuschin, David Chanin, Adrià Garriga-Alonso, Oana-Maria Camburu

Main category: cs.LG

TL;DR: Automated pipeline for detecting unverbalized biases in LLMs’ chain-of-thought reasoning using statistical testing and concept generation.

DetailsMotivation: LLMs often provide plausible chain-of-thought reasoning that may hide internal biases, making monitoring via stated reasoning unreliable. Existing bias evaluations require predefined categories and hand-crafted datasets, limiting scalability.

Method: Fully automated black-box pipeline that: 1) uses LLM autoraters to generate candidate bias concepts from task datasets, 2) tests each concept by generating positive/negative variations on progressively larger samples, 3) applies statistical techniques for multiple testing and early stopping, and 4) flags concepts as unverbalized biases if they yield statistically significant performance differences without being cited in CoTs.
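
Step 4's statistical gate can be illustrated as a two-sample test per concept followed by a multiple-testing correction. A minimal sketch assuming t-tests and Benjamini-Hochberg correction; the paper's pipeline additionally handles progressive sampling, early stopping, and the check that the concept is absent from the CoTs.

```python
from scipy import stats
import numpy as np

def flag_biases(concept_scores: dict, alpha: float = 0.05) -> list:
    """concept_scores maps a concept name to (pos, neg): per-example task
    scores on positive vs. negative variations of that concept.
    Returns concepts whose score gap survives Benjamini-Hochberg correction.
    """
    names = list(concept_scores)
    pvals = [stats.ttest_ind(*concept_scores[c]).pvalue for c in names]
    order = np.argsort(pvals)
    m = len(pvals)
    flagged = set()
    # Benjamini-Hochberg: keep the largest prefix with p_(k) <= k/m * alpha.
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= rank / m * alpha:
            flagged.update(order[:rank])
    return [names[i] for i in flagged]

rng = np.random.default_rng(0)
scores = {
    "spanish_fluency": (rng.normal(0.70, 0.1, 200), rng.normal(0.55, 0.1, 200)),
    "font_choice":     (rng.normal(0.60, 0.1, 200), rng.normal(0.60, 0.1, 200)),
}
print(flag_biases(scores))  # only the concept with a real gap is flagged
```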

Result: The pipeline automatically discovered previously unknown biases in six LLMs across three decision tasks (hiring, loan approval, university admissions), including Spanish fluency, English proficiency, and writing formality. It also validated biases manually identified by prior work (gender, race, religion, ethnicity).

Conclusion: The approach provides a practical, scalable path to automatic task-specific bias discovery in LLMs, moving beyond reliance on predefined bias categories and manual dataset creation.

Abstract: Large Language Models (LLMs) often provide chain-of-thought (CoT) reasoning traces that appear plausible, but may hide internal biases. We call these unverbalized biases. Monitoring models via their stated reasoning is therefore unreliable, and existing bias evaluations typically require predefined categories and hand-crafted datasets. In this work, we introduce a fully automated, black-box pipeline for detecting task-specific unverbalized biases. Given a task dataset, the pipeline uses LLM autoraters to generate candidate bias concepts. It then tests each concept on progressively larger input samples by generating positive and negative variations, and applies statistical techniques for multiple testing and early stopping. A concept is flagged as an unverbalized bias if it yields statistically significant performance differences while not being cited as justification in the model’s CoTs. We evaluate our pipeline across six LLMs on three decision tasks (hiring, loan approval, and university admissions). Our technique automatically discovers previously unknown biases in these models (e.g., Spanish fluency, English proficiency, writing formality). In the same run, the pipeline also validates biases that were manually identified by prior work (gender, race, religion, ethnicity). More broadly, our proposed approach provides a practical, scalable path to automatic task-specific bias discovery.

[447] Towards Explainable Federated Learning: Understanding the Impact of Differential Privacy

Júlio Oliveira, Rodrigo Ferreira, André Riker, Glaucio H. S. Carvalho, Eirini Eleni Tsilopoulou

Main category: cs.LG

TL;DR: FEXT-DP: A federated learning system using decision trees with differential privacy that balances privacy protection and explainability while analyzing the trade-off between DP and interpretability.

DetailsMotivation: Modern ML systems need both data privacy (via federated learning and differential privacy) and explainability. The paper aims to create an ML model that combines enhanced data privacy with explainability, addressing the tension between privacy protection and interpretability.

Method: Proposes Federated EXplainable Trees with Differential Privacy (FEXT-DP) - a federated learning system based on decision trees (for lightweight operation and superior explainability) with an additional differential privacy layer. Analyzes the impact of DP on explainability.

Result: The performance assessment shows faster training (fewer rounds), lower Mean Squared Error, and better explainability compared to neural network-based FL systems.

Conclusion: FEXT-DP successfully combines privacy and explainability using tree-based federated learning with DP, though DP negatively impacts explainability, highlighting the privacy-explainability trade-off that needs careful consideration.

Abstract: Data privacy and eXplainable Artificial Intelligence (XAI) are two important aspects of modern Machine Learning systems. To enhance data privacy, recent machine learning models have been designed as Federated Learning (FL) systems. On top of that, additional privacy layers can be added via Differential Privacy (DP). On the other hand, to improve explainability, ML must consider more interpretable approaches with a reduced number of features and a less complex internal architecture. In this context, this paper aims to achieve a machine learning (ML) model that combines enhanced data privacy with explainability. We propose an FL solution, called Federated EXplainable Trees with Differential Privacy (FEXT-DP), that: (i) is based on Decision Trees, since they are lightweight and offer better explainability than neural network-based FL systems; and (ii) provides an additional layer of data privacy protection by applying Differential Privacy (DP) to the tree-based model. However, adding DP has a side effect: it harms the explainability of the system. This paper therefore also presents the impact of DP protection on the explainability of the ML model. The performance assessment carried out shows improvements of FEXT-DP in terms of faster training (i.e., fewer rounds), Mean Squared Error, and explainability.

[448] EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning

Wujiang Xu, Wentian Zhao, Zhenting Wang, Yu-Jhe Li, Can Jin, Mingyu Jin, Kai Mei, Kun Wan, Dimitris N. Metaxas

Main category: cs.LG

TL;DR: EPO framework addresses exploration-exploitation cascade failure in multi-turn LLM agents with sparse rewards through entropy regularization, smoothing, and adaptive weighting.

DetailsMotivation: Training LLM agents in multi-turn environments with sparse rewards (30+ turns per task) presents fundamental RL challenges, particularly the exploration-exploitation cascade failure where agents prematurely converge to flawed strategies then collapse into chaotic exploration.

Method: Entropy-regularized Policy Optimization (EPO) framework with three mechanisms: (1) entropy regularization for exploration, (2) entropy smoothing regularizer to bound policy entropy within historical averages, (3) adaptive phase-based weighting to balance exploration and exploitation across training phases.
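
Mechanisms (1) and (2) can be sketched as two extra loss terms: an entropy bonus plus a penalty that keeps the current policy entropy near its running historical average. The coefficients and the exact form of the smoothing term are assumptions, and the paper's adaptive phase-based weighting is omitted.

```python
import torch

def epo_regularizers(logits: torch.Tensor, entropy_history: list,
                     alpha: float = 0.01, beta: float = 0.1) -> torch.Tensor:
    """Entropy bonus plus a smoothing penalty tying current entropy to
    its running historical average (illustrative coefficients).
    logits: (batch, num_actions) from the current policy.
    """
    dist = torch.distributions.Categorical(logits=logits)
    entropy = dist.entropy().mean()
    # (1) standard entropy bonus encourages exploration
    loss = -alpha * entropy
    # (2) smoothing: penalize abrupt departures from the historical mean
    if entropy_history:
        hist_mean = torch.stack(entropy_history).mean()
        loss = loss + beta * (entropy - hist_mean) ** 2
    entropy_history.append(entropy.detach())
    return loss

history = []
for step in range(3):
    logits = torch.randn(32, 8, requires_grad=True)
    reg = epo_regularizers(logits, history)
    reg.backward()  # added to the usual policy-gradient loss in practice
```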

Result: EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. The framework guarantees monotonically decreasing entropy variance while maintaining convergence.

Conclusion: Multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training. EPO successfully breaks the exploration-exploitation cascade failure cycle.

Abstract: Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. This cascade begins with early-stage policy premature convergence, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Subsequently, agents enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis justifies that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.

[449] Nudging the Boundaries of LLM Reasoning

Justin Chih-Yao Chen, Becky Xiangyu Peng, Prafulla Kumar Choubey, Kung-Hsiang Huang, Jiaxin Zhang, Mohit Bansal, Chien-Sheng Wu

Main category: cs.LG

TL;DR: NuRL is a reinforcement learning method that uses self-generated hints to help LLMs learn from previously unsolvable problems, pushing their reasoning upper limits beyond what standard RL can achieve.

DetailsMotivation: Current online RL algorithms for LLMs cannot learn from "unsolvable" problems where models can't explore correct answers, leaving the model's upper limit unchanged. Hard samples produce no rewards or gradients, so they don't contribute to training.

Method: NuRL uses self-generated hints (abstract cues containing core knowledge) to reduce problem difficulty. For hard samples with 0% pass rate, hints are injected and new trajectories are generated, creating training signals for previously unsolvable samples without external models.
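
The pass-rate-gated hint injection reduces to a simple control flow, sketched below with stub functions (`generate`, `grade`, `make_hint`) standing in for the model rollouts, the verifier, and the self-generated hint.

```python
import random

def generate(prompt, n): return [f"rollout_{i}" for i in range(n)]   # stub model call
def grade(rollout, gold): return random.random() < 0.1               # stub verifier
def make_hint(question, gold): return "hint: use the core identity"  # stub self-hint

def nudged_rollouts(question, gold, G=8):
    """Generate G rollouts; if none is correct (0% pass rate), inject a
    self-generated hint and regenerate so the sample yields a gradient."""
    rollouts = generate(question, G)
    rewards = [grade(r, gold) for r in rollouts]
    if not any(rewards):  # unsolvable as-is: no reward, no learning signal
        hinted = f"{question}\n{make_hint(question, gold)}"
        rollouts = generate(hinted, G)
        rewards = [grade(r, gold) for r in rollouts]
    return rollouts, rewards

print(nudged_rollouts("integrate x*sin(x)", gold="sin(x) - x*cos(x)"))
```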

Result: NuRL achieves consistent improvements across 6 benchmarks and 3 models, raising models’ upper limits where GRPO leaves pass@1024 unchanged. Hints boost pass rates from 0% to non-zero for hard samples.

Conclusion: Self-generated hints enable LLMs to learn from previously unsolvable problems, pushing reasoning capabilities beyond standard RL limits. The best hints are abstract and high-level, and are most effective when applied only when necessary and after GRPO has converged.

Abstract: Current online reinforcement learning (RL) algorithms like GRPO share a key limitation in LLM reasoning: they cannot learn from problems that are “unsolvable” to the model. In other words, they can only improve performance on problems where the model is capable of exploring the correct answer. Consequently, the model’s “upper limit” remains unchanged after RL training, even though the likelihood of solving easier, solvable problems may increase. These hard samples cannot contribute to training, as no rollouts yield rewards and thus no gradients are produced. To unlock learning from these hard samples, we propose NuRL, a “nudging” method that aims to push the upper bound of LLM reasoning using self-generated hints, i.e., abstract cues that help reduce the problem difficulty for the model. Given a question and its gold answer, the model generates a CoT and then produces a hint containing the core knowledge needed to solve the problem. During training, we generate G rollouts from the base policy and use the pass rate to decide whether the hint should be injected. For hard samples with a 0% pass rate, we inject the hint and regenerate a new batch of trajectories. This yields two benefits: (1) the hint boosts pass rates (from 0% to non-zero), thereby introducing training signals for previously unsolvable samples, and (2) the hints are self-generated, avoiding distributional shift, and do not rely on external models. NuRL achieves consistent improvements across 6 benchmarks and 3 models, while remaining complementary to test-time scaling. Notably, NuRL can raise the model’s upper limit, whereas GRPO leaves pass@1024 unchanged from the base model. Furthermore, we present a systematic study of what makes an effective hint and when hints are most useful. Interestingly, the best hints are abstract and high-level, and are most beneficial when applied only when necessary and after GRPO has converged.

[450] ConjNorm: Tractable Density Estimation for Out-of-Distribution Detection

Bo Peng, Yadan Luo, Yonggang Zhang, Yixuan Li, Zhen Fang

Main category: cs.LG

TL;DR: A theoretical framework for OOD detection using Bregman divergence and exponential family distributions, with a ConjNorm method that searches for optimal norm coefficients and uses Monte Carlo importance sampling for tractable normalization.

DetailsMotivation: Existing OOD detection methods based on logits, distances, or data distribution assumptions may fail to accurately reflect true data density or impose impractical constraints. There's a need for a unified theoretical perspective on density-based score design.

Method: Proposes a theoretical framework based on Bregman divergence extending to exponential family distributions. Introduces ConjNorm method that reframes density function design as search for optimal norm coefficient p. Uses Monte Carlo-based importance sampling for unbiased, tractable estimation of partition function to address computational challenges.
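
The normalization step can be illustrated in isolation: an importance-sampling estimate of a partition function Z = integral of exp(-E(z)) dz under a Gaussian proposal. The proposal choice and the toy energy are assumptions; the paper applies this idea to its norm-induced densities.

```python
import numpy as np
from scipy import stats

def log_partition_is(energy, d=8, n=100_000, seed=0):
    """Unbiased importance-sampling estimate of Z = int exp(-E(z)) dz
    with a standard Gaussian proposal q (illustrative choice)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, d))
    log_q = stats.multivariate_normal(mean=np.zeros(d)).logpdf(z)
    log_w = -energy(z) - log_q                 # log[exp(-E(z)) / q(z)]
    m = log_w.max()                            # log-mean-exp for stability
    return m + np.log(np.mean(np.exp(log_w - m)))

# Sanity check: E(z) = ||z||^2 / 2 gives Z = (2*pi)^(d/2) exactly.
est = log_partition_is(lambda z: 0.5 * (z ** 2).sum(-1))
print(est, 8 / 2 * np.log(2 * np.pi))          # estimate vs. exact log Z
```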

Result: ConjNorm establishes new state-of-the-art across OOD detection benchmarks, outperforming current best methods by up to 13.25% on CIFAR-100 and 28.19% (FPR95) on ImageNet-1K in various OOD detection setups.

Conclusion: The Bregman divergence framework provides unified perspective for density-based OOD detection, with ConjNorm method demonstrating superior performance through optimal norm coefficient search and tractable normalization estimation.

Abstract: Post-hoc out-of-distribution (OOD) detection has garnered intensive attention in reliable machine learning. Many efforts have been dedicated to deriving score functions based on logits, distances, or rigorous data distribution assumptions to identify low-scoring OOD samples. Nevertheless, these estimated scores may fail to accurately reflect the true data density or impose impractical constraints. To provide a unified perspective on density-based score design, we propose a novel theoretical framework grounded in Bregman divergence, which extends distribution considerations to encompass an exponential family of distributions. Leveraging the conjugation constraint revealed in our theorem, we introduce a ConjNorm method, reframing density function design as a search for the optimal norm coefficient $p$ against the given dataset. In light of the computational challenges of normalization, we devise an unbiased and analytically tractable estimator of the partition function using the Monte Carlo-based importance sampling technique. Extensive experiments across OOD detection benchmarks empirically demonstrate that our proposed ConjNorm has established a new state-of-the-art in a variety of OOD detection setups, outperforming the current best method by up to 13.25% and 28.19% (FPR95) on CIFAR-100 and ImageNet-1K, respectively.

[451] The Condensate Theorem: Transformers are $O(n)$, Not $O(n^2)$

Jorge L. Ruiz Williams

Main category: cs.LG

TL;DR: The Condensate Theorem shows attention sparsity is a learned topological property, not architectural constraint, enabling lossless attention compression with massive speedups.

DetailsMotivation: Current attention mechanisms have O(n²) computational complexity, creating bottlenecks for long sequences. The authors challenge the assumption that this quadratic cost is inherent to intelligence, proposing it's an implementation artifact.

Method: Empirical analysis of trained language models reveals attention mass concentrates on a topological manifold. The Condensate Theorem proves projecting attention onto this manifold (Anchor + Window + Dynamic Top-k) achieves 100% output equivalence with full attention. A Topological Attention kernel maps this topology to hardware.
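
The Anchor + Window + Dynamic Top-k projection can be sketched as a boolean mask over key positions for a single query. The sizes used here are illustrative, and the real kernel fuses this selection with the attention computation on GPU.

```python
import torch

def condensate_mask(scores: torch.Tensor, n_anchor=4, window=64, k=32):
    """Build a sparse attention mask from one query's scores over n keys:
    keep the first n_anchor positions (anchors/sinks), the last `window`
    positions (local context), and the k highest-scoring remaining keys.
    scores: (n,) pre-softmax attention scores for a single query.
    """
    n = scores.shape[0]
    mask = torch.zeros(n, dtype=torch.bool)
    mask[:n_anchor] = True          # anchor tokens
    mask[-window:] = True           # sliding window
    rest = scores.masked_fill(mask, float("-inf"))
    topk = rest.topk(min(k, n)).indices
    mask[topk] = True               # dynamic top-k
    return mask

scores = torch.randn(131_072)       # scores over a 131K-token context
mask = condensate_mask(scores)
print(mask.sum().item(), "of", scores.numel(), "keys attended")
```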

Result: Validated across GPT-2, Pythia, Qwen2, TinyLlama, and Mistral with bit-exact token matching on 1,500+ generated tokens. Achieved 159x measured speedup at 131K tokens (3.94ms vs 628ms) and projected >1,200x speedup at 1M tokens, reducing inference costs by >99.9% compared to Flash Attention.

Conclusion: The quadratic attention bottleneck is an artifact of naive implementation, not intelligence. Attention sparsity emerges as a learned topological property that can be exploited for massive efficiency gains without sacrificing accuracy.

Abstract: We present the Condensate Theorem: attention sparsity is a learned topological property, not an architectural constraint. Through empirical analysis of trained language models, we find that attention mass concentrates on a distinct topological manifold, and this manifold can be identified dynamically without checking every position. We prove a general result: for any query, projecting attention onto the Condensate Manifold (Anchor + Window + Dynamic Top-k) achieves 100% output equivalence with full $O(n^2)$ attention. This is not an approximation: it is lossless parity. We validate this across GPT-2, Pythia, Qwen2, TinyLlama, and Mistral, demonstrating bit-exact token matching on 1,500+ generated tokens. By mapping this topology to hardware, our Topological Attention kernel achieves a 159x measured speedup at 131K tokens (3.94ms vs 628ms) and a projected >1,200x speedup at 1M tokens, reducing inference costs by >99.9% compared to Flash Attention. We conclude that the quadratic bottleneck is an artifact of naive implementation, not intelligence.

[452] On-Policy Policy Gradient Reinforcement Learning Without On-Policy Sampling

Nicholas E. Corrado, Josiah P. Hanna

Main category: cs.LG

TL;DR: PROPS is an adaptive off-policy sampling method that reduces sampling error in on-policy RL by using a behavior policy that increases probability of sampling under-sampled actions relative to the current policy.

DetailsMotivation: On-policy RL algorithms suffer from high-variance gradient estimates due to sampling error when collecting finite trajectories, leading to data-inefficient learning. Recent work showed off-policy sampling can produce data with lower sampling error than on-policy sampling for policy evaluation.

Method: PROPS (Proximal Robust On-Policy Sampling) uses an adaptive off-policy sampling approach where a behavior policy collects data by increasing the probability of sampling actions that are under-sampled relative to the current policy, thereby reducing sampling error.
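
For a discrete action space, the core correction can be sketched as tilting the current policy toward actions whose empirical frequency lags their target probability. The tilt rule and `strength` parameter are assumptions; PROPS itself learns its behavior policy with a proximal objective.

```python
import numpy as np

def corrected_sampling_probs(pi: np.ndarray, counts: np.ndarray,
                             strength: float = 1.0) -> np.ndarray:
    """Tilt the current policy pi toward actions that are under-sampled
    relative to it. counts: how often each action was taken so far.
    """
    total = counts.sum()
    empirical = counts / total if total > 0 else np.full_like(pi, 1 / len(pi))
    deficit = pi - empirical            # >0 where an action is under-sampled
    logits = np.log(pi + 1e-12) + strength * deficit
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

pi = np.array([0.5, 0.3, 0.2])          # current policy
counts = np.array([9, 1, 2])            # action 1 is under-sampled so far
print(corrected_sampling_probs(pi, counts))  # mass shifts toward action 1
```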

Result: Empirical evaluation on continuous-action MuJoCo benchmark tasks and discrete-action tasks shows that PROPS decreases sampling error throughout training and increases data efficiency of on-policy policy gradient algorithms.

Conclusion: Adaptive off-policy sampling can effectively reduce sampling error and improve data efficiency in on-policy RL training, addressing a fundamental limitation of traditional on-policy approaches.

Abstract: On-policy reinforcement learning (RL) algorithms are typically characterized as algorithms that perform policy updates using i.i.d. trajectories collected by the agent’s current policy. However, after observing only a finite number of trajectories, such on-policy sampling may produce data that fails to match the expected on-policy data distribution. This sampling error leads to high-variance gradient estimates that yield data-inefficient on-policy learning. Recent work in the policy evaluation setting has shown that non-i.i.d., off-policy sampling can produce data with lower sampling error w.r.t. the expected on-policy distribution than on-policy sampling can produce (Zhong et al., 2022). Motivated by this observation, we introduce an adaptive, off-policy sampling method to reduce sampling error during on-policy policy gradient RL training. Our method, Proximal Robust On-Policy Sampling (PROPS), reduces sampling error by collecting data with a behavior policy that increases the probability of sampling actions that are under-sampled w.r.t. the current policy. We empirically evaluate PROPS on continuous-action MuJoCo benchmark tasks as well as discrete-action tasks and demonstrate that (1) PROPS decreases sampling error throughout training and (2) increases the data efficiency of on-policy policy gradient algorithms.

[453] General Binding Affinity Guidance for Diffusion Models in Structure-Based Drug Design

Yue Jian, Curtis Wu, Danny Reidenbach, Aditi S. Krishnapriyan

Main category: cs.LG

TL;DR: BADGER is a binding-affinity guidance framework for diffusion models in structure-based drug design that improves ligand-protein binding affinity through classifier and classifier-free guidance approaches.

DetailsMotivation: Current diffusion models for structure-based drug design often underemphasize binding affinity control during ligand generation, limiting their effectiveness in producing molecules with strong and specific binding to target proteins.

Method: BADGER incorporates two complementary strategies: (1) classifier guidance using gradient-based affinity signals during sampling, and (2) classifier-free guidance integrating affinity conditioning directly into diffusion model training. The framework also supports multi-constraint optimization for binding affinity, drug-likeness, and synthetic accessibility.
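
Strategy (1) follows the familiar classifier-guidance recipe: differentiate an affinity predictor with respect to the noisy ligand coordinates and nudge each denoising step along that gradient. A minimal sketch with toy stand-in networks; the update rule and guidance scale are illustrative, not the paper's exact sampler.

```python
import torch

def affinity_guided_step(x_t, t, denoiser, affinity_net, guidance_scale=2.0):
    """One reverse-diffusion step with classifier guidance: nudge atom
    coordinates x_t along the gradient of a differentiable binding-affinity
    predictor (illustrative update rule).
    """
    x_t = x_t.detach().requires_grad_(True)
    affinity = affinity_net(x_t, t).sum()           # predicted affinity
    grad = torch.autograd.grad(affinity, x_t)[0]    # direction of stronger binding
    with torch.no_grad():
        x_prev = denoiser(x_t, t)                   # unguided denoising step
        return x_prev + guidance_scale * grad       # affinity-guided correction

# Toy stand-ins for the real networks.
denoiser = lambda x, t: 0.95 * x
affinity_net = lambda x, t: -(x ** 2).sum(-1)       # peak affinity at the origin
x = torch.randn(16, 3)                              # 16 atoms in 3D
print(affinity_guided_step(x, 0.5, denoiser, affinity_net).shape)
```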

Result: BADGER achieves up to 60% improvement in ligand-protein binding affinity over prior methods and enables joint optimization of multiple drug design constraints.

Conclusion: BADGER provides a general framework for binding-affinity-aware diffusion models in drug design that can be added to any diffusion model and significantly improves binding affinity while enabling multi-constraint optimization.

Abstract: Structure-based drug design (SBDD) aims to generate ligands that bind strongly and specifically to target protein pockets. Recent diffusion models have advanced SBDD by capturing the distributions of atomic positions and types, yet they often underemphasize binding affinity control during generation. To address this limitation, we introduce BADGER, a general binding-affinity guidance framework for diffusion models in SBDD. BADGER incorporates binding affinity awareness through two complementary strategies: (1) classifier guidance, which applies gradient-based affinity signals during sampling in a plug-and-play fashion, and (2) classifier-free guidance, which integrates affinity conditioning directly into diffusion model training. Together, these approaches enable controllable ligand generation guided by binding affinity. BADGER can be added to any diffusion model and achieves up to a 60% improvement in ligand–protein binding affinity of sampled molecules over prior methods. Furthermore, we extend the framework to multi-constraint diffusion guidance, jointly optimizing for binding affinity, drug-likeness (QED), and synthetic accessibility (SA) to design realistic and synthesizable drug candidates.

[454] Data-efficient and Interpretable Inverse Materials Design using a Disentangled Variational Autoencoder

Cheng Zeng, Zulqarnain Khan, Nathan L. Post

Main category: cs.LG

TL;DR: A semi-supervised disentangled variational autoencoder approach for inverse materials design that separates target properties from other material characteristics to reduce ambiguity in the design process.

DetailsMotivation: Current inverse materials design methods using unsupervised learning often create entangled latent spaces where target properties are mixed with other material characteristics, making the inverse design process ambiguous and less interpretable.

Method: Uses a semi-supervised learning approach based on a disentangled variational autoencoder to learn probabilistic relationships between features, latent variables, and target properties. Combines labelled and unlabelled data efficiently and uses expert-informed prior distributions for robustness with limited labelled data.

Result: Demonstrated on an experimental high-entropy alloy dataset with chemical compositions as input and single-phase formation as target property. The approach successfully disentangles target properties from other material characteristics, providing interpretable latent representations.

Conclusion: The disentangled model enables more interpretable and less ambiguous inverse materials design, and can be extended to handle multiple target properties for customized materials design applications.

Abstract: Inverse materials design has proven successful in accelerating novel material discovery. Many inverse materials design methods use unsupervised learning, where a latent space is learned to offer a compact description of materials representations. A latent space learned this way is likely to be entangled in terms of the target property and other properties of the materials. This makes the inverse design process ambiguous. Here, we present a semi-supervised learning approach based on a disentangled variational autoencoder to learn a probabilistic relationship between features, latent variables and target properties. This approach is data efficient because it combines all labelled and unlabelled data in a coherent manner, and it uses expert-informed prior distributions to improve model robustness even with limited labelled data. It is in essence interpretable, as the learnable target property is disentangled from the other properties of the materials, and an extra layer of interpretability can be provided by a post-hoc analysis of the classification head of the model. We demonstrate this new approach on an experimental high-entropy alloy dataset with chemical compositions as input and single-phase formation as the single target property. High-entropy alloys were chosen as example materials because of the vast chemical space of their possible combinations of compositions and atomic configurations. While a single property is used in this work, the disentangled model can be extended to the inverse design of materials with multiple target properties.

[455] BiSSL: Enhancing the Alignment Between Self-Supervised Pretraining and Downstream Fine-Tuning via Bilevel Optimization

Gustav Wagner Zakarias, Lars Kai Hansen, Zheng-Hua Tan

Main category: cs.LG

TL;DR: BiSSL is a bilevel training framework that improves alignment between self-supervised pretrained models and downstream tasks before fine-tuning, enhancing performance on image classification and object detection.

DetailsMotivation: Self-supervised pretrained models often have poor alignment with downstream tasks, limiting fine-tuning effectiveness. The authors aim to bridge the gap between pretraining and fine-tuning stages.

Method: BiSSL introduces an intermediate bilevel optimization stage after self-supervised pretraining. The lower-level objective handles pretext tasks, while the upper-level objective incorporates downstream tasks, explicitly modeling interdependence between pretraining and fine-tuning.
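
A drastically simplified, first-order sketch of the alternating structure: the lower level optimizes a pretext objective (here, toy reconstruction), while the upper level periodically steers the shared backbone with the downstream loss. The schedule and losses are assumptions; the paper's algorithm treats the bilevel coupling more carefully.

```python
import torch
from torch import nn

# Toy stand-ins: a shared backbone plus pretext and downstream heads.
backbone = nn.Linear(32, 16)
pretext_head, downstream_head = nn.Linear(16, 32), nn.Linear(16, 4)
lower_opt = torch.optim.SGD(
    list(backbone.parameters()) + list(pretext_head.parameters()), lr=1e-2)
upper_opt = torch.optim.SGD(
    list(backbone.parameters()) + list(downstream_head.parameters()), lr=1e-2)

for step in range(100):
    x = torch.randn(64, 32)
    # Lower level: self-supervised pretext objective (toy reconstruction).
    lower_loss = ((pretext_head(backbone(x)) - x) ** 2).mean()
    lower_opt.zero_grad(); lower_loss.backward(); lower_opt.step()
    # Upper level: downstream objective steers the same backbone.
    if step % 5 == 0:  # coarser alternation schedule (assumed)
        y = torch.randint(0, 4, (64,))
        upper_loss = nn.functional.cross_entropy(downstream_head(backbone(x)), y)
        upper_opt.zero_grad(); upper_loss.backward(); upper_opt.step()
```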

Result: BiSSL significantly improves accuracy on 12 downstream image classification datasets and object detection tasks using SimCLR and BYOL pretrained ResNet-50 backbones on ImageNet.

Conclusion: The bilevel framework enhances downstream alignment by facilitating information sharing between pretraining and fine-tuning, leading to better model initialization for downstream tasks.

Abstract: Models initialized from self-supervised pretraining may suffer from poor alignment with downstream tasks, reducing the extent to which subsequent fine-tuning can adapt pretrained features toward downstream objectives. To mitigate this, we introduce BiSSL, a novel bilevel training framework that enhances the alignment of self-supervised pretrained models with downstream tasks prior to fine-tuning. BiSSL acts as an intermediate training stage conducted after conventional self-supervised pretraining and is tasked with solving a bilevel optimization problem that incorporates the pretext and downstream training objectives in its lower- and upper-level objectives, respectively. This approach explicitly models the interdependence between the pretraining and fine-tuning stages within the conventional self-supervised learning pipeline, facilitating enhanced information sharing between them that ultimately leads to a model initialization better aligned with the downstream task. We propose a general training algorithm for BiSSL that is compatible with a broad range of pretext and downstream tasks. Using SimCLR and Bootstrap Your Own Latent to pretrain ResNet-50 backbones on the ImageNet dataset, we demonstrate that our proposed framework significantly improves accuracy on the vast majority of 12 downstream image classification datasets, as well as on object detection. Exploratory analyses alongside investigative experiments further provide compelling evidence that BiSSL enhances downstream alignment.

[456] ParisKV: Fast and Drift-Robust KV-Cache Retrieval for Long-Context LLMs

Yanlin Qi, Xinhang Chen, Huiqiang Jiang, Qitong Wang, Botao Peng, Themis Palpanas

Main category: cs.LG

TL;DR: ParisKV is a GPU-native KV-cache retrieval framework for long-context LLM inference that addresses distribution drift and latency issues through collision-based candidate selection and quantized inner-product reranking.

DetailsMotivation: Existing KV-cache retrieval methods struggle with distribution drift and high latency at scale, especially for million-token contexts, making efficient long-context LLM inference challenging.

Method: ParisKV uses collision-based candidate selection followed by quantized inner-product reranking estimator. It supports CPU-offloaded KV caches via Unified Virtual Addressing (UVA) for on-demand top-k fetching with minimal overhead.

Result: ParisKV matches or outperforms full attention quality on long-input and long-generation benchmarks. It achieves state-of-the-art decoding efficiency: it matches or exceeds full attention speed at batch size 1, delivers up to 2.8× higher throughput, and scales to million-token contexts where full attention runs out of memory. At million-token scale, it reduces decode latency by 17× vs MagicPIG and 44× vs PQCache.

Conclusion: ParisKV provides a drift-robust, GPU-native solution for efficient KV-cache retrieval that enables practical million-token context LLM inference with significant performance improvements over existing methods.

Abstract: KV-cache retrieval is essential for long-context LLM inference, yet existing methods struggle with distribution drift and high latency at scale. We introduce ParisKV, a drift-robust, GPU-native KV-cache retrieval framework based on collision-based candidate selection, followed by a quantized inner-product reranking estimator. For million-token contexts, ParisKV supports CPU-offloaded KV caches via Unified Virtual Addressing (UVA), enabling on-demand top-$k$ fetching with minimal overhead. ParisKV matches or outperforms full attention quality on long-input and long-generation benchmarks. It achieves state-of-the-art long-context decoding efficiency: it matches or exceeds full attention speed even at batch size 1 for long contexts, delivers up to 2.8$\times$ higher throughput within full attention’s runnable range, and scales to million-token contexts where full attention runs out of memory. At million-token scale, ParisKV reduces decode latency by 17$\times$ and 44$\times$ compared to MagicPIG and PQCache, respectively, two state-of-the-art KV-cache Top-$k$ retrieval baselines.

[457] Influence of Recommender Systems on Users: A Dynamical Systems Analysis

Prabhat Lankireddy, Jayakrishnan Nair, D Manjunath

Main category: cs.LG

TL;DR: Analysis of how recommender systems affect user preferences over time, showing that algorithms that exploit more can create filter bubbles by polarizing user preferences.

DetailsMotivation: To understand the unintended effects of recommender systems on user preferences, particularly how the mismatch between static model assumptions and evolving user preferences affected by recommendations leads to filter bubbles and polarization.

Method: Introduces a model for coupled evolution of linear bandit recommendation systems and users whose preferences shift toward recommendations. Uses stochastic approximation theory to derive a dynamical system that asymptotically approximates the mean behavior of the stochastic model.

Result: Under certain conditions, the recommender system can learn population preferences despite model mismatch. The exploration-exploitation tradeoff significantly affects long-term user preferences: algorithms that exploit more can polarize preferences, creating filter bubbles.

Conclusion: Recommender systems have significant unintended effects on user preferences over time, with exploitation-heavy algorithms leading to polarization and filter bubbles. The paper provides analytical tools to understand these dynamics.

Abstract: We analyze the unintended effects that recommender systems have on the preferences of users that they are learning. We consider a contextual multi-armed bandit recommendation algorithm that learns optimal product recommendations based on user and product attributes. It is well known that the sequence of recommendations affects user preferences. However, typical learning algorithms treat the user attributes as static and disregard the impact of their recommendations on user preferences. Our interest is to analyze the effect of this mismatch between the model assumption of a static environment and the reality of an evolving environment affected by the recommendations. To perform this analysis, we introduce a model for the coupled evolution of a linear bandit recommendation system and its users, whose preferences are drawn towards the recommendations made by the algorithm. We describe a method, grounded in stochastic approximation theory, to derive a dynamical system model that asymptotically approximates the mean behavior of the stochastic model. The resulting dynamical system captures the coupled evolution of the population preferences and the learning algorithm. Analyzing this dynamical system gives insight into the long-term properties of user preferences and the learning algorithm. Under certain conditions, we show that the recommender system is able to learn the population preferences in spite of the model mismatch. We also discuss and characterize the relation between various parameters of the model and the long-term preferences of users. A key observation is that the exploration-exploitation tradeoff used by the recommendation algorithm significantly affects the long-term preferences of users. Algorithms that exploit more can polarize user preferences, leading to the well-known filter bubble phenomenon.

[458] Exploring the Impact of Dataset Statistical Effect Size on Model Performance and Data Sample Size Sufficiency

Arya Hatamian, Lionel Levine, Haniyeh Ehsani Oskouie, Majid Sarrafzadeh

Main category: cs.LG

TL;DR: Statistical measures like effect size don’t reliably predict model performance or sample size adequacy for machine learning datasets.

DetailsMotivation: The paper addresses the critical need to assess data sufficiency before training ML models, as current methods can't prospectively determine if a dataset is adequate for effective model training.

Method: Two experiments exploring whether basic descriptive statistical measures (specifically effect size of features) correlate with: 1) resulting model performance, and 2) learning rate convergence speed and required sample size.

Result: Effect size is not an effective heuristic for determining adequate sample size or projecting model performance. No reliable correlation found between effect size and classifier success or convergence rate.

Conclusion: Additional work is still needed to develop methods for prospectively assessing data adequacy, as simple statistical measures like effect size don’t provide reliable predictive capability.

Abstract: Having a sufficient quantity of quality data is a critical enabler of training effective machine learning models. Being able to effectively determine the adequacy of a dataset prior to training and evaluating a model’s performance would be an essential tool for anyone engaged in experimental design or data collection. However, despite the need for it, the ability to prospectively assess data sufficiency remains an elusive capability. We report here on two experiments undertaken in an attempt to better ascertain whether or not basic descriptive statistical measures can be indicative of how effective a dataset will be at training a resulting model. Leveraging the effect size of our features, this work first explores whether or not a correlation exists between effect size and resulting model performance (theorizing that the magnitude of the distinction between classes could correlate to a classifier’s resulting success). We then explore whether or not the magnitude of the effect size will impact the rate of convergence of our learning rate (theorizing again that a greater effect size may indicate that the model will converge more rapidly, and with a smaller sample size needed). Our results appear to indicate that this is not an effective heuristic for determining adequate sample size or projecting model performance, and therefore that additional work is still needed to better prospectively assess adequacy of data.

[459] Symmetry-Guided Memory Augmentation for Efficient Locomotion Learning

Kaixi Bao, Chenhao Li, Yarden As, Andreas Krause, Marco Hutter

Main category: cs.LG

TL;DR: SGMA improves RL training efficiency for legged robots using symmetry-based experience augmentation and memory state transformations.

DetailsMotivation: Training RL policies for legged locomotion requires extensive environment interactions which are costly and time-consuming, creating a need for more data-efficient approaches.

Method: Symmetry-Guided Memory Augmentation (SGMA) combines structured experience augmentation (leveraging robot and task symmetries) with memory-based context inference, extending transformations to policy memory states to retain task-relevant context.
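
The key extension over naive symmetry augmentation is transforming the policy's memory state alongside the observation and action. A minimal sketch for a left-right mirror symmetry; the permutation indices and sign patterns below are placeholders that would come from the robot's actual kinematic symmetry.

```python
import numpy as np

def mirror_transition(obs, action, hidden, obs_perm, act_perm, hid_perm, signs):
    """Left-right mirror a transition for a symmetric legged robot:
    permute (and sign-flip) the observation, the action, and, crucially,
    the policy's recurrent hidden state, so the augmented experience
    carries a consistent task context.
    """
    return (signs["obs"] * obs[obs_perm],
            signs["act"] * action[act_perm],
            signs["hid"] * hidden[hid_perm])

# Toy 4-dim spaces: mirroring swaps left/right pairs and flips lateral axes.
perm = np.array([1, 0, 3, 2])
signs = {"obs": np.array([1, 1, -1, -1]),
         "act": np.array([1, 1, -1, -1]),
         "hid": np.ones(4)}
obs, act, hid = np.arange(4.0), np.arange(4.0), np.arange(4.0)
print(mirror_transition(obs, act, hid, perm, perm, perm, signs))
```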

Result: The method achieves efficient policy training while maintaining robust performance across diverse locomotion tasks involving joint failures and payload variations, demonstrated on quadruped and humanoid robots in simulation and real quadruped platform.

Conclusion: SGMA provides a practical route toward data-efficient RL for legged robots by leveraging symmetries and memory augmentation without requiring extra environment interactions.

Abstract: Training reinforcement learning (RL) policies for legged locomotion often requires extensive environment interactions, which are costly and time-consuming. We propose Symmetry-Guided Memory Augmentation (SGMA), a framework that improves training efficiency by combining structured experience augmentation with memory-based context inference. Our method leverages robot and task symmetries to generate additional, physically consistent training experiences without requiring extra interactions. To avoid the pitfalls of naive augmentation, we extend these transformations to the policy’s memory states, enabling the agent to retain task-relevant context and adapt its behavior accordingly. We evaluate the approach on quadruped and humanoid robots in simulation, as well as on a real quadruped platform. Across diverse locomotion tasks involving joint failures and payload variations, our method achieves efficient policy training while maintaining robust performance, demonstrating a practical route toward data-efficient RL for legged robots.

[460] Deep Meta Coordination Graphs for Multi-agent Reinforcement Learning

Nikunj Gupta, James Zachary Hare, Jesse Milzman, Rajgopal Kannan, Viktor Prasanna

Main category: cs.LG

TL;DR: DMCG learns dynamic coordination graphs for multi-agent reinforcement learning, using graph neural networks to represent agent interactions and improve coordination in cooperative tasks.

DetailsMotivation: Traditional coordination graphs in MARL often use fixed or simple structures that may not capture complex agent interactions. There's a need for more expressive, learnable coordination representations that can adapt to different tasks and improve coordination efficiency.

Method: Proposes deep meta coordination graphs (DMCG) that dynamically compose meta coordination graphs to represent agent interactions. Uses graph convolutional networks to integrate agent information and jointly optimizes graphs with agents’ value functions for end-to-end learning of interaction representations and coordinated policies.

Result: DMCG achieves state-of-the-art coordination performance and sample efficiency on challenging cooperative tasks, outperforming both graph-based and non-graph-based MARL baselines. Ablation studies confirm the importance of design choices.

Conclusion: Dynamic, learnable coordination graphs through DMCG enable more expressive representation of agent interactions, leading to improved coordination and sample efficiency in cooperative MARL tasks.

Abstract: This paper presents deep meta coordination graphs (DMCG) for learning cooperative policies in multi-agent reinforcement learning (MARL). Coordination graph formulations encode local interactions and accordingly factorize the joint value function of all agents to improve efficiency in MARL. Through DMCG, we dynamically compose what we refer to as meta coordination graphs to learn a more expressive representation of agent interactions and use them to integrate agent information through graph convolutional networks. The goal is to enable an evolving coordination graph to guide effective coordination in cooperative MARL tasks. The graphs are jointly optimized with agents’ value functions to learn to implicitly reason about joint actions, facilitating the end-to-end learning of interaction representations and coordinated policies. We demonstrate that DMCG consistently achieves state-of-the-art coordination performance and sample efficiency on challenging cooperative tasks, outperforming several prior graph-based and non-graph-based MARL baselines. Through several ablations, we also isolate the impact of individual components in DMCG, showing that the observed improvements are due to the meaningful design choices in this approach. We also include an analysis of its computational complexity to discuss its practicality in real-world applications. All code can be found here: https://github.com/Nikunj-Gupta/dmcg-marl

[461] A Survey on Active Feature Acquisition Strategies

Linus Aronsson, Arman Rahbar, Morteza Haghir Chehreghani

Main category: cs.LG

TL;DR: A survey paper presenting a unified POMDP formulation for Active Feature Acquisition (AFA), connecting it to structured POMDP literature and providing taxonomy of methods with connections to POMDP planning approaches.

DetailsMotivation: To provide a unified theoretical framework for Active Feature Acquisition (AFA) by connecting it to partially observable Markov decision processes (POMDPs), enabling better understanding of existing methods and more principled algorithm design.

Method: Proposes an explicit POMDP formulation for AFA, connects it to structured POMDP literature (information-gathering and sensing POMDPs), and develops a taxonomy of AFA methods mirroring standard POMDP approaches: embedded cost-aware predictors, model-based methods, model-free methods, and hybrid approaches.

Result: Provides a unified perspective on AFA, clarifies connections among existing methods, establishes connections to adaptive stochastic optimization for formal guarantees, and offers a comprehensive taxonomy for comparing problem settings and methods.

Conclusion: The POMDP-centric view offers a principled foundation for AFA research, enables leveraging established POMDP results, and highlights directions for future work including formal guarantees and hybrid approaches.

Abstract: Active feature acquisition (AFA) studies how to sequentially acquire features for each data instance to trade off predictive performance against acquisition cost. This survey offers the first unified treatment of AFA via an explicit partially observable Markov decision process (POMDP) formulation. We place this formulation in the broader literature on optimal information acquisition and, more specifically, in a family of structured POMDPs (for example, information-gathering and sensing POMDPs) whose assumptions and algorithmic tools directly apply to AFA. This connection provides a common language for comparing problem settings and methods, and it highlights where AFA can leverage established results in structured POMDP planning and approximation. Building on this perspective, we present an up-to-date taxonomy of AFA methods that (roughly) mirrors standard approaches to solving POMDPs: (i) embedded cost-aware predictors (notably cost-sensitive decision trees and ensembles), (ii) model-based methods that plan using learned probabilistic components, (iii) model-free methods that learn acquisition policies from simulated episodes, and (iv) hybrid methods that combine the strengths of model-based and model-free approaches. We argue that this POMDP-centric view clarifies connections among existing methods and motivates more principled algorithm design. Since much prior work is heuristic and lacks formal guarantees, we also outline routes to guarantees by connecting AFA to adaptive stochastic optimization. We conclude by highlighting open challenges and promising directions for future research.

[462] Plasticine: Accelerating Research in Plasticity-Motivated Deep Reinforcement Learning

Mingqi Yuan, Qi Wang, Guozheng Ma, Caihao Sun, Bo Li, Xin Jin, Yunbo Wang, Xiaokang Yang, Wenjun Zeng, Dacheng Tao, Jiayu Chen

Main category: cs.LG

TL;DR: Plasticine is an open-source framework for benchmarking plasticity optimization in deep reinforcement learning, addressing the problem of neural networks losing adaptation ability during training.

DetailsMotivation: Deep RL systems suffer from plasticity loss where neural networks gradually lose adaptation ability during training, but the field lacks unified benchmarks and evaluation protocols for studying this issue.

Method: Developed Plasticine framework with single-file implementations of over 13 mitigation methods, 6 evaluation metrics, and learning scenarios with increasing non-stationarity levels from standard to continually varying environments.

Result: Created the first open-source framework for systematically quantifying plasticity loss, evaluating mitigation strategies, and analyzing plasticity dynamics across different contexts.

Conclusion: Plasticine provides a standardized platform for researchers to study and address plasticity loss in deep RL, advancing lifelong learning capabilities crucial for AGI development.

Abstract: Developing lifelong learning agents is crucial for artificial general intelligence (AGI). However, deep reinforcement learning (RL) systems often suffer from plasticity loss, where neural networks gradually lose their ability to adapt during training. Despite its significance, this field lacks unified benchmarks and evaluation protocols. We introduce Plasticine, the first open-source framework for benchmarking plasticity optimization in deep RL. Plasticine provides single-file implementations of over 13 mitigation methods, 6 evaluation metrics, and learning scenarios with increasing non-stationarity levels from standard to continually varying environments. This framework enables researchers to systematically quantify plasticity loss, evaluate mitigation strategies, and analyze plasticity dynamics across different contexts. Our documentation, examples, and source code are available at https://github.com/RLE-Foundation/Plasticine.

[463] TabNSA: Native Sparse Attention for Efficient Tabular Data Learning

Ali Eslamian, Qiang Cheng

Main category: cs.LG

TL;DR: TabNSA: A deep learning framework combining Native Sparse Attention with TabMixer backbone for efficient tabular data modeling, enhanced with LLM integration for few-shot learning.

DetailsMotivation: Tabular data presents unique challenges for deep learning due to heterogeneous feature types, lack of spatial structure, and often limited sample sizes. Existing methods struggle with computational efficiency and representation learning for tabular data.

Method: TabNSA integrates Native Sparse Attention (NSA) with TabMixer backbone. NSA uses hierarchical sparse attention with token compression, selective preservation, and sliding windows to reduce quadratic complexity. TabMixer captures non-linear dependencies through parallel MLP branches. The modules combine via element-wise summation and mean pooling. The framework can be augmented with fine-tuned LLMs for few-shot learning.

Result: Extensive experiments across supervised and transfer learning settings show TabNSA consistently outperforms state-of-the-art deep learning models. The LLM-augmented version effectively addresses few-shot learning challenges through language-guided generalization on diverse tabular benchmarks.

Conclusion: TabNSA provides an efficient and effective framework for tabular data modeling that addresses computational challenges while capturing complex dependencies, with LLM integration enabling strong few-shot learning performance.

Abstract: Tabular data poses unique challenges for deep learning due to its heterogeneous feature types, lack of spatial structure, and often limited sample sizes. We propose TabNSA, a novel deep learning framework that integrates Native Sparse Attention (NSA) with a TabMixer backbone to efficiently model tabular data. TabNSA tackles computational and representational challenges by dynamically focusing on relevant feature subsets per instance. The NSA module employs a hierarchical sparse attention mechanism, including token compression, selective preservation, and localized sliding windows, to significantly reduce the quadratic complexity of standard attention operations while addressing feature heterogeneity. Complementing this, the TabMixer backbone captures complex, non-linear dependencies through parallel multilayer perceptron (MLP) branches with independent parameters. These modules are synergistically combined via element-wise summation and mean pooling, enabling TabNSA to model both global context and fine-grained interactions. Extensive experiments across supervised and transfer learning settings show that TabNSA consistently outperforms state-of-the-art deep learning models. Furthermore, by augmenting TabNSA with a fine-tuned large language model (LLM), we enable it to effectively address Few-Shot Learning challenges through language-guided generalization on diverse tabular benchmarks. Code available at: https://github.com/aseslamian/TabNSA

[464] Layer-wise Update Aggregation with Recycling for Communication-Efficient Federated Learning

Jisoo Kim, Sungmin Kang, Sunwoo Lee

Main category: cs.LG

TL;DR: FedLUAR reduces FL communication costs by recycling previous layer updates instead of discarding them, achieving similar accuracy at only 17% of the communication cost.

DetailsMotivation: Communication cost is a major bottleneck in Federated Learning, and existing methods that drop updates based on gradient magnitude are inefficient. The authors propose that recycling previous updates could be more effective than simply discarding them.

Method: FedLUAR uses a layer-wise update aggregation with recycling scheme. It defines a metric to quantify how aggregated gradients influence model parameters in each layer, selects layers based on this metric, and recycles their previous updates on the server side.
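
A minimal sketch of the selection step, using a relative-norm stand-in for the paper's influence metric: layers whose aggregated updates barely move the parameters are flagged for recycling, so clients skip transmitting them and the server reuses their previous updates.

```python
import torch

def select_recycled_layers(updates: dict, params: dict, n_recycle: int):
    """Rank layers by how much their aggregated update moves the
    parameters (illustrative metric: update norm relative to parameter
    norm) and mark the least influential layers for recycling.
    """
    influence = {name: (updates[name].norm() / (params[name].norm() + 1e-12)).item()
                 for name in updates}
    ranked = sorted(influence, key=influence.get)   # least influential first
    return set(ranked[:n_recycle])

params = {f"layer{i}": torch.randn(64, 64) for i in range(6)}
updates = {n: 0.01 * torch.randn(64, 64) * (i + 1)
           for i, (n, p) in enumerate(params.items())}
recycle = select_recycled_layers(updates, params, n_recycle=2)
print("recycled (not transmitted):", recycle)   # server reuses their last updates
```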

Result: The method significantly reduces communication cost while maintaining model accuracy. For example, it achieves nearly the same AG News accuracy as FedAvg while reducing communication cost to just 17%.

Conclusion: Update recycling is an effective approach for communication-efficient Federated Learning that outperforms simple update dropping methods.

Abstract: Expensive communication cost is a common performance bottleneck in Federated Learning (FL), which makes it less appealing in real-world applications. Many communication-efficient FL methods focus on discarding a part of model updates mostly based on gradient magnitude. In this study, we find that recycling previous updates, rather than simply dropping them, more effectively reduces the communication cost while maintaining FL performance. We propose FedLUAR, a Layer-wise Update Aggregation with Recycling scheme for communication-efficient FL. We first define a useful metric that quantifies the extent to which the aggregated gradients influence the model parameter values in each layer. FedLUAR selects a few layers based on the metric and recycles their previous updates on the server side. Our extensive empirical study demonstrates that the update recycling scheme significantly reduces the communication cost while maintaining model accuracy. For example, our method achieves nearly the same AG News accuracy as FedAvg, while reducing the communication cost to just 17%.

[465] Faithful Group Shapley Value

Kiljae Lee, Ziqi Liu, Weijing Tang, Yuan Zhang

Main category: cs.LG

TL;DR: FGSV is a new group-level data valuation method that prevents strategic manipulation through shell company attacks while maintaining computational efficiency.

DetailsMotivation: Existing group-level Data Shapley extensions are vulnerable to strategic manipulation where groups can split into "shell companies" to artificially inflate their valuations, compromising fairness in data valuation.

Method: Proposes Faithful Group Shapley Value (FGSV) with mathematical guarantees against shell company attacks, and develops a provably fast and accurate approximation algorithm for computing FGSV.
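
For background, the classical Shapley value that group-level extensions build on is shown below, together with the failure mode the paper defends against, written schematically as a profitable group split (the group-valuation notation $\Phi$ is illustrative, not the paper's definition):

```latex
% Classical Shapley value of player i under utility v on player set N, |N| = n:
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,(n - |S| - 1)!}{n!}\,
  \bigl[\, v(S \cup \{i\}) - v(S) \,\bigr]

% A "shell company" split of group G into disjoint shells G_1, G_2 is
% profitable under a group-level valuation \Phi whenever
\Phi_{G_1} + \Phi_{G_2} \;>\; \Phi_G, \qquad G = G_1 \cup G_2 .
```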

Result: Empirical experiments show FGSV significantly outperforms state-of-the-art methods in computational efficiency and approximation accuracy while ensuring faithful group-level valuation.

Conclusion: FGSV provides a robust solution to group-level data valuation that prevents strategic manipulation while maintaining practical computational efficiency.

Abstract: Data Shapley is an important tool for data valuation, which quantifies the contribution of individual data points to machine learning models. In practice, group-level data valuation is desirable when data providers contribute data in batch. However, we identify that existing group-level extensions of Data Shapley are vulnerable to shell company attacks, where strategic group splitting can unfairly inflate valuations. We propose Faithful Group Shapley Value (FGSV) that uniquely defends against such attacks. Building on original mathematical insights, we develop a provably fast and accurate approximation algorithm for computing FGSV. Empirical experiments demonstrate that our algorithm significantly outperforms state-of-the-art methods in computational efficiency and approximation accuracy, while ensuring faithful group-level valuation.

[466] Ice-FMBench: A Foundation Model Benchmark for Sea Ice Type Segmentation

Samira Alkaee Taleghan, Morteza Karimzadeh, Andrew P. Barrett, Walter N. Meier, Farnoush Banaei-Kashani

Main category: cs.LG

TL;DR: IceFMBench: A benchmark framework for evaluating foundation models on sea ice type segmentation from SAR imagery, addressing challenges of polar remote sensing and limited labeled data.

DetailsMotivation: Sea ice type segmentation is crucial for polar navigation and climate monitoring, but deep learning requires extensive labeled data. Foundation models show promise but face challenges with SAR imagery's unique characteristics and polar-specific sensor modes.

Method: Developed IceFMBench with standardized dataset, evaluation metrics, and selected remote sensing foundation models. Conducted comparative evaluation and proposed multi-teacher knowledge distillation for spatiotemporal transferability.
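
A generic multi-teacher distillation loss for segmentation, of the kind the benchmark proposes for transferability; the paper's exact teacher-weighting scheme may differ, and the tensor shapes below are assumptions.

    import torch.nn.functional as F

    def multi_teacher_kd_loss(student_logits, teacher_logits_list, labels,
                              weights=None, T=2.0, alpha=0.5):
        # Cross-entropy on labels plus a weighted KL term to each teacher.
        # Logits: (B, C, H, W); labels: (B, H, W) class indices.
        ce = F.cross_entropy(student_logits, labels)
        if weights is None:
            weights = [1.0 / len(teacher_logits_list)] * len(teacher_logits_list)
        log_p_s = F.log_softmax(student_logits / T, dim=1)
        kd = 0.0
        for w, t_logits in zip(weights, teacher_logits_list):
            p_t = F.softmax(t_logits / T, dim=1)
            kd = kd + w * F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T
        return alpha * ce + (1 - alpha) * kd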

Result: Created comprehensive benchmark framework for sea ice segmentation, enabling evaluation of foundation models on SAR imagery with case studies on temporal and spatial transferability.

Conclusion: IceFMBench addresses the gap in applying foundation models to sea ice segmentation, providing tools for model evaluation and knowledge distillation to improve transferability in polar environments.

Abstract: Accurate segmentation and mapping of sea ice types are crucial for safe polar navigation, offshore operations, and climate monitoring. While deep learning has demonstrated strong potential for automating sea ice type segmentation, its success often relies on access to extensive expert-labeled datasets, which are both resource-intensive and time-consuming to create. However, foundation models (FMs), recently developed through self-supervised training on large-scale datasets, have demonstrated impressive performance. Nevertheless, their applicability to sea ice type segmentation based on Synthetic Aperture Radar (SAR) imagery remains uncertain due to the unique challenges posed by sea ice, such as intricate geophysical patterns, pronounced seasonal variability, and SAR-specific artifacts like banding, scalloping, and heterogeneous backscatter. Moreover, SAR data in polar regions are often acquired using specialized sensor modes that differ markedly from those used to collect FM training data at lower latitudes, limiting the models' direct transferability to polar environments. To address this gap, we contribute: (1) IceFMBench, a comprehensive benchmark framework for evaluating state-of-the-art remote sensing FMs on the sea ice type segmentation task using Sentinel-1 SAR imagery, composed of a widely used standardized dataset, diverse evaluation metrics, and a representative set of remote sensing FMs suitable for sea ice type segmentation, with the ability to include new models alongside the existing ones; (2) an extensive comparative evaluation of the representative FMs using IceFMBench, with additional case studies assessing the transferability of the top-performing model across temporal and spatial domains; and (3) a multi-teacher knowledge distillation approach to address the lack of spatiotemporal transferability.

[467] Massive-STEPS: Massive Semantic Trajectories for Understanding POI Check-ins – Dataset and Benchmarks

Wilson Wongso, Hao Xue, Flora D. Salim

Main category: cs.LG

TL;DR: Massive-STEPS is a large-scale benchmark dataset for POI trajectory modeling spanning 15 diverse cities with recent check-in data (2017-2018) to address limitations of older datasets and lack of global representation.

DetailsMotivation: The paper addresses two key challenges in human mobility research: reliance on outdated datasets (2012-2013) and lack of reproducible, city-level check-in datasets that reflect diverse global regions, hindering progress in POI trajectory modeling.

Method: Created Massive-STEPS by building upon the Semantic Trails dataset and enriching it with semantic POI metadata. The dataset spans 15 geographically and culturally diverse cities with more recent (2017-2018) and longer-duration (24 months) check-in data. Benchmarked various POI models using both supervised and zero-shot approaches across multiple urban contexts.

Result: Produced a large-scale, publicly available benchmark dataset that enables reproducible research in human mobility and POI trajectory modeling. The dataset provides more recent and geographically diverse data than previous benchmarks.

Conclusion: Massive-STEPS facilitates reproducible and equitable research in human mobility by addressing dataset limitations and enabling benchmarking across diverse urban contexts, with potential applications in urban planning, personalized services, and generative agent simulation.

Abstract: Understanding human mobility through Point-of-Interest (POI) trajectory modeling is increasingly important for applications such as urban planning, personalized services, and generative agent simulation. However, progress in this field is hindered by two key challenges: the over-reliance on older datasets from 2012-2013 and the lack of reproducible, city-level check-in datasets that reflect diverse global regions. To address these gaps, we present Massive-STEPS (Massive Semantic Trajectories for Understanding POI Check-ins), a large-scale, publicly available benchmark dataset built upon the Semantic Trails dataset and enriched with semantic POI metadata. Massive-STEPS spans 15 geographically and culturally diverse cities and features more recent (2017-2018) and longer-duration (24 months) check-in data than prior datasets. We benchmarked a wide range of POI models on Massive-STEPS using both supervised and zero-shot approaches, and evaluated their performance across multiple urban contexts. By releasing Massive-STEPS, we aim to facilitate reproducible and equitable research in human mobility and POI trajectory modeling. The dataset and benchmarking code are available at: https://github.com/cruiseresearchgroup/Massive-STEPS.

[468] Generalizing Scaling Laws for Dense and Sparse Large Language Models

Md Arafat Hossain, Xingfu Wu, Valerie Taylor, Ali Jannesari

Main category: cs.LG

TL;DR: Proposes a unified scaling law framework applicable to both dense and sparse LLMs, enabling optimal model size prediction and resource allocation for pretraining.

DetailsMotivation: Existing scaling laws are architecture-specific (dense or sparse), making it challenging to optimally predict model size and allocate resources for LLM pretraining across different architectures.

Method: Revisits existing empirical scaling laws and proposes a generalized scaling law that provides a unified framework applicable to both dense and sparse LLMs, including Mixture-of-Experts models.
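
A toy illustration of fitting a unified law, assuming a Chinchilla-style functional form in which N is taken as active parameters so that dense and MoE runs share one equation; the paper's actual parameterization and measurements differ, and the data points below are made-up placeholders purely so the snippet runs.

    import numpy as np
    from scipy.optimize import curve_fit

    def loss_law(X, E, A, alpha, B, beta):
        # L = E + A/N^alpha + B/D^beta; a generalized law like the
        # paper's would additionally tie the coefficients to sparsity.
        N, D = X
        return E + A / N**alpha + B / D**beta

    # N: active parameters, D: training tokens, L: final loss (hypothetical).
    N = np.array([1e8, 3e8, 1e9, 3e9, 1e10, 3e10])
    D = np.array([2e9, 6e9, 2e10, 6e10, 2e11, 6e11])
    L = np.array([3.90, 3.50, 3.10, 2.80, 2.60, 2.45])
    p0 = [1.7, 400.0, 0.34, 410.0, 0.28]  # init near published dense fits
    params, _ = curve_fit(loss_law, (N, D), L, p0=p0, maxfev=20000)
    print(dict(zip(["E", "A", "alpha", "B", "beta"], params)))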

Result: The proposed scaling law captures the scaling behavior of existing laws, shows effectiveness for MoE-based large LLMs like DeepSeek-V3 in IsoFLOP comparisons, and enables estimation of optimal hyperparameters.

Conclusion: A unified scaling law framework enables better model size prediction and resource allocation for LLM pretraining across both dense and sparse architectures.

Abstract: Despite recent advancements of large language models (LLMs), optimally predicting the model size for LLM pretraining or allocating optimal resources still remains a challenge. Several efforts have addressed the challenge by proposing different empirical scaling laws, but almost all of them are architecture-specific (dense or sparse). In this work we revisit existing empirical scaling laws and propose a generalized scaling law to provide a unified framework that is applicable to both dense and sparse large language models. We evaluate and compare our proposed scaling law with existing scaling laws and demonstrate that it captures their scaling behavior. Further, we show an IsoFLOP comparison between our proposed scaling law and the state-of-the-art scaling law to illustrate its effectiveness for Mixture-of-Experts (MoE)-based very large LLMs like DeepSeek-V3. Our proposed scaling law can be used to estimate the best model hyperparameters (model size, tokens, and compute) for a given sparsity, or to identify the optimal sparsity for given model hyperparameters.

[469] Pave Your Own Path: Graph Gradual Domain Adaptation on Fused Gromov-Wasserstein Geodesics

Zhichen Zeng, Ruizhong Qiu, Wenxuan Bao, Tianxin Wei, Xiao Lin, Yuchen Yan, Tarek F. Abdelzaher, Jiawei Han, Hanghang Tong

Main category: cs.LG

TL;DR: Gadget is a gradual domain adaptation framework for non-IID graph data that addresses large distribution shifts by finding optimal intermediate domains via Fused Gromov-Wasserstein geodesics.

DetailsMotivation: Existing graph domain adaptation methods assume mild distribution shifts and focus on IID data with predefined paths, limiting their applicability to real-world scenarios with large shifts on non-IID graphs.

Method: Uses Fused Gromov-Wasserstein distance as domain discrepancy measure for non-IID graphs, derives error bounds, identifies FGW geodesic as optimal path, and proposes algorithm to generate intermediate domains that can be integrated with existing graph DA methods.
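
A numpy sketch of one point on the geodesic, assuming the FGW coupling pi has already been computed (e.g. with POT's ot.gromov.fused_gromov_wasserstein); Gadget's contribution also covers how the path is chosen and discretized, which is not reproduced here.

    import numpy as np

    def fgw_geodesic_point(Cs, Xs, Ct, Xt, pi, t, thresh=1e-8):
        # Nodes of the intermediate graph are source-target pairs (i, j)
        # in the support of the coupling pi (n_s x n_t); structure and
        # node features are linearly interpolated along the geodesic.
        idx = np.argwhere(pi > thresh)
        w = pi[idx[:, 0], idx[:, 1]]                  # node weights
        X = (1 - t) * Xs[idx[:, 0]] + t * Xt[idx[:, 1]]
        C = ((1 - t) * Cs[np.ix_(idx[:, 0], idx[:, 0])]
             + t * Ct[np.ix_(idx[:, 1], idx[:, 1])])
        return C, X, w / w.sum()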

Result: Improves state-of-the-art graph domain adaptation methods by up to 6.8% in accuracy on real-world datasets by handling large distribution shifts on graphs.

Conclusion: Gadget successfully bridges the gap in gradual domain adaptation for non-IID graph data, providing theoretical foundation and practical algorithm for handling large distribution shifts in graph neural networks.

Abstract: Graph neural networks, despite their impressive performance, are highly vulnerable to distribution shifts on graphs. Existing graph domain adaptation (graph DA) methods often implicitly assume a mild shift between source and target graphs, limiting their applicability to real-world scenarios with large shifts. Gradual domain adaptation (GDA) has emerged as a promising approach for addressing large shifts by gradually adapting the source model to the target domain via a path of unlabeled intermediate domains. Existing GDA methods exclusively focus on independent and identically distributed (IID) data with a predefined path, leaving their extension to non-IID graphs without a given path an open challenge. To bridge this gap, we present Gadget, the first GDA framework for non-IID graph data. First (theoretical foundation), the Fused Gromov-Wasserstein (FGW) distance is adopted as the domain discrepancy for non-IID graphs, based on which, we derive an error bound on node, edge and graph-level tasks, showing that the target domain error is proportional to the length of the path. Second (optimal path), guided by the error bound, we identify the FGW geodesic as the optimal path, which can be efficiently generated by our proposed algorithm. The generated path can be seamlessly integrated with existing graph DA methods to handle large shifts on graphs, improving state-of-the-art graph DA methods by up to 6.8% in accuracy on real-world datasets.

[470] AFABench: A Generic Framework for Benchmarking Active Feature Acquisition

Valter Schütz, Han Wu, Reza Rezvan, Linus Aronsson, Morteza Haghir Chehreghani

Main category: cs.LG

TL;DR: AFABench is the first standardized benchmark framework for Active Feature Acquisition (AFA) that enables systematic evaluation of feature selection methods across diverse datasets and policies.

DetailsMotivation: Real-world scenarios often face challenges in acquiring all features due to cost, latency, or privacy constraints. While many AFA methods exist, the lack of standardized benchmarks has hindered fair and systematic evaluation of these approaches.

Method: The authors introduce AFABench with diverse synthetic and real-world datasets, support for various acquisition policies (static, myopic, RL-based), and a modular design for easy integration. They also create CUBE-NM, a novel synthetic dataset to test lookahead capabilities.
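
A sketch of the myopic category the benchmark covers: greedily acquire the feature whose observation is expected to most reduce predictive entropy. The `imputer.sample` interface is hypothetical, standing in for any conditional sampler over unobserved features, and `classifier` is any sklearn-style model.

    import numpy as np

    def entropy(p):
        return -(p * np.log(p + 1e-12)).sum()

    def myopic_acquire(x_obs, observed, candidates, classifier, imputer, n_mc=32):
        # Greedy one-step lookahead: for each candidate feature j, draw
        # completions with j treated as observed and score the expected
        # predictive entropy; acquire the feature with the lowest score.
        best, best_score = None, np.inf
        for j in candidates:
            score = 0.0
            for _ in range(n_mc):
                x = imputer.sample(x_obs, observed | {j})  # hypothetical API
                score += entropy(classifier.predict_proba(x[None])[0]) / n_mc
            if score < best_score:
                best, best_score = j, score
        return best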

Result: The benchmark enables comprehensive evaluation of representative AFA algorithms, highlighting key trade-offs between different strategies and providing insights for future research.

Conclusion: AFABench addresses the critical need for standardized evaluation in AFA research, offering a framework that facilitates fair comparison and advancement of feature acquisition methods.

Abstract: In many real-world scenarios, acquiring all features of a data instance can be expensive or impractical due to monetary cost, latency, or privacy concerns. Active Feature Acquisition (AFA) addresses this challenge by dynamically selecting a subset of informative features for each data instance, trading predictive performance against acquisition cost. While numerous methods have been proposed for AFA, ranging from myopic information-theoretic strategies to non-myopic reinforcement learning approaches, fair and systematic evaluation of these methods has been hindered by a lack of standardized benchmarks. In this paper, we introduce AFABench, the first benchmark framework for AFA. Our benchmark includes a diverse set of synthetic and real-world datasets, supports a wide range of acquisition policies, and provides a modular design that enables easy integration of new methods and tasks. We implement and evaluate representative algorithms from all major categories, including static, myopic, and reinforcement learning-based approaches. To test the lookahead capabilities of AFA policies, we introduce a novel synthetic dataset, CUBE-NM, designed to expose the limitations of myopic selection. Our results highlight key trade-offs between different AFA strategies and provide actionable insights for future research. The benchmark code is available at: https://github.com/Linusaronsson/AFA-Benchmark.

[471] Deconstructing Positional Information: From Attention Logits to Training Biases

Zihan Gu, Ruoyu Chen, Han Zhang, Hua Zhang, Yue Hu

Main category: cs.LG

TL;DR: Theoretical analysis of positional encodings in Transformers reveals that multiplicative encodings outperform additive ones on tasks requiring strong integration of positional and semantic information, and identifies a training bias called “single-head deposit pattern” inherent to multiplicative encodings.

DetailsMotivation: While positional encodings enable Transformers to incorporate sequential information, their theoretical understanding is limited to basic properties like distance attenuation and translation invariance. The interplay between positional and semantic information remains underexplored, especially since natural language lacks purely positional data.

Method: The authors deconstruct attention-logit computation and provide structured analysis of positional encodings, categorizing them into additive and multiplicative forms. They design a synthetic task that explicitly requires strong integration of positional and semantic cues to probe differences between encoding types. Through ablation studies and theoretical analysis, they investigate training dynamics.
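
The dichotomy in numpy: an additive encoding enters the logit through added position vectors, while a multiplicative one (RoPE shown here as the canonical example) rotates queries and keys so the logit depends only on relative position. The paper's specific encodings and synthetic task are not reproduced.

    import numpy as np

    def additive_logit(q, k, p_i, p_j):
        # Additive PE: absolute position vectors are added to the token
        # embeddings, so positions enter the logit through extra terms.
        return (q + p_i) @ (k + p_j)

    def rope(v, pos, base=10000.0):
        # Rotary embedding (RoPE), a multiplicative PE: each 2D sub-pair
        # of v is rotated by an angle proportional to pos.
        d = v.shape[-1]
        theta = pos / base ** (np.arange(d // 2) * 2.0 / d)
        cos, sin = np.cos(theta), np.sin(theta)
        out = np.empty_like(v)
        out[0::2] = v[0::2] * cos - v[1::2] * sin
        out[1::2] = v[0::2] * sin + v[1::2] * cos
        return out

    rng = np.random.default_rng(0)
    q, k = rng.normal(size=64), rng.normal(size=64)
    # Multiplicative logits depend on relative position only:
    print(np.isclose(rope(q, 5) @ rope(k, 2), rope(q, 8) @ rope(k, 5)))  # True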

Result: Multiplicative encodings achieve a clear performance advantage on the synthetic task requiring strong positional-semantic integration. The evaluation also reveals a hidden training bias, the “single-head deposit pattern”: an information aggregation effect in shallow layers that is proven to be inherent in multiplicative encodings.

Conclusion: The findings deepen understanding of positional encodings and reveal fundamental differences between additive and multiplicative forms in how they capture positional information. The identified training bias calls for further study of positional encoding training dynamics in Transformers.

Abstract: Positional encodings enable Transformers to incorporate sequential information, yet their theoretical understanding remains limited to two properties: distance attenuation and translation invariance. Because natural language lacks purely positional data, the interplay between positional and semantic information is still underexplored. We address this gap by deconstructing the attention-logit computation and providing a structured analysis of positional encodings, categorizing them into additive and multiplicative forms. The differing properties of these forms lead to distinct mechanisms for capturing positional information. To probe this difference, we design a synthetic task that explicitly requires strong integration of positional and semantic cues. As predicted, multiplicative encodings achieve a clear performance advantage on this task. Moreover, our evaluation reveals a hidden training bias: an information aggregation effect in shallow layers that we term the single-head deposit pattern. Through ablation studies and theoretical analysis, we prove that this phenomenon is inherent in multiplicative encodings. These findings deepen the understanding of positional encodings and call for further study of their training dynamics.

[472] HiCL: Hippocampal-Inspired Continual Learning

Kushal Kapoor, Wyatt Mackey, Yiannis Aloimonos, Xiaomin Lin

Main category: cs.LG

TL;DR: HiCL is a hippocampal-inspired dual-memory continual learning architecture that mitigates catastrophic forgetting through biologically-inspired modules and a novel DG-gated mixture-of-experts mechanism.

DetailsMotivation: To address catastrophic forgetting in continual learning by drawing inspiration from hippocampal circuitry, which naturally handles sequential learning without interference.

Method: Uses grid-cell-like encoding, dentate gyrus-inspired sparse pattern separation, CA3-like autoassociative memory, and a novel DG-gated mixture-of-experts mechanism for task routing based on cosine similarity between sparse DG representations and learned task prototypes.
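
A compact torch sketch of the DG-gated routing with illustrative tensor names: top-k sparsification plays the role of dentate-gyrus pattern separation, and task prototypes are refreshed by an online EMA (the prototype update would run under torch.no_grad in practice).

    import torch
    import torch.nn.functional as F

    def dg_sparse_code(h, W_dg, k=16):
        # DG-style encoding: project, then keep only the top-k activations.
        a = F.relu(h @ W_dg)
        topk = torch.topk(a, k, dim=-1)
        sparse = torch.zeros_like(a).scatter_(-1, topk.indices, topk.values)
        return F.normalize(sparse, dim=-1)

    def route_and_update(h, W_dg, prototypes, momentum=0.99, k=16):
        # Route each input to the expert whose prototype is closest in
        # cosine similarity to its sparse DG code; EMA-update prototypes.
        code = dg_sparse_code(h, W_dg, k)                  # (B, D)
        sims = code @ F.normalize(prototypes, dim=-1).T   # (B, n_experts)
        expert = sims.argmax(dim=-1)                      # hard routing
        for e in expert.unique():
            mean_code = code[expert == e].mean(dim=0)
            prototypes[e] = momentum * prototypes[e] + (1 - momentum) * mean_code
        return expert, prototypes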

Result: Achieves near state-of-the-art results on continual learning benchmarks with lower computational costs, effectively reducing task interference.

Conclusion: HiCL demonstrates that biologically-inspired architectures can effectively mitigate catastrophic forgetting in continual learning through principled gating strategies and dual-memory systems.

Abstract: We propose HiCL, a novel hippocampal-inspired dual-memory continual learning architecture designed to mitigate catastrophic forgetting by using elements inspired by the hippocampal circuitry. Our system encodes inputs through a grid-cell-like layer, followed by sparse pattern separation using a dentate gyrus-inspired module with top-k sparsity. Episodic memory traces are maintained in a CA3-like autoassociative memory. Task-specific processing is dynamically managed via a DG-gated mixture-of-experts mechanism, wherein inputs are routed to experts based on cosine similarity between their normalized sparse DG representations and learned task-specific DG prototypes computed through online exponential moving averages. This biologically grounded yet mathematically principled gating strategy enables differentiable, scalable task-routing without relying on a separate gating network, and enhances the model’s adaptability and efficiency in learning multiple sequential tasks. Cortical outputs are consolidated using Elastic Weight Consolidation weighted by inter-task similarity. Crucially, we incorporate prioritized replay of stored patterns to reinforce essential past experiences. Evaluations on standard continual learning benchmarks demonstrate the effectiveness of our architecture in reducing task interference, achieving near state-of-the-art results in continual learning tasks at lower computational costs. Our code is available here https://github.com/kushalk173-sc/HiCL.

[473] What Do You Need for Compositional Generalization in Diffusion Planning?

Quentin Clark, Florian Shkurti

Main category: cs.LG

TL;DR: Diffusion planners trained via generative behavioral cloning can achieve compositional generalization through shift equivariance, local receptive fields, and inference choices, enabling stitching of sub-trajectories without dynamic programming.

DetailsMotivation: While stitching and compositional generalization are recognized strengths in offline RL and generative behavioral cloning methods, the underlying factors enabling this capability are poorly understood, hindering development of algorithms that can reliably stitch by design.

Method: Focuses on diffusion planners trained via generative behavioral cloning, identifying three key properties: shift equivariance, local receptive fields, and inference choices. Develops a new architecture called Eq-Net based on these findings, which is simpler and more efficient than existing methods like replanning or data scaling.
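
The two architectural properties are easy to verify in isolation. The toy fully convolutional denoiser below (not Eq-Net itself) is shift-equivariant by construction under circular shifts, and its stacked small kernels give a local receptive field.

    import torch
    import torch.nn as nn

    # Pointwise nonlinearities and circularly padded convolutions commute
    # with circular shifts, so the whole network is shift-equivariant.
    net = nn.Sequential(
        nn.Conv1d(2, 32, kernel_size=5, padding=2, padding_mode="circular"),
        nn.GELU(),
        nn.Conv1d(32, 2, kernel_size=5, padding=2, padding_mode="circular"),
    )

    traj = torch.randn(1, 2, 64)     # (batch, state_dim, horizon)
    shift = 7
    out_then_shift = torch.roll(net(traj), shift, dims=-1)
    shift_then_out = net(torch.roll(traj, shift, dims=-1))
    print(torch.allclose(out_then_shift, shift_then_out, atol=1e-5))  # True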

Result: Experiments show local receptive fields are more important than shift equivariance for composition, but both are crucial. Eq-Net produces diverse trajectories competitive with more computationally expensive methods and enables generalization in goal-conditioned settings, demonstrating significant compositional generalization in navigation and manipulation tasks.

Conclusion: The paper identifies architectural properties enabling compositional generalization in diffusion planners, develops a simple yet effective architecture (Eq-Net) that achieves competitive performance without expensive computation, advancing understanding of how generative behavioral cloning methods achieve stitching capabilities.

Abstract: In policy learning, stitching and compositional generalization refer to the extent to which the policy is able to piece together sub-trajectories of data it is trained on to generate new and diverse behaviours. While stitching has been identified as a significant strength of offline reinforcement learning, recent generative behavioural cloning (BC) methods have also shown proficiency at stitching. However, the main factors behind this are poorly understood, hindering the development of new algorithms that can reliably stitch by design. Focusing on diffusion planners trained via generative behavioural cloning, and without resorting to dynamic programming or TD-learning, we find three properties are key enablers for composition: shift equivariance, local receptive fields, and inference choices. We use these properties to explain architecture, data, and inference choices in existing generative BC methods based on diffusion planning including replanning frequency, data augmentation, and data scaling. Our experiments show that while local receptive fields are more important than shift equivariance in creating a diffusion planner capable of composition, both are crucial. Using findings from our experiments, we develop a new architecture for diffusion planners called Eq-Net, that is simple, produces diverse trajectories competitive with more computationally expensive methods such as replanning or scaling data, and can be guided to enable generalization in goal-conditioned settings. We show that Eq-Net exhibits significant compositional generalization in a variety of navigation and manipulation tasks designed to test planning diversity.

[474] SIMSHIFT: A Benchmark for Adapting Neural Surrogates to Distribution Shifts

Paul Setinek, Gianluca Galletti, Thomas Gross, Dominik Schnürer, Johannes Brandstetter, Werner Zellinger

Main category: cs.LG

TL;DR: SIMSHIFT benchmark for evaluating neural PDE surrogates under domain shifts, with UDA methods applied to industrial simulations

DetailsMotivation: Neural PDE surrogates degrade on out-of-distribution configurations; UDA techniques from vision/language haven't been explored for complex engineering simulations

Method: Created SIMSHIFT benchmark with 4 industrial simulation tasks; extended established UDA methods to state-of-the-art neural surrogates
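
As one concrete instance of the established UDA methods the benchmark extends, a Deep CORAL penalty on the surrogate's latent features; this is a generic implementation, not the benchmark's code.

    import torch

    def coral_loss(h_src, h_tgt):
        # Match second-order statistics of source and target feature
        # batches (B, D) so the surrogate's latents align across domains.
        def cov(h):
            h = h - h.mean(dim=0, keepdim=True)
            return h.T @ h / (h.shape[0] - 1)
        d = h_src.shape[1]
        return ((cov(h_src) - cov(h_tgt)) ** 2).sum() / (4 * d * d)

    # Training step sketch: supervised loss on labeled source simulations
    # plus CORAL alignment with unlabeled target configurations:
    # loss = mse(pred_src, y_src) + lam * coral_loss(feat_src, feat_tgt)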

Result: Experiments highlight challenges of OOD neural surrogate modeling, demonstrate UDA potential in simulation, reveal open problems for robust surrogates

Conclusion: SIMSHIFT enables systematic evaluation of neural surrogates under distribution shifts; UDA shows promise but needs further development for industrial scenarios

Abstract: Neural surrogates for Partial Differential Equations (PDEs) often suffer significant performance degradation when evaluated on problem configurations outside their training distribution, such as new initial conditions or structural dimensions. While Unsupervised Domain Adaptation (UDA) techniques have been widely used in vision and language to generalize across domains without additional labeled data, their application to complex engineering simulations remains largely unexplored. In this work, we address this gap through two focused contributions. First, we introduce SIMSHIFT, a novel benchmark dataset and evaluation suite composed of four industrial simulation tasks spanning diverse processes and physics: hot rolling, sheet metal forming, electric motor design and heatsink design. Second, we extend established UDA methods to state-of-the-art neural surrogates and systematically evaluate them. Extensive experiments on SIMSHIFT highlight the challenges of out-of-distribution neural surrogate modeling, demonstrate the potential of UDA in simulation, and reveal open problems in achieving robust neural surrogates under distribution shifts in industrially relevant scenarios. Our codebase is available at https://github.com/psetinek/simshift

[475] Bridging Past and Future: Distribution-Aware Alignment for Time Series Forecasting

Yifan Hu, Jie Yang, Tian Zhou, Peiyuan Liu, Yujin Tang, Rong Jin, Liang Sun

Main category: cs.LG

TL;DR: TimeAlign is a lightweight plug-and-play framework for time series forecasting that aligns past and future representations through reconstruction tasks to bridge distribution gaps between input histories and future targets.

DetailsMotivation: Contrastive and representation-learning methods have been successful in vision and NLP but remain underutilized in time series forecasting. The authors believe these methods hold strong promise for time series forecasting, particularly for addressing the distributional gap between input histories and future targets.

Method: TimeAlign introduces a novel representation paradigm that explicitly aligns past and future representations through a simple reconstruction task. It’s a lightweight, plug-and-play framework that can be integrated with any base forecaster by aligning auxiliary features and feeding them back into the forecasting model.
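
A minimal torch sketch of the plug-and-play idea, with assumed layer shapes and fusion rule: an auxiliary head is trained to reconstruct the future window from an aligned history feature, which is then fed back to the base forecaster.

    import torch.nn as nn

    class TimeAlignHead(nn.Module):
        # Auxiliary branch: align the history embedding, reconstruct the
        # future from it, and return the aligned feature for fusion.
        def __init__(self, d_model, horizon):
            super().__init__()
            self.align = nn.Linear(d_model, d_model)
            self.recon = nn.Linear(d_model, horizon)

        def forward(self, hist_emb, future=None):
            z = self.align(hist_emb)
            aux_loss = None
            if future is not None:  # training-time alignment objective
                aux_loss = ((self.recon(z) - future) ** 2).mean()
            return z, aux_loss

    # Usage sketch: z, aux = head(encoder(x_hist), y_future)
    #               y_hat = forecaster(encoder(x_hist) + z)
    #               loss = mse(y_hat, y_future) + lam * aux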

Result: Extensive experiments across eight benchmarks show superior performance. Studies indicate gains primarily come from correcting frequency mismatches between historical inputs and future outputs. Theoretical justifications show how reconstruction improves forecasting generalization and how alignment increases mutual information between learned representations and predicted targets.

Conclusion: TimeAlign successfully demonstrates the value of representation learning methods in time series forecasting by bridging the distribution gap between past and future through alignment, offering a practical plug-and-play solution that improves forecasting performance across multiple benchmarks.

Abstract: Although contrastive and other representation-learning methods have long been explored in vision and NLP, their adoption in modern time series forecasters remains limited. We believe they hold strong promise for this domain. To unlock this potential, we explicitly align past and future representations, thereby bridging the distributional gap between input histories and future targets. To this end, we introduce TimeAlign, a lightweight, plug-and-play framework that establishes a new representation paradigm, distinct from contrastive learning, by aligning auxiliary features via a simple reconstruction task and feeding them back into any base forecaster. Extensive experiments across eight benchmarks verify its superior performance. Further studies indicate that the gains arise primarily from correcting frequency mismatches between historical inputs and future outputs. Additionally, we provide two theoretical justifications for how reconstruction improves forecasting generalization and how alignment increases the mutual information between learned representations and predicted targets. The code is available at https://github.com/TROUBADOUR000/TimeAlign.

[476] Deep Learning Foundation Models from Classical Molecular Descriptors

Jackson W. Burns, Akshat Shirish Zalte, Charlles R. A. Abreu, Jochen Sieg, Christian Feldmann, Miriam Mathea, William H. Green

Main category: cs.LG

TL;DR: CheMeleon is a foundation model for molecular property prediction that outperforms classical ML methods on real-world benchmarks with limited training data, achieving 75% win rate on Polaris and 97% on MoleculeACE tasks.

DetailsMotivation: Deep learning methods for molecular property prediction have not outperformed classical machine learning methods on practical, real-world benchmarks with limited training data, creating a gap that needs to be bridged.

Method: CheMeleon is an O(10M)-parameter foundation model built on directed message-passing neural networks. It is pre-trained on low-noise molecular descriptors rather than noisy experimental data or biased quantum mechanical simulations, learning rich and transferable molecular representations.

Result: On 58 benchmark datasets from Polaris and MoleculeACE, CheMeleon achieves a 75% win rate on Polaris tasks (vs 68% for Random Forest, 36% for fastprop, 32% for Chemprop) and a 97% win rate on MoleculeACE assays (vs 50% for Random Forest).

Conclusion: CheMeleon successfully bridges the gap between deep learning and classical methods for molecular property prediction, suggesting a new avenue for foundation model pre-training using low-noise molecular descriptors rather than conventional noisy data sources.

Abstract: Fast and accurate data-driven prediction of molecular properties is pivotal to scientific advancements across myriad chemical domains. Deep learning methods have recently garnered much attention, despite their inability to outperform classical machine learning methods when tested on practical, real-world benchmarks with limited training data. This study seeks to bridge this gap with CheMeleon, an O(10M)-parameter foundation model that enables directed message-passing neural networks to finally exceed the performance of classical methods. Evaluated on 58 benchmark datasets from Polaris and MoleculeACE, CheMeleon achieves a win rate of 75% on Polaris tasks, outperforming baselines like Random Forest (68%), fastprop (36%), and Chemprop (32%), and a 97% win rate on MoleculeACE assays, surpassing Random Forest (50%) and other foundation models. Unlike conventional pre-training approaches that rely on noisy experimental data or biased quantum mechanical simulations, CheMeleon utilizes low-noise molecular descriptors to learn rich and highly transferable molecular representations, suggesting a new avenue for foundation model pre-training.

[477] ClustRecNet: A Novel End-to-End Deep Learning Framework for Clustering Algorithm Recommendation

Mohammadreza Bakhtyari, Bogdan Mazoure, Renato Cordeiro de Amorim, Guillaume Rabusseau, Vladimir Makarenkov

Main category: cs.LG

TL;DR: ClustRecNet is a deep learning framework that recommends suitable clustering algorithms for tabular data by learning high-order representations directly from raw data, outperforming existing AutoML approaches.

DetailsMotivation: The fundamental challenge in unsupervised learning is identifying effective clustering algorithms for tabular datasets. Current approaches rely on manual feature engineering or traditional AutoML methods, creating a knowledge bottleneck.

Method: Built a comprehensive repository of 34,000 synthetic datasets with diverse structures, ran 10 clustering algorithms, and used Adjusted Rand Index (ARI) for ground-truth labels. ClustRecNet integrates convolutional, residual, and attention mechanisms to capture local/global structural patterns directly from raw tabular data.
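
The ground-truth construction reduces to a simple recipe, shown here with three sklearn algorithms standing in for the paper's ten.

    from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
    from sklearn.datasets import make_blobs
    from sklearn.metrics import adjusted_rand_score

    # For each (synthetic) dataset, run the candidate algorithms and mark
    # the one with the best ARI as that dataset's recommendation target.
    algorithms = {
        "kmeans": lambda k: KMeans(n_clusters=k, n_init=10),
        "agglo": lambda k: AgglomerativeClustering(n_clusters=k),
        "dbscan": lambda k: DBSCAN(eps=0.8),  # infers cluster count itself
    }

    X, y = make_blobs(n_samples=300, centers=4, random_state=0)
    scores = {name: adjusted_rand_score(y, make(4).fit_predict(X))
              for name, make in algorithms.items()}
    best = max(scores, key=scores.get)  # meta-label for this dataset
    print(scores, "->", best)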

Result: Outperforms state-of-the-art AutoML approaches (ML2DAC and AutoML4Clust), achieving a 0.497 ARI gain over the Calinski-Harabasz index on synthetic data and a 15.3% ARI improvement over ML2DAC on real-world benchmarks.

Conclusion: First successful application of deep learning to automatically recommend suitable clustering algorithms for tabular data, effectively bypassing manual feature engineering bottlenecks.

Abstract: In unsupervised learning, identifying an effective clustering algorithm for a given tabular dataset remains a fundamental challenge. We introduce ClustRecNet, a novel end-to-end deep learning framework that recommends a suitable clustering algorithm by directly learning high-order representations of raw tabular data. To facilitate robust meta-learning, we construct a comprehensive repository of 34,000 synthetic datasets with diverse structures, run 10 prominent clustering algorithms, and use Adjusted Rand Index (ARI) to establish ground-truth labels. ClustRecNet integrates convolutional, residual, and attention mechanisms to capture both local and global structural patterns, effectively bypassing the knowledge bottleneck associated with manual feature engineering. Extensive evaluations on both synthetic and real-world benchmarks demonstrate that ClustRecNet consistently outperforms state-of-the-art Automated Machine Learning (AutoML) approaches, including ML2DAC and AutoML4Clust. Our framework achieves an average 0.497 ARI gain over the well-known Calinski-Harabasz cluster validity index on synthetic data and an average 15.3% ARI improvement over the leading AutoML approach (ML2DAC) on real-world benchmarks. To the best of our knowledge, we are the first to successfully apply deep learning to automatically recommend suitable clustering algorithms for the tabular data at hand.

[478] Estimating Interventional Distributions with Uncertain Causal Graphs through Meta-Learning

Anish Dhir, Cristiana Diaconu, Valentinian Mihai Lungu, James Requeima, Richard E. Turner, Mark van der Wilk

Main category: cs.LG

TL;DR: MACE-TNP: A meta-learning approach using transformer neural processes to approximate Bayesian model-averaged causal inference without expensive computations.

DetailsMotivation: Causal inference requires estimating intervention effects, but observational data often supports multiple causal structures. Bayesian inference over all possible structures is computationally intractable due to super-exponential growth of possible graphs.

Method: Proposes Model-Averaged Causal Estimation Transformer Neural Process (MACE-TNP), an end-to-end meta-learning model trained to predict Bayesian model-averaged interventional posterior distributions, bypassing expensive calculations.

Result: Empirically demonstrates that MACE-TNP outperforms strong Bayesian baselines in approximating complex Bayesian causal inference.

Conclusion: Meta-learning provides a flexible and scalable paradigm for approximating Bayesian causal inference that can scale to increasingly challenging settings.

Abstract: In scientific domains – from biology to the social sciences – many questions boil down to: what effect will we observe if we intervene on a particular variable? If the causal relationships (e.g., a causal graph) are known, it is possible to estimate the intervention distributions. In the absence of this domain knowledge, the causal structure must be discovered from the available observational data. However, observational data are often compatible with multiple causal graphs, making methods that commit to a single structure prone to overconfidence. A principled way to manage this structural uncertainty is via Bayesian inference, which averages over a posterior distribution on possible causal structures and functional mechanisms. Unfortunately, the number of causal structures grows super-exponentially with the number of nodes in the graph, making computations intractable. We propose to circumvent these challenges by using meta-learning to create an end-to-end model: the Model-Averaged Causal Estimation Transformer Neural Process (MACE-TNP). The model is trained to predict the Bayesian model-averaged interventional posterior distribution, and its end-to-end nature bypasses the need for expensive calculations. Empirically, we demonstrate that MACE-TNP outperforms strong Bayesian baselines. Our work establishes meta-learning as a flexible and scalable paradigm for approximating complex Bayesian causal inference that can be scaled to increasingly challenging settings in the future.

[479] Noisy-Pair Robust Representation Alignment for Positive-Unlabeled Learning

Hengwei Zhao, Zhengzhong Tu, Zhuo Zheng, Wei Wang, Junjue Wang, Rusty Feagin, Wenzhe Jiao

Main category: cs.LG

TL;DR: NcPU: A non-contrastive PU learning framework that addresses representation learning challenges in Positive-Unlabeled classification without requiring auxiliary negatives or pre-estimated parameters.

DetailsMotivation: Current PU learning methods underperform supervised counterparts on complex datasets due to unreliable supervision making discriminative representation learning challenging. There's a need for methods that work without auxiliary negatives or pre-estimated parameters.

Method: Proposes NcPU combining: 1) NoiSNCL, a noisy-pair robust supervised non-contrastive loss that aligns intra-class representations despite unreliable supervision, and 2) PLD, a phantom label disambiguation scheme that provides conservative negative supervision via regret-based label updates. The two components are theoretically linked through an Expectation-Maximization framework.

Result: NcPU achieves substantial improvements over state-of-the-art PU methods across diverse datasets, including challenging post-disaster building damage mapping applications. NoiSNCL enables simple PU methods to achieve competitive performance.

Conclusion: NcPU effectively addresses representation learning challenges in PU learning without requiring auxiliary information, showing promise for real-world applications with complex datasets.

Abstract: Positive-Unlabeled (PU) learning aims to train a binary classifier (positive vs. negative) where only limited positive data and abundant unlabeled data are available. While widely applicable, state-of-the-art PU learning methods substantially underperform their supervised counterparts on complex datasets, especially without auxiliary negatives or pre-estimated parameters (e.g., a 14.26% gap on the CIFAR-100 dataset). We identify the primary bottleneck as the challenge of learning discriminative representations under unreliable supervision. To tackle this challenge, we propose NcPU, a non-contrastive PU learning framework that requires no auxiliary information. NcPU combines a noisy-pair robust supervised non-contrastive loss (NoiSNCL), which aligns intra-class representations despite unreliable supervision, with a phantom label disambiguation (PLD) scheme that supplies conservative negative supervision via regret-based label updates. Theoretically, NoiSNCL and PLD can iteratively benefit each other from the perspective of the Expectation-Maximization framework. Empirically, extensive experiments demonstrate that: (1) NoiSNCL enables simple PU methods to achieve competitive performance; and (2) NcPU achieves substantial improvements over state-of-the-art PU methods across diverse datasets, including challenging datasets on post-disaster building damage mapping, highlighting its promise for real-world applications. Code will be open-sourced after review.

[480] PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits

Loka Li, Wong Yu Kang, Minghao Fu, Guangyi Chen, Zhenhao Chen, Gongxu Luo, Yuewen Sun, Salman Khan, Peter Spirtes, Kun Zhang

Main category: cs.LG

TL;DR: PersonaX is a multimodal dataset collection for analyzing human behavior traits across modalities, featuring behavioral trait assessments from LLMs combined with facial imagery and biographical data.

DetailsMotivation: Existing resources lack datasets that combine behavioral descriptors with complementary modalities like facial attributes and biographical information, limiting comprehensive analysis of human behavior traits across modalities.

Method: Created two datasets: CelebPersona (9,444 public figures) and AthlePersona (4,181 athletes). Includes behavioral trait assessments from three high-performing LLMs, facial imagery, and structured biographical features. Analyzed using statistical independence tests and a novel causal representation learning framework with theoretical identifiability guarantees.

Result: Experiments on synthetic and real-world data demonstrate the effectiveness of the approach. PersonaX enables studying LLM-inferred behavioral traits in conjunction with visual and biographical attributes.

Conclusion: PersonaX establishes a foundation for multimodal trait analysis and causal reasoning by unifying structured and unstructured analysis of behavioral traits across modalities.

Abstract: Understanding human behavior traits is central to applications in human-computer interaction, computational social science, and personalized AI systems. Such understanding often requires integrating multiple modalities to capture nuanced patterns and relationships. However, existing resources rarely provide datasets that combine behavioral descriptors with complementary modalities such as facial attributes and biographical information. To address this gap, we present PersonaX, a curated collection of multimodal datasets designed to enable comprehensive analysis of public traits across modalities. PersonaX consists of (1) CelebPersona, featuring 9444 public figures from diverse occupations, and (2) AthlePersona, covering 4181 professional athletes across 7 major sports leagues. Each dataset includes behavioral trait assessments inferred by three high-performing large language models, alongside facial imagery and structured biographical features. We analyze PersonaX at two complementary levels. First, we abstract high-level trait scores from text descriptions and apply five statistical independence tests to examine their relationships with other modalities. Second, we introduce a novel causal representation learning (CRL) framework tailored to multimodal and multi-measurement data, providing theoretical identifiability guarantees. Experiments on both synthetic and real-world data demonstrate the effectiveness of our approach. By unifying structured and unstructured analysis, PersonaX establishes a foundation for studying LLM-inferred behavioral traits in conjunction with visual and biographical attributes, advancing multimodal trait analysis and causal reasoning. The code is available at https://github.com/lokali/PersonaX.

[481] Attention Sinks and Compression Valleys in LLMs are Two Sides of the Same Coin

Enrique Queipo-de-Llano, Álvaro Arroyo, Federico Barbero, Xiaowen Dong, Michael Bronstein, Yann LeCun, Ravid Shwartz-Ziv

Main category: cs.LG

TL;DR: The paper reveals a connection between attention sinks and compression valleys in LLMs, showing both stem from massive activations in residual streams, and proposes a Mix-Compress-Refine theory of information flow.

DetailsMotivation: To understand the puzzling phenomena of attention sinks and compression valleys in large language models, which have been studied separately but may share underlying mechanisms.

Method: Theoretical analysis proving massive activations cause representational compression, experimental validation across models (410M-120B parameters), and targeted ablation studies to test predictions.
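
The co-occurrence is straightforward to probe with Hugging Face transformers; gpt2 below is an arbitrary small stand-in for the 410M-120B models studied, with its first token playing the sink role.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "gpt2"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    ids = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True, output_attentions=True)

    # Per layer: norm of the first token's residual-stream state, and the
    # mean attention mass all queries place on that token (the "sink").
    for layer, (h, att) in enumerate(zip(out.hidden_states[1:], out.attentions)):
        bos_norm = h[0, 0].norm().item()
        sink_mass = att[0, :, :, 0].mean().item()
        print(f"layer {layer:2d}  first_tok_norm={bos_norm:8.1f}  sink={sink_mass:.2f}")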

Result: Shows that when beginning-of-sequence tokens develop extreme activation norms in middle layers, both compression valleys and attention sinks emerge simultaneously, supporting the unified theory.

Conclusion: Proposes Mix-Compress-Refine theory explaining how LLMs organize computation: early layers mix broadly, middle layers compress with limited mixing, late layers refine selectively, explaining task-dependent representation differences.

Abstract: Attention sinks and compression valleys have attracted significant attention as two puzzling phenomena in large language models, but have been studied in isolation. In this work, we present a surprising connection between attention sinks and compression valleys, tracing both to the formation of massive activations in the residual stream. We prove theoretically that massive activations necessarily produce representational compression and establish bounds on the resulting entropy reduction. Through experiments across several models (410M-120B parameters), we confirm that when the beginning-of-sequence token develops extreme activation norms in the middle layers, both compression valleys and attention sinks emerge simultaneously. Targeted ablation studies validate our theoretical predictions. This unified view motivates us to propose the Mix-Compress-Refine theory of information flow, as an attempt to explain how LLMs organize their computation in depth by controlling attention and representational compression via massive activations. Specifically, we posit that Transformer-based LLMs process tokens in three distinct phases: (1) broad mixing in the early layers, (2) compressed computation with limited mixing in the middle layers, and (3) selective refinement in the late layers. Our framework helps explain why embedding tasks perform best at intermediate layers, whereas generation tasks benefit from full-depth processing, clarifying differences in task-dependent representations.

[482] Exact Subgraph Isomorphism Network with Mixed $L_{0,2}$ Norm Constraint for Predictive Graph Mining

Taiga Kojima, Haruto Kajita, Ayato Kohara, Masayuki Karasuyama

Main category: cs.LG

TL;DR: EIN (Exact subgraph Isomorphism Network) combines exact subgraph enumeration with neural networks and sparse regularization for graph-level prediction tasks, achieving high performance and interpretability.

DetailsMotivation: Graph-level prediction tasks require understanding subgraph information, but building models with both high discriminative ability and interpretability remains challenging. Existing methods often sacrifice one for the other.

Method: EIN combines exact subgraph enumeration to capture structural information, neural networks for discriminative learning, and mixed L0,2 norm sparse regularization for feature selection and computational efficiency.
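
The mixed $L_{0,2}$ constraint admits a closed-form projection, which is where both the pruning leverage and the interpretability come from: only a budgeted number of subgraph weight groups survive. A numpy sketch, assuming one weight row per enumerated subgraph.

    import numpy as np

    def project_l02(W, s):
        # Project row-groups of W onto the mixed L_{0,2} ball: keep the s
        # rows with the largest Euclidean norms and zero out the rest.
        # Surviving rows mark the "important subgraphs".
        norms = np.linalg.norm(W, axis=1)
        keep = np.argsort(norms)[-s:]
        out = np.zeros_like(W)
        out[keep] = W[keep]
        return out

    # Projected-gradient flavour: after each gradient step on the layer
    # holding one weight group per subgraph, re-project:
    # W = project_l02(W - lr * grad_W, s=20)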

Result: EIN achieves competitive prediction performance compared to standard graph neural networks while enabling interpretable subgraph identification and effective pruning strategies.

Conclusion: EIN successfully addresses the trade-off between performance and interpretability in graph-level prediction through its combination of exact enumeration, neural networks, and sparse regularization.

Abstract: In the graph-level prediction task (predicting a label for a given graph), the information contained in subgraphs of the input graph plays a key role. In this paper, we propose Exact subgraph Isomorphism Network (EIN), which combines exact subgraph enumeration, a neural network, and sparse regularization via a mixed $L_{0,2}$ norm constraint. In general, building a graph-level prediction model that achieves high discriminative ability along with interpretability is still a challenging problem. Our combination of subgraph enumeration and a neural network contributes to high discriminative ability with respect to the subgraph structure of the input graph. Further, the sparse regularization in EIN enables us 1) to derive an effective pruning strategy that mitigates the computational difficulty of the enumeration while maintaining prediction performance, and 2) to identify important subgraphs that contribute to high interpretability. We empirically show that EIN achieves sufficiently high prediction performance compared with standard graph neural network models, and we also show examples of post-hoc analysis based on the selected subgraphs.

[483] Bandits with Single-Peaked Preferences and Limited Resources

Omer Ben-Porat, Gur Keinan, Rotem Torkan

Main category: cs.LG

TL;DR: Online stochastic matching algorithm for single-peaked preferences with budget constraints and efficient regret bounds

DetailsMotivation: Online matching with budget constraints is computationally hard (NP-hard) without structural assumptions, motivating the use of single-peaked preferences from social choice theory to enable efficient algorithms.

Method: Developed an efficient offline algorithm for budgeted matching with single-peaked preferences, then leveraged it into an efficient online algorithm via a novel PQ tree-based order approximation; also created a UCB-like algorithm for when the single-peaked structure is known.

Result: Achieved regret bound of Õ(UKT^{2/3}) for general case and Õ(U√(TK)) when single-peaked structure is known, providing efficient solutions to computationally hard problem.

Conclusion: Single-peaked preferences enable efficient online matching algorithms with provable regret bounds, overcoming computational hardness through structural assumptions from social choice theory.

Abstract: We study an online stochastic matching problem in which an algorithm sequentially matches $U$ users to $K$ arms, aiming to maximize cumulative reward over $T$ rounds under budget constraints. Without structural assumptions, computing the optimal matching is NP-hard, making online learning computationally infeasible. To overcome this barrier, we focus on single-peaked preferences – a well-established structure in social choice theory, where users’ preferences are unimodal with respect to a common order over arms. We devise an efficient algorithm for the offline budgeted matching problem, and leverage it into an efficient online algorithm with a regret of $\tilde O(UKT^{2/3})$. Our approach relies on a novel PQ tree-based order approximation method. If the single-peaked structure is known, we develop an efficient UCB-like algorithm that achieves a regret bound of $\tilde O(U\sqrt{TK})$.

[484] One-Prompt Strikes Back: Sparse Mixture of Experts for Prompt-based Continual Learning

Minh Le, Bao-Ngoc Dao, Huy Nguyen, Quyen Tran, Anh Nguyen, Nhat Ho

Main category: cs.LG

TL;DR: SMoPE: A sparse mixture-of-experts prompt framework for continual learning that balances task-specific and shared prompts to reduce computational overhead while maintaining performance.

DetailsMotivation: Current prompt-based continual learning methods face a trade-off: task-specific prompts are effective but computationally expensive (scaling linearly with tasks), while shared prompts are efficient but suffer from knowledge interference between tasks.

Method: Proposes SMoPE framework that organizes shared prompts into multiple “prompt experts” within a sparse MoE architecture. Uses prompt-attention score aggregation for dynamic expert selection, adaptive noise for balanced utilization, and prototype-based loss for expert specialization using prefix keys as implicit memory.

Result: Extensive experiments across multiple CL benchmarks show SMoPE consistently outperforms task-specific prompt methods and achieves competitive state-of-the-art performance while significantly reducing parameters and computational costs.

Conclusion: SMoPE successfully reconciles the efficiency-performance trade-off in prompt-based continual learning by combining benefits of task-specific and shared prompts through sparse mixture-of-experts architecture with dynamic expert selection.

Abstract: Prompt-based methods have recently gained prominence in Continual Learning (CL) due to their strong performance and memory efficiency. A prevalent strategy in this paradigm assigns a dedicated subset of prompts to each task, which, while effective, incurs substantial computational overhead and causes memory requirements to scale linearly with the number of tasks. Conversely, approaches employing a single shared prompt across tasks offer greater efficiency but often suffer from degraded performance due to knowledge interference. To reconcile this trade-off, we propose SMoPE, a novel framework that integrates the benefits of both task-specific and shared prompt strategies. Inspired by recent findings on the relationship between Prefix Tuning and Mixture of Experts (MoE), SMoPE organizes a shared prompt into multiple “prompt experts” within a sparse MoE architecture. For each input, only a select subset of relevant experts is activated, effectively mitigating interference. To facilitate expert selection, we introduce a prompt-attention score aggregation mechanism that computes a unified proxy score for each expert, enabling dynamic and sparse activation. Additionally, we propose an adaptive noise mechanism to encourage balanced expert utilization while preserving knowledge from prior tasks. To further enhance expert specialization, we design a prototype-based loss function that leverages prefix keys as implicit memory representations. Extensive experiments across multiple CL benchmarks demonstrate that SMoPE consistently outperforms task-specific prompt methods and achieves performance competitive with state-of-the-art approaches, all while significantly reducing parameter counts and computational costs.

[485] Wasserstein projection distance for fairness testing of regression models

Wanxin Li, Yongjin P. Park, Khanh Dao Duc

Main category: cs.LG

TL;DR: A framework for fairness testing in regression models using Wasserstein distance to project data distributions and test expectation-based fairness criteria, with hypothesis testing and data perturbation methods.

DetailsMotivation: Most fairness testing research has focused on classification models, leaving regression models underexplored. There's a need for systematic fairness testing frameworks specifically designed for regression tasks.

Method: Categorizes fairness criteria for regression, derives Wasserstein projection test statistic from dual reformulation, establishes asymptotic bounds and limiting distributions, and develops both hypothesis-testing procedures and optimal data perturbation methods to improve fairness while balancing accuracy.
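
The underlying gap statistic is simple to compute; the sketch below pairs it with the permutation baseline that the paper's asymptotic test is shown to beat on specificity. Data here are synthetic placeholders.

    import numpy as np
    from scipy.stats import wasserstein_distance

    def fairness_gap(y_pred, group):
        # 1-Wasserstein distance between the two groups' prediction
        # distributions: the quantity the projection statistic builds on.
        return wasserstein_distance(y_pred[group == 0], y_pred[group == 1])

    rng = np.random.default_rng(0)
    y_pred = rng.normal(size=500)            # placeholder model outputs
    group = rng.integers(0, 2, size=500)     # placeholder group labels
    stat = fairness_gap(y_pred, group)
    null = [fairness_gap(y_pred, rng.permutation(group)) for _ in range(999)]
    p_value = (1 + sum(n >= stat for n in null)) / 1000
    print(f"W1 = {stat:.3f}, permutation p = {p_value:.3f}")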

Result: Experiments on synthetic data show higher specificity compared to permutation-based tests. Real-world case studies reveal: (1) statistically significant gender disparities in student performance data across multiple models, and (2) significant unfairness between pollution areas affecting housing price data under multiple fairness criteria, robust to different group divisions.

Conclusion: The proposed framework provides a systematic approach for fairness testing in regression models, offering both statistical testing capabilities and methods to improve fairness while maintaining accuracy, with demonstrated effectiveness on real-world datasets.

Abstract: Fairness testing evaluates whether a model satisfies a specified fairness criterion across different groups, yet most research has focused on classification models, leaving regression models underexplored. This paper introduces a framework for fairness testing in regression models, leveraging the Wasserstein distance to project data distributions and focusing on expectation-based criteria. Upon categorizing fairness criteria for regression, we derive a Wasserstein projection test statistic from a dual reformulation, and establish asymptotic bounds and limiting distributions, allowing us to formulate both a hypothesis-testing procedure and an optimal data perturbation method that improves fairness while balancing accuracy. Experiments on synthetic data demonstrate that the proposed hypothesis-testing approach offers higher specificity compared to permutation-based tests. To illustrate its potential applications, we apply our framework to two case studies on real data, showing (1) statistically significant gender disparities in student performance data across multiple models, and (2) significant unfairness between pollution areas affecting housing price data under multiple fairness criteria, robust to different group divisions, with feature-level analysis identifying spatial and socioeconomic drivers.

[486] NeuroRVQ: Multi-Scale EEG Tokenization for Generative Large Brainwave Models

Konstantinos Barmpas, Na Lee, Alexandros Koliousis, Yannis Panagakis, Dimitrios A. Adamos, Nikolaos Laskaris, Stefanos Zafeiriou

Main category: cs.LG

TL;DR: NeuroRVQ introduces a codebook-based tokenizer for EEG signals that preserves high-frequency dynamics through multi-scale feature extraction and hierarchical residual vector quantization, enabling better reconstruction and downstream task performance.

DetailsMotivation: Existing EEG foundation models have limited performance due to signal tokenization modules that fail to preserve high-frequency dynamics, hindering accurate EEG signal reconstruction and representation learning.

Method: NeuroRVQ uses a codebook-based tokenizer with: (1) multi-scale feature extraction to capture full frequency neural spectrum, (2) hierarchical residual vector quantization (RVQ) codebooks for high-resolution encoding, and (3) EEG signal phase- and amplitude-aware loss function for efficient training.
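
Hierarchical residual vector quantization itself is standard; a minimal NumPy sketch of the encoding loop follows (the codebook sizes, stage count, and toy "EEG features" are assumptions, and the real model learns its codebooks end to end):

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Hierarchical residual VQ: each stage quantizes what the
    previous stages left unexplained."""
    residual = x.copy()
    codes, recon = [], np.zeros_like(x)
    for cb in codebooks:  # cb: (num_codes, dim)
        dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)
        codes.append(idx)
        recon += cb[idx]
        residual = x - recon
    return codes, recon

# Toy stand-in for EEG features: 3 stages of 32 codes over 16-dim vectors,
# with later stages spanning progressively finer scales.
codebooks = [rng.normal(size=(32, 16)) * s for s in (1.0, 0.3, 0.1)]
x = rng.normal(size=(8, 16))
codes, recon = rvq_encode(x, codebooks)
print("stages:", len(codes), "reconstruction MSE:", ((x - recon) ** 2).mean())
```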

Result: NeuroRVQ achieves lower reconstruction error and outperforms existing Large Brainwave Models on various downstream tasks, establishing a strong prior for codebook-based general-purpose brainwave models.

Conclusion: NeuroRVQ provides an effective tokenizer for EEG foundation models that enables efficient compression with accurate reconstruction across all frequency bands, supporting advances in neural decoding, generative modeling, and multimodal biosignal integration.

Abstract: Electroencephalography (EEG) captures neural activity across multiple temporal and spectral scales, yielding signals that are rich but complex for representation learning. Recently, EEG foundation models trained to predict masked signal-tokens have shown promise for learning generalizable representations. However, their performance is hindered by their signal tokenization modules. Existing neural tokenizers fail to preserve high-frequency dynamics, limiting their ability to reconstruct EEG signals with high fidelity. We introduce NeuroRVQ, a scalable Large Brainwave Model (LBM) centered on a codebook-based tokenizer. Our tokenizer integrates: (i) multi-scale feature extraction modules that capture the full frequency neural spectrum; (ii) hierarchical residual vector quantization (RVQ) codebooks for high-resolution encoding; and, (iii) an EEG signal phase- and amplitude-aware loss function for efficient training. This design enables efficient EEG compression while supporting accurate reconstruction across all frequency bands, leading to robust generative masked modeling. Our empirical results demonstrate that NeuroRVQ achieves lower reconstruction error and outperforms existing LBMs on a variety of downstream tasks. More broadly, NeuroRVQ tokenizer establishes a strong prior for codebook-based general-purpose brainwave models, enabling advances in neural decoding, generative modeling and multimodal biosignal integration.

[487] Multi-Objective $\textit{min-max}$ Online Convex Optimization

Rahul Vaze, Sumiran Mishra

Main category: cs.LG

TL;DR: Multi-objective online convex optimization with K loss sequences, focusing on min-max regret, where the algorithm must track all sequences simultaneously.

DetailsMotivation: Extends traditional single-objective OCO to multi-objective setting where K different loss functions must be tracked simultaneously, capturing tradeoffs between competing objectives.

Method: Proposes an algorithm combining Hedge (multiplicative weights over the K objectives) and Online Gradient Descent (OGD) under a stochastic i.i.d. input model, with extensions to Martingale difference and Markov models.
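
A hedged sketch of that Hedge-plus-OGD structure: Hedge keeps exponential weights that concentrate on the currently worst objective, and OGD descends the weighted gradient. The quadratic losses, learning rates, and i.i.d. noise model below are illustrative assumptions, not the paper's setting:

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, d = 1000, 3, 5
eta_w, eta_x = np.sqrt(np.log(K) / T), 1.0 / np.sqrt(T)

# K i.i.d. quadratic loss sequences with different random targets.
targets = rng.normal(size=(K, d))

x = np.zeros(d)      # OGD iterate
logw = np.zeros(K)   # Hedge log-weights over the K objectives

for t in range(T):
    noise = rng.normal(scale=0.1, size=(K, d))
    losses = 0.5 * ((x - (targets + noise)) ** 2).sum(axis=1)
    # Hedge up-weights the objectives currently doing worst, so OGD
    # spends its gradient effort on the max of the K losses.
    w = np.exp(logw - logw.max()); w /= w.sum()
    grad = (w[:, None] * (x - (targets + noise))).sum(axis=0)
    x -= eta_x * grad
    logw += eta_w * losses  # exponential weights on incurred losses

print("min-max cost of final iterate:",
      max(0.5 * ((x - targets[k]) ** 2).sum() for k in range(K)))
```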

Result: Shows that adversarial input leads to linear regret, but under stochastic i.i.d. input the proposed algorithm achieves O(√(T log(TK))) expected min-max regret.

Conclusion: Multi-objective OCO with min-max regret is tractable under stochastic input models, with efficient algorithms achieving sublinear regret bounds.

Abstract: In this paper, we broaden the horizon of online convex optimization (OCO), and consider multi-objective OCO, where there are $K$ distinct loss function sequences, and an algorithm has to choose its action at time $t$, before the $K$ loss functions at time $t$ are revealed. To capture the tradeoff between tracking the $K$ different sequences, we consider the {\it min-max} regret, where the benchmark (optimal offline algorithm) takes a static action across all time slots that minimizes the maximum of the total loss (summed across time slots) incurred by each of the $K$ sequences. An online algorithm is allowed to change its action across time slots, and its {\it min-max} regret is defined as the difference between its {\it min-max} cost and that of the benchmark. The {\it min-max} regret is a stringent performance measure and an algorithm with small regret needs to 'track' all loss functions simultaneously. We first show that with adversarial input, {\it min-max} regret scales linearly with the time horizon $T$ for any online algorithm. Consequently, we consider a stochastic i.i.d. input model where all loss functions are i.i.d. generated from an unknown joint distribution and propose a simple algorithm that combines the well-known {\it Hedge} and online gradient descent (OGD) and show via a remarkably simple proof that its expected {\it min-max} regret is $O(\sqrt{T \log (T K)})$. Analogous results are also derived for Martingale difference and Markov input models.

[488] Beyond Pairwise: Empowering LLM Alignment With Ranked Choice Modeling

Yuxuan Tang, Yifan Feng

Main category: cs.LG

TL;DR: RCPO is a unified framework for LLM alignment that extends beyond pairwise preferences to incorporate richer human feedback like multiway comparisons and rankings via choice modeling.

DetailsMotivation: Current LLM alignment relies on pairwise preference optimization which overlooks richer forms of human feedback like multiway comparisons and rankings that could provide more informative training signals.

Method: Introduces Ranked Choice Preference Optimization (RCPO), a unified framework bridging preference optimization with choice modeling via maximum likelihood estimation. Supports both utility-based and rank-based models, subsumes pairwise methods (DPO, SimPO) as special cases, and provides principled training objectives for richer feedback formats. Instantiates with Multinomial Logit and Mallows-RMJ models.
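
For the Multinomial Logit instantiation, the ranking likelihood is the Plackett-Luce model; below is a minimal PyTorch sketch of that negative log-likelihood over per-response scores (the scores themselves, e.g. policy log-probability margins, are a toy stand-in):

```python
import torch

def plackett_luce_nll(scores: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of a ranking under the Plackett-Luce /
    Multinomial Logit model. `scores` is ordered best-to-worst:
    P(ranking) = prod_i softmax(scores[i:])[0]."""
    nll = 0.0
    for i in range(scores.shape[-1] - 1):
        nll = nll - torch.log_softmax(scores[i:], dim=-1)[0]
    return nll

# Toy: model scores for 4 responses to one prompt, sorted by human rank.
scores = torch.tensor([2.1, 1.3, 0.4, -0.5], requires_grad=True)
loss = plackett_luce_nll(scores)
loss.backward()  # gradients push the preferred responses' scores apart
print(loss.item(), scores.grad)
```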

Result: Experiments on Llama-3-8B-Instruct, Gemma-2-9B-it, and Mistral-7B-Instruct across in-distribution and out-of-distribution settings show RCPO consistently outperforms competitive baselines.

Conclusion: Directly leveraging ranked preference data with appropriate choice models yields more effective alignment. RCPO offers an extensible foundation for incorporating choice modeling into LLM training.

Abstract: Alignment of large language models (LLMs) has predominantly relied on pairwise preference optimization, where annotators select the better of two responses to a prompt. While simple, this approach overlooks the opportunity to learn from richer forms of human feedback, such as multiway comparisons and top-$k$ rankings. We introduce Ranked Choice Preference Optimization (RCPO), a unified framework that bridges preference optimization with (ranked) choice modeling via maximum likelihood estimation. RCPO supports both utility-based and rank-based models, subsumes several pairwise methods (such as DPO and SimPO) as special cases, and provides principled training objectives for richer feedback formats. We instantiate this framework with two representative models (Multinomial Logit and Mallows-RMJ). Experiments on Llama-3-8B-Instruct, Gemma-2-9B-it, and Mistral-7B-Instruct across in-distribution and out-of-distribution settings show that RCPO consistently outperforms competitive baselines. RCPO shows that directly leveraging ranked preference data, combined with the right choice models, yields more effective alignment. It offers an extensible foundation for incorporating (ranked) choice modeling into LLM training.

[489] Why Policy Gradient Algorithms Work for Undiscounted Total-Reward MDPs

Jongmin Lee, Ernest K. Ryu

Main category: cs.LG

TL;DR: Policy gradient analysis for undiscounted total-reward MDPs with γ=1, addressing a theory gap for LLM policy-based RL.

DetailsMotivation: Existing policy gradient theory assumes γ<1, but recent LLM policy-based RL uses γ=1, creating a theory gap.

Method: Two key insights: (1) state classification into recurrent/transient is invariant for policies with positive action probabilities, (2) replace classical state visitation measure with new transient visitation measure

Result: Provides a theoretical analysis of the policy gradient method for undiscounted expected total-reward infinite-horizon MDPs.

Conclusion: Bridges the theory gap for policy-based RL with γ=1, relevant for LLM applications.

Abstract: The classical policy gradient method is the theoretical and conceptual foundation of modern policy-based reinforcement learning (RL) algorithms. Most rigorous analyses of such methods, particularly those establishing convergence guarantees, assume a discount factor $\gamma < 1$. In contrast, however, a recent line of work on policy-based RL for large language models uses the undiscounted total-reward setting with $\gamma = 1$, rendering much of the existing theory inapplicable. In this paper, we provide analyses of the policy gradient method for undiscounted expected total-reward infinite-horizon MDPs based on two key insights: (i) the classification of the MDP states into recurrent and transient states is invariant over the set of policies that assign strictly positive probability to every action (as is typical in deep RL models employing a softmax output layer) and (ii) the classical state visitation measure (which may be ill-defined when $\gamma = 1$) can be replaced with a new object that we call the transient visitation measure.

[490] Neural network initialization with nonlinear characteristics and information on hierarchical features

Hikaru Homma, Jun Ohkubo

Main category: cs.LG

TL;DR: Proposes a hierarchical initialization method for neural networks that adjusts scale factors in SWIM algorithm to capture low-frequency components in early layers and high-frequency components in late layers, improving training performance.

DetailsMotivation: Neural network initialization significantly impacts learning performance; with a good initialization, some methods can even avoid additional training with backpropagation. Existing research shows neural networks learn hierarchical features (coarse information in early layers, fine details in later layers). The paper aims to leverage this hierarchical feature information to improve initialization strategies.

Method: Proposes a framework that modifies the SWIM (sampling where it matters) algorithm by adjusting scale factors to explicitly capture hierarchical features: low-frequency components in early-stage hidden layers and high-frequency components in late-stage hidden layers.
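
SWIM's data-driven sampling is beyond a short snippet, but the paper's central knob, depth-dependent scale factors, can be sketched on a plain dense initialization; the linear schedule and endpoint values below are assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def scaled_init(layer_dims, low=0.5, high=4.0):
    """Depth-dependent scale factors: small scales in early layers bias
    units toward smooth, low-frequency responses; larger scales in late
    layers toward sharper, high-frequency ones (schedule is illustrative)."""
    n_layers = len(layer_dims) - 1
    scales = np.linspace(low, high, n_layers)
    weights = []
    for s, (d_in, d_out) in zip(scales, zip(layer_dims[:-1], layer_dims[1:])):
        w = rng.normal(size=(d_in, d_out)) / np.sqrt(d_in)
        weights.append(s * w)
    return weights, scales

weights, scales = scaled_init([1, 64, 64, 64, 1])
print("per-layer scale factors:", scales)
```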

Result: Numerical experiments on 1D regression and MNIST classification tasks demonstrate the proposed method outperforms conventional initialization algorithms, showing improved training performance.

Conclusion: The work clarifies the importance of intrinsic hierarchical features in neural network learning and yields an effective parameter initialization strategy that enhances training performance by explicitly encoding hierarchical feature learning patterns.

Abstract: Initialization of neural network parameters, such as weights and biases, has a crucial impact on learning performance; if chosen well, we can even avoid the need for additional training with backpropagation. For example, algorithms based on the ridgelet transform or the SWIM (sampling where it matters) concept have been proposed for initialization. On the other hand, some works show hierarchical features in trained neural networks; neural networks tend to learn coarse information in the early-stage hidden layers. In this work, we investigate the effects of utilizing information on the hierarchical features in the initialization of neural networks. Hence, we propose a framework that adjusts the scale factors in the SWIM algorithm to capture low-frequency components in the early-stage hidden layers and to represent high-frequency components in the late-stage hidden layers. Numerical experiments on a one-dimensional regression task and the MNIST classification task demonstrate that the proposed method outperforms the conventional initialization algorithms. This work clarifies the importance of intrinsic hierarchical features in learning neural networks, and the finding yields an effective parameter initialization strategy that enhances their training performance.

[491] ReflexGrad: A Dual-Process Architecture for Gradient-Free Inference-Time Learning

Ankush Kadu, Ashwanth Krishnan

Main category: cs.LG

TL;DR: ReflexGrad enables gradient-free inference-time learning in LLMs through textual policy refinement and causal diagnosis without weight updates, allowing adaptation during execution.

DetailsMotivation: Current approaches to extended reasoning in LLMs allocate more computation but remain static and cannot adapt from mistakes during execution. Online RL offers adaptation but requires expensive gradient updates at runtime with issues like catastrophic forgetting and instability.

Method: ReflexGrad uses a gradient-free framework with two complementary mechanisms: rapid policy refinement during forward progress and deliberate causal diagnosis when stuck, with intelligent routing between them. It optimizes a natural-language “policy” through textual feedback while keeping model weights frozen, analyzing action-outcome sequences to identify root causes and apply corrections within the same execution.

Result: Evaluated zero-shot across diverse interactive tasks without task-specific engineering, ReflexGrad achieves strong single-execution performance, demonstrating practical viability of gradient-free inference-time learning.

Conclusion: Gradient-free inference-time learning is practically viable, enabling genuine adaptation without retraining, weight updates, or demonstrations through ReflexGrad’s framework of policy refinement and causal diagnosis.

Abstract: Scaling inference-time compute has emerged as a powerful paradigm–yet deliberating longer is not the same as learning. Current approaches to extended reasoning in large language models allocate more computation to thinking but remain fundamentally static: they cannot adapt from mistakes encountered during execution. Online reinforcement learning offers adaptation but requires gradient updates at runtime–expensive, prone to catastrophic forgetting, and unstable in deployment. We introduce ReflexGrad, a gradient-free framework for genuine inference-time learning: adaptation without retraining, without weight updates, without demonstrations. Our key insight is that effective runtime learning requires two complementary mechanisms–rapid policy refinement during forward progress, and deliberate causal diagnosis when stuck–with intelligent routing between them. ReflexGrad implements this by optimizing a natural language “policy” through textual feedback while keeping model weights frozen. When failures occur, the system analyzes recent action-outcome sequences to identify root causes and immediately applies corrections within the same execution–eliminating the need for multiple trials. Evaluated zero-shot across diverse interactive tasks without task-specific engineering, ReflexGrad achieves strong single-execution performance, demonstrating that gradient-free inference-time learning is not just theoretically appealing but practically viable.

[492] RAGBoost: Efficient Retrieval-Augmented Generation with Accuracy-Preserving Context Reuse

Yinsicheng Jiang, Yeqi Huang, Liang Cheng, Cheng Deng, Xuan Sun, Luo Mai

Main category: cs.LG

TL;DR: RAGBoost is an efficient retrieval-augmented generation system that improves prefill performance through accuracy-preserving context reuse while maintaining reasoning quality.

DetailsMotivation: Existing RAG systems suffer from degraded prefill performance with longer inputs, and current caching techniques either have low cache reuse or degrade reasoning quality.

Method: Detects overlapping retrieved items across sessions and multi-turn interactions using efficient context indexing, ordering, and de-duplication, with lightweight contextual hints to maintain reasoning fidelity.
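
RAGBoost's index is more sophisticated, but the core reuse idea, canonical ordering plus de-duplication so that overlapping retrievals yield identical (and thus prefix-cacheable) contexts, can be sketched as follows; all names and the ranking scheme are hypothetical:

```python
def build_context(retrieved_ids, canonical_rank, seen=None):
    """Order retrieved items by a global canonical rank and drop items
    already present earlier in the conversation, so that sessions with
    overlapping retrievals produce identical context prefixes
    (maximizing prefix KV-cache hits in the inference engine)."""
    seen = set() if seen is None else seen
    fresh = [i for i in retrieved_ids if i not in seen]
    ordered = sorted(fresh, key=lambda i: canonical_rank[i])
    seen.update(ordered)
    return ordered, seen

rank = {"doc_a": 0, "doc_b": 1, "doc_c": 2, "doc_d": 3}
ctx1, seen = build_context(["doc_c", "doc_a"], rank)
ctx2, seen = build_context(["doc_a", "doc_d"], rank, seen)  # doc_a reused
print(ctx1, ctx2)  # ['doc_a', 'doc_c'] ['doc_d']
```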

Result: Improves prefill performance by 1.5-3X over state-of-the-art methods while preserving or enhancing reasoning accuracy across diverse RAG and agentic AI workloads.

Conclusion: RAGBoost achieves high cache reuse without sacrificing accuracy through context reuse, integrating seamlessly with existing LLM inference engines.

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with retrieved context but often suffers from downgraded prefill performance as modern applications demand longer and more complex inputs. Existing caching techniques either preserve accuracy with low cache reuse or improve reuse at the cost of degraded reasoning quality. We present RAGBoost, an efficient RAG system that achieves high cache reuse without sacrificing accuracy through accuracy-preserving context reuse. RAGBoost detects overlapping retrieved items across concurrent sessions and multi-turn interactions, using efficient context indexing, ordering, and de-duplication to maximize reuse, while lightweight contextual hints maintain reasoning fidelity. It integrates seamlessly with existing LLM inference engines and improves their prefill performance by 1.5-3X over state-of-the-art methods, while preserving or even enhancing reasoning accuracy across diverse RAG and agentic AI workloads. Our code is released at: https://github.com/Edinburgh-AgenticAI/RAGBoost.

[493] Free-Boundary Quasiconformal Maps via a Least-squares Operator in Diffeomorphism Optimization

Zhehao Xu, Lok Ming Lui

Main category: cs.LG

TL;DR: A method for free-boundary diffeomorphism optimization using least-squares quasiconformal formulation and a Spectral Beltrami Network surrogate for efficient differentiable optimization.

DetailsMotivation: Free-boundary diffeomorphism optimization is important in geometric modeling, computer graphics, and biological imaging, requiring simultaneous determination of target domains and locally bijective maps with controlled distortion. Current methods lack efficient differentiable formulations for large-scale optimization.

Method: Formulates the problem through least-squares quasiconformal (LSQC) operator, analyzes its properties, and introduces Spectral Beltrami Network (SBN) - a multiscale mesh-spectral surrogate that approximates LSQC solution in a single differentiable forward pass. This enables SBN-Opt framework for searching over Beltrami coefficients and pinning conditions.

Result: Establishes well-posedness, invariance properties, and stability of LSQC minimizer. SBN-Opt shows consistent improvements over traditional numerical algorithms in experiments on equiareal parameterization and inconsistent surface registration.

Conclusion: The differentiable LSQC formulation with SBN surrogate enables practical free-boundary diffeomorphism optimization at scale with explicit distortion control, outperforming traditional methods.

Abstract: Free-boundary diffeomorphism optimization, an important and widely occurring task in geometric modeling, computer graphics, and biological imaging, requires simultaneously determining a planar target domain and a locally bijective map with well-controlled distortion. We formulate this task through the least-squares quasiconformal (LSQC) operator and establish key structural properties of the LSQC minimizer, including well-posedness under mild conditions, invariance under similarity transformations, and resolution-independent behavior with stability under mesh refinement. We further analyze the sensitivity of the LSQC solution with respect to the Beltrami coefficient, establishing stability and differentiability properties that enable gradient-based optimization over the space of Beltrami coefficients. To make this differentiable formulation practical at scale and to facilitate the optimization process, we introduce the Spectral Beltrami Network (SBN), a multiscale mesh-spectral surrogate that approximates the LSQC solution operator in a single differentiable forward pass. This yields SBN-Opt, an optimization framework that searches over admissible Beltrami coefficients and pinning conditions to solve free-boundary diffeomorphism objectives with explicit distortion control. Extensive experiments on equiareal parameterization and inconsistent surface registration demonstrate consistent improvements over traditional numerical algorithms.

[494] How to Purchase Labels? A Cost-Effective Approach Using Active Learning Markets

Xiwen Huang, Pierre Pinson

Main category: cs.LG

TL;DR: Active learning markets for purchasing labels to improve model fitting, with optimization-based market clearing, budget constraints, and two active learning strategies tested on real estate and energy datasets.

DetailsMotivation: Analysts need to acquire additional labeled data to improve model performance, but existing proposals focus on purchasing features and examples rather than labels. There's a need for practical solutions to optimize label acquisition in resource-constrained environments.

Method: Formalizes market clearing as an optimization problem with budget constraints and improvement thresholds. Uses a single-buyer-multiple-seller setup with two active learning strategies, variance-based and query-by-committee-based, paired with distinct pricing mechanisms. Compares to baselines including random sampling and a greedy knapsack heuristic.
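
A hedged sketch of the variance-based strategy under a budget, using the spread of a random-forest ensemble as the uncertainty score and greedy value-per-cost purchasing (the data, prices, and the value-per-cost rule are illustrative assumptions, not the paper's clearing mechanism):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(-3, 3, size=(30, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(scale=0.1, size=30)
X_pool = rng.uniform(-3, 3, size=(200, 1))   # sellers' unlabeled points
prices = rng.uniform(0.5, 2.0, size=200)     # sellers' asking prices
budget = 10.0

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Variance across ensemble members as the uncertainty score per point.
per_tree = np.stack([t.predict(X_pool) for t in model.estimators_])
score = per_tree.var(axis=0)

# Buy the highest variance-per-cost labels until the budget runs out.
bought, spent = [], 0.0
for i in np.argsort(-score / prices):
    if spent + prices[i] <= budget:
        bought.append(i)
        spent += prices[i]
print(f"purchased {len(bought)} labels for {spent:.2f}")
```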

Result: Validated on real-world datasets from real estate pricing and energy forecasting. Proposed strategies demonstrate robustness, consistently achieving superior performance with fewer labels acquired compared to conventional methods.

Conclusion: The approach provides an easy-to-implement practical solution for optimizing data acquisition in resource-constrained environments, offering efficient label purchasing strategies for model improvement.

Abstract: We introduce and analyse active learning markets as a way to purchase labels, in situations where analysts aim to acquire additional data to improve model fitting, or to better train models for predictive analytics applications. This comes in contrast to the many proposals that already exist to purchase features and examples. By originally formalising the market clearing as an optimisation problem, we integrate budget constraints and improvement thresholds into the label acquisition process. We focus on a single-buyer-multiple-seller setup and propose the use of two active learning strategies (variance based and query-by-committee based), paired with distinct pricing mechanisms. They are compared to benchmark baselines including random sampling and a greedy knapsack heuristic. The proposed strategies are validated on real-world datasets from two critical application domains: real estate pricing and energy forecasting. Results demonstrate the robustness of our approach, consistently achieving superior performance with fewer labels acquired compared to conventional methods. Our proposal comprises an easy-to-implement practical solution for optimising data acquisition in resource-constrained environments.

[495] Multimodal Graph Neural Networks for Prognostic Modeling of Brain Network Reorganization

Preksha Girish, Rachana Mysore, Kiran K. N., Hiranmayee R., Shipra Prashanth, Shrey Kumar

Main category: cs.LG

TL;DR: A multimodal graph neural network framework that integrates structural MRI, DTI, and fMRI to model spatiotemporal brain network reorganization for predicting cognitive decline and neurological progression.

DetailsMotivation: Understanding dynamic brain network reorganization is critical for predicting cognitive decline, neurological progression, and individual variability in clinical outcomes. Current approaches need better integration of multimodal imaging data and mathematical rigor for deriving clinically meaningful biomarkers.

Method: Proposes a multimodal graph neural network framework integrating structural MRI, diffusion tensor imaging, and functional MRI. Brain regions are nodes, structural/functional connectivity are edges forming longitudinal brain graphs. Uses fractional stochastic differential operators within graph-based recurrent networks to capture temporal evolution, long-term dependencies, and stochastic fluctuations. Attention mechanisms fuse multimodal information and generate interpretable biomarkers.

Result: Experiments on longitudinal neuroimaging datasets demonstrate both predictive accuracy and interpretability. The framework generates biomarkers including network energy entropy, graph curvature, fractional memory indices, and modality-specific attention scores, combined into a composite prognostic index to quantify individual risk.

Conclusion: The results highlight the potential of mathematically rigorous, multimodal graph-based approaches for deriving clinically meaningful biomarkers from existing imaging data without requiring new data collection.

Abstract: Understanding the dynamic reorganization of brain networks is critical for predicting cognitive decline, neurological progression, and individual variability in clinical outcomes. This work proposes a multimodal graph neural network framework that integrates structural MRI, diffusion tensor imaging, and functional MRI to model spatiotemporal brain network reorganization. Brain regions are represented as nodes and structural and functional connectivity as edges, forming longitudinal brain graphs for each subject. Temporal evolution is captured via fractional stochastic differential operators embedded within graph-based recurrent networks, enabling the modeling of long-term dependencies and stochastic fluctuations in network dynamics. Attention mechanisms fuse multimodal information and generate interpretable biomarkers, including network energy entropy, graph curvature, fractional memory indices, and modality-specific attention scores. These biomarkers are combined into a composite prognostic index to quantify individual risk of network instability or cognitive decline. Experiments on longitudinal neuroimaging datasets demonstrate both predictive accuracy and interpretability. The results highlight the potential of mathematically rigorous, multimodal graph-based approaches for deriving clinically meaningful biomarkers from existing imaging data without requiring new data collection.

[496] Solving PDEs With Deep Neural Nets under General Boundary Conditions

Chenggong Zhang

Main category: cs.LG

TL;DR: Extending TENG framework to handle Dirichlet boundary conditions in PINNs using natural gradient optimization with Euler/Heun time-stepping for improved accuracy and stability in solving PDEs.

DetailsMotivation: Traditional numerical methods struggle with high-dimensional/complex PDEs, and PINNs face challenges with accuracy and complex boundary conditions. Need better methods to handle Dirichlet boundary constraints in physics-informed neural networks.

Method: Extends the Time-Evolving Natural Gradient (TENG) framework to address Dirichlet boundary conditions by integrating natural gradient optimization with numerical time-stepping schemes (Euler and Heun methods). Incorporates boundary-condition penalty terms into the loss function for precise enforcement of constraints.
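
Two ingredients named here are easy to sketch in isolation: a second-order Heun update and a loss with a Dirichlet penalty term. The penalty weight, toy data, and interior term below are assumptions, and TENG's natural-gradient machinery is omitted:

```python
import torch

def heun_step(f, u, dt):
    """Second-order Heun update (predictor-corrector) for du/dt = f(u)."""
    k1 = f(u)
    k2 = f(u + dt * k1)
    return u + 0.5 * dt * (k1 + k2)

def pinn_loss(model, x_int, u_int, x_bnd, u_bnd, lam=10.0):
    """Interior fitting term plus a Dirichlet penalty pinning the network
    to prescribed boundary values (lam is an assumed penalty weight)."""
    interior = ((model(x_int) - u_int) ** 2).mean()
    boundary = ((model(x_bnd) - u_bnd) ** 2).mean()
    return interior + lam * boundary

# Heun on the decay du/dt = -u: one step from u = 1.
print(heun_step(lambda u: -u, torch.ones(3), dt=0.1))  # ~exp(-0.1) = 0.905

# Dirichlet penalty on a toy 1D model with u(0) = u(1) = 0.
model = torch.nn.Linear(1, 1)
x_int, u_int = torch.rand(16, 1), torch.zeros(16, 1)
x_bnd, u_bnd = torch.tensor([[0.0], [1.0]]), torch.zeros(2, 1)
print(pinn_loss(model, x_int, u_int, x_bnd, u_bnd))
```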

Result: Experiments on the heat equation show the Heun method provides superior accuracy due to its second-order corrections, while the Euler method offers computational efficiency for simpler scenarios. The framework successfully enforces Dirichlet constraints.

Conclusion: Establishes a foundation for extending the framework to Neumann and mixed boundary conditions, as well as broader PDE classes, advancing neural network-based solvers for real-world problems.

Abstract: Partial Differential Equations (PDEs) are central to modeling complex systems across physical, biological, and engineering domains, yet traditional numerical methods often struggle with high-dimensional or complex problems. Physics-Informed Neural Networks (PINNs) have emerged as an efficient alternative by embedding physics-based constraints into deep learning frameworks, but they face challenges in achieving high accuracy and handling complex boundary conditions. In this work, we extend the Time-Evolving Natural Gradient (TENG) framework to address Dirichlet boundary conditions, integrating natural gradient optimization with numerical time-stepping schemes, including Euler and Heun methods, to ensure both stability and accuracy. By incorporating boundary condition penalty terms into the loss function, the proposed approach enables precise enforcement of Dirichlet constraints. Experiments on the heat equation demonstrate the superior accuracy of the Heun method due to its second-order corrections and the computational efficiency of the Euler method for simpler scenarios. This work establishes a foundation for extending the framework to Neumann and mixed boundary conditions, as well as broader classes of PDEs, advancing the applicability of neural network-based solvers for real-world problems.

[497] Knowledge-Guided Masked Autoencoder with Linear Spectral Mixing and Spectral-Angle-Aware Reconstruction

Abdul Matin, Rupasree Dey, Tanjim Bin Faruk, Shrideep Pallickara, Sangmi Lee Pallickara

Main category: cs.LG

TL;DR: Knowledge-guided ViT-based Masked Autoencoder that incorporates scientific domain knowledge (Linear Spectral Mixing Model and Spectral Angle Mapper) into self-supervised learning to improve interpretability, generalization, and data efficiency.

DetailsMotivation: To improve model interpretability, generalization, and data efficiency by integrating domain knowledge into deep learning, moving beyond purely data-driven optimization to incorporate physical constraints and structural relationships.

Method: Proposes a knowledge-guided ViT-based Masked Autoencoder that embeds the Linear Spectral Mixing Model (LSMM) as a physical constraint and uses physically-based Spectral Angle Mapper (SAM). The framework jointly optimizes LSMM and SAM loss with conventional Huber loss, promoting both numerical accuracy and geometric consistency.
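
The Spectral Angle Mapper term is a standard angle between spectra; the combined objective below is a hedged sketch (the LSMM residual form and loss weights are assumptions about how the pieces fit together, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def sam_loss(pred, target, eps=1e-8):
    """Spectral Angle Mapper: mean angle (radians) between predicted
    and reference spectra, insensitive to per-pixel scaling."""
    cos = F.cosine_similarity(pred, target, dim=-1).clamp(-1 + eps, 1 - eps)
    return torch.arccos(cos).mean()

def combined_loss(pred, target, endmembers, abundances, w_sam=0.1, w_mix=0.1):
    """Huber data term + SAM angle term + a Linear Spectral Mixing
    residual ||pred - abundances @ endmembers||^2 (weights assumed)."""
    huber = F.huber_loss(pred, target)
    mix = ((pred - abundances @ endmembers) ** 2).mean()
    return huber + w_sam * sam_loss(pred, target) + w_mix * mix

B, bands, m = 16, 32, 4
pred = torch.randn(B, bands)
target = torch.randn(B, bands)
endmembers = torch.randn(m, bands)
abundances = torch.softmax(torch.randn(B, m), dim=-1)  # sum-to-one mixing
print(combined_loss(pred, target, endmembers, abundances).item())
```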

Result: The model substantially enhances reconstruction quality and improves downstream task performance, demonstrating better reconstruction fidelity, stabilized training under limited supervision, and interpretable latent representations grounded in physical principles.

Conclusion: The work highlights the promise of embedding physics-informed inductive biases within transformer-based self-supervised learning, showing that domain knowledge integration improves model performance and interpretability.

Abstract: Integrating domain knowledge into deep learning has emerged as a promising direction for improving model interpretability, generalization, and data efficiency. In this work, we present a novel knowledge-guided ViT-based Masked Autoencoder that embeds scientific domain knowledge within the self-supervised reconstruction process. Instead of relying solely on data-driven optimization, our proposed approach incorporates the Linear Spectral Mixing Model (LSMM) as a physical constraint and physically-based Spectral Angle Mapper (SAM), ensuring that learned representations adhere to known structural relationships between observed signals and their latent components. The framework jointly optimizes LSMM and SAM loss with a conventional Huber loss objective, promoting both numerical accuracy and geometric consistency in the feature space. This knowledge-guided design enhances reconstruction fidelity, stabilizes training under limited supervision, and yields interpretable latent representations grounded in physical principles. The experimental findings indicate that the proposed model substantially enhances reconstruction quality and improves downstream task performance, highlighting the promise of embedding physics-informed inductive biases within transformer-based self-supervised learning.

[498] Scalable Formal Verification via Autoencoder Latent Space Abstraction

Robert Reed, Luca Laurenti, Morteza Lahijanian

Main category: cs.LG

TL;DR: A formal approach using convex autoencoders and kernel methods to reduce system dimensionality for verification, with guarantees that abstractions contain true system behaviors and verification results can be mapped back.

DetailsMotivation: Finite abstraction methods face scalability challenges for high-dimensional systems due to exponential state-space discretization growth. Learning-based approaches using neural networks show potential but lack formal correctness guarantees for verification results.

Method: Uses convex autoencoders for dimensionality reduction, learns dynamics in latent space via kernel-based methods, constructs finite abstraction from learned latent model, and guarantees abstraction contains true system behaviors.

Result: Demonstrates effectiveness on multiple systems including a 26D neural network-controlled system, showing significant scalability improvements without loss of verification rigor.

Conclusion: Provides a formal, scalable approach to system verification that combines learning-based dimensionality reduction with rigorous guarantees, enabling verification of high-dimensional systems previously intractable with traditional methods.

Abstract: Finite Abstraction methods provide a powerful formal framework for proving that systems satisfy their specifications. However, these techniques face scalability challenges for high-dimensional systems, as they rely on state-space discretization which grows exponentially with dimension. Learning-based approaches to dimensionality reduction, utilizing neural networks and autoencoders, have shown great potential to alleviate this problem. However, ensuring the correctness of the resulting verification results remains an open question. In this work, we provide a formal approach to reduce the dimensionality of systems via convex autoencoders and learn the dynamics in the latent space through a kernel-based method. We then construct a finite abstraction from the learned model in the latent space and guarantee that the abstraction contains the true behaviors of the original system. We show that the verification results in the latent space can be mapped back to the original system. Finally, we demonstrate the effectiveness of our approach on multiple systems, including a 26D system controlled by a neural network, showing significant scalability improvements without loss of rigor.

[499] ATLAS: Adaptive Topology-based Learning at Scale for Homophilic and Heterophilic Graphs

Turja Kundu, Sanjukta Bhowmick

Main category: cs.LG

TL;DR: ATLAS is a propagation-free graph learning framework that encodes graph structure through multi-resolution community features instead of message passing, achieving competitive performance on both homophilic and heterophilic graphs with improved scalability.

DetailsMotivation: Traditional GNNs struggle with heterophilic graphs where connected nodes don't share labels, and suffer from scalability issues due to iterative message passing that causes neighborhood expansion overhead.

Method: ATLAS uses modularity-guided adaptive search to identify informative community scales, one-hot encodes these communities, projects them into learnable embeddings, and concatenates with node attributes for MLP classification, enabling standard mini-batch training and adjacency-free inference.
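
A minimal sketch of the feature construction: detect communities at several resolutions, one-hot encode membership, and concatenate with node attributes for a downstream MLP. The fixed resolution grid below stands in for the paper's modularity-guided adaptive search, and the learnable embedding projection is omitted:

```python
import numpy as np
import networkx as nx

G = nx.karate_club_graph()
X = np.random.default_rng(0).normal(size=(G.number_of_nodes(), 8))  # attrs

# Multi-resolution community one-hot features (resolutions are assumed).
feats = [X]
for res in (0.5, 1.0, 2.0):
    comms = nx.community.louvain_communities(G, resolution=res, seed=0)
    onehot = np.zeros((G.number_of_nodes(), len(comms)))
    for c, members in enumerate(comms):
        for v in members:
            onehot[v, c] = 1.0
    feats.append(onehot)

# MLP input: attributes + structural encodings; no message passing needed.
Z = np.concatenate(feats, axis=1)
print("node feature dim with community encodings:", Z.shape[1])
```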

Result: Across 13 benchmarks including million-node graphs, ATLAS achieves competitive or superior accuracy, with up to 20-point gains over GCN on heterophilic datasets and 12-point gains over MLPs on homophilic graphs.

Conclusion: By treating topology as explicit features rather than relying on message passing, ATLAS adapts intelligently to graph structure, provides scalable performance, and offers interpretable structural insights while avoiding propagation when structure is misleading.

Abstract: Graph neural networks (GNNs) excel on homophilic graphs where connected nodes share labels, but struggle with heterophilic graphs where edges do not imply similarity. Moreover, iterative message passing limits scalability due to neighborhood expansion overhead. We introduce ATLAS (Adaptive Topology-based Learning at Scale), a propagation-free framework that encodes graph structure through multi-resolution community features rather than message passing. We first prove that community refinement involves a fundamental trade-off: finer partitions increase label-community mutual information but also increase entropy. We formalize when refinement improves normalized mutual information, explaining why intermediate granularities are often most predictive. ATLAS employs modularity-guided adaptive search to automatically identify informative community scales, which are one-hot encoded, projected into learnable embeddings, and concatenated with node attributes for MLP classification. This enables standard mini-batch training and adjacency-free inference after one-time preprocessing. Across 13 benchmarks including million-node graphs, ATLAS achieves competitive or superior accuracy, up to 20-point gains over GCN on heterophilic datasets and 12-point gains over MLPs on homophilic graphs. By treating topology as explicit features, ATLAS adapts intelligently: leveraging structure when informative, remaining robust when weakly aligned, and avoiding propagation when structure misleads, providing both scalable performance and interpretable structural insights.

[500] Predictive Modeling of Power Outages during Extreme Events: Integrating Weather and Socio-Economic Factors

Nina Fatehi, Antar Kumar Biswas, Masoud H. Nazari

Main category: cs.LG

TL;DR: A machine learning framework for predicting power outages from extreme events using EAGLE-I outage data combined with weather, socioeconomic, infrastructure, and seasonal features, with LSTM achieving the best performance among tested models.

DetailsMotivation: To develop a predictive framework for low-probability, high-consequence power outage scenarios caused by extreme events, addressing the need for better understanding of outage risks and community vulnerability patterns.

Method: Integrates EAGLE-I outage records (2014-2024) with weather, socioeconomic, infrastructure, and seasonal event data. Evaluates four ML models: Random Forest, Graph Neural Network, AdaBoost, and LSTM on Michigan county data.

Result: LSTM network achieves the highest accuracy among all tested models for predicting power outages in extreme event scenarios.

Conclusion: The proposed learning framework effectively predicts power outages from extreme events, with LSTM showing superior performance, and reveals important patterns of community vulnerability through social and demographic indicators.

Abstract: This paper presents a novel learning based framework for predicting power outages caused by extreme events. The proposed approach targets low-probability high-consequence outage scenarios and leverages a comprehensive set of features derived from publicly available data sources. We integrate EAGLE-I outage records from 2014 to 2024 with weather, socioeconomic, infrastructure, and seasonal event data. Incorporating social and demographic indicators reveals patterns of community vulnerability and improves understanding of outage risk during extreme conditions. Four machine learning models are evaluated, including Random Forest (RF), Graph Neural Network (GNN), Adaptive Boosting (AdaBoost), and Long Short-Term Memory (LSTM). Experimental validation is performed on a large-scale dataset covering counties in the lower peninsula of Michigan. Among all models tested, the LSTM network achieves higher accuracy.

[501] A Simple, Optimal and Efficient Algorithm for Online Exp-Concave Optimization

Yi-Han Wang, Peng Zhao, Zhi-Hua Zhou

Main category: cs.LG

TL;DR: LightONS: A computationally efficient variant of Online Newton Step for online exp-concave optimization that reduces runtime from O(d^ωT) to O(d^2T + d^ω√T log T) while maintaining optimal O(d log T) regret.

DetailsMotivation: The standard Online Newton Step (ONS) algorithm for online exp-concave optimization suffers from computational bottlenecks due to expensive Mahalanobis projections at each round, costing Ω(d^ω) operations even for simple domains like the unit ball, leading to total runtime of O(d^ωT).

Method: LightONS uses domain-conversion techniques from parameter-free online learning and defers expensive Mahalanobis projections until necessary, preserving ONS structure while reducing computational overhead.
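
For reference, the classical ONS update that LightONS preserves, with the inverse matrix maintained by a Sherman-Morrison rank-one update; the expensive Mahalanobis projection onto the domain is exactly the step LightONS defers, and is omitted in this unconstrained toy sketch (gamma and the loss are assumptions):

```python
import numpy as np

def ons_step(x, grad, A_inv, gamma):
    """One Online Newton Step: rank-one update of the Hessian proxy's
    inverse via Sherman-Morrison, then a Newton-like move. The costly
    Mahalanobis projection back onto the domain is omitted here."""
    Au = A_inv @ grad
    A_inv = A_inv - np.outer(Au, Au) / (1.0 + grad @ Au)
    x = x - (1.0 / gamma) * (A_inv @ grad)
    return x, A_inv

d, gamma = 5, 0.5
x, A_inv = np.zeros(d), np.eye(d)
rng = np.random.default_rng(0)
target = rng.normal(size=d)
for _ in range(200):
    grad = x - target            # gradient of 0.5 * ||x - target||^2
    x, A_inv = ons_step(x, grad, A_inv, gamma)
print("distance to optimum:", np.linalg.norm(x - target))
```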

Result: LightONS achieves optimal O(d log T) regret with significantly reduced runtime of O(d^2T + d^ω√T log T), and enables stochastic exp-concave optimization with runtime O(d^3/ε), solving an open problem from Koren [2013].

Conclusion: LightONS provides an efficient plug-in replacement for ONS that maintains optimal regret guarantees while dramatically improving computational efficiency, with applications to gradient-norm adaptivity, parametric stochastic bandits, and memory-efficient online exp-concave optimization.

Abstract: Online eXp-concave Optimization (OXO) is a fundamental problem in online learning, where the goal is to minimize regret when loss functions are exponentially concave. The standard algorithm, Online Newton Step (ONS), guarantees an optimal $O(d \log T)$ regret, where $d$ is the dimension and $T$ is the time horizon. Despite its simplicity, ONS may face a computational bottleneck due to the Mahalanobis projection at each round. This step costs $\Omega(d^\omega)$ arithmetic operations for bounded domains, even for simple domains such as the unit ball, where $\omega \in (2,3]$ is the matrix-multiplication exponent. As a result, the total runtime can reach $\tilde{O}(d^\omega T)$, particularly when iterates frequently oscillate near the domain boundary. This paper proposes a simple variant of ONS, called LightONS, which reduces the total runtime to $O(d^2 T + d^\omega \sqrt{T \log T})$ while preserving the optimal regret. Deploying LightONS with the online-to-batch conversion implies a method for stochastic exp-concave optimization with runtime $\tilde{O}(d^3/\epsilon)$, thereby answering an open problem posed by Koren [2013]. The design leverages domain-conversion techniques from parameter-free online learning and defers expensive Mahalanobis projections until necessary, thereby preserving the elegant structure of ONS and enabling LightONS to act as an efficient plug-in replacement in broader scenarios, including gradient-norm adaptivity, parametric stochastic bandits, and memory-efficient OXO.

[502] Discount Model Search for Quality Diversity Optimization in High-Dimensional Measure Spaces

Bryon Tjanaka, Henry Chen, Matthew C. Fontaine, Stefanos Nikolaidis

Main category: cs.LG

TL;DR: DMS (Discount Model Search) is a new QD algorithm that uses a continuous model of discount values instead of histograms to handle high-dimensional measure spaces, enabling applications like image-based measure specification.

DetailsMotivation: Current QD algorithms struggle with high-dimensional measure spaces due to distortion issues where many solutions map to similar measures. Existing methods like CMA-MAE use histograms that cause stagnation in high-dimensional spaces because similar solutions fall into the same histogram cell.

Method: Proposes Discount Model Search (DMS) which guides exploration with a smooth, continuous model of discount values instead of discrete histograms. This allows DMS to distinguish between solutions with similar measures in high-dimensional spaces.

Result: DMS outperforms CMA-MAE and other black-box QD algorithms on high-dimensional benchmarks. Enables new capabilities like using image datasets as measure specifications rather than hand-designed measure functions.

Conclusion: DMS addresses limitations of existing QD algorithms in high-dimensional measure spaces by using continuous discount models, enabling applications in domains with image-based measures and improving performance on high-dimensional benchmarks.

Abstract: Quality diversity (QD) optimization searches for a collection of solutions that optimize an objective while attaining diverse outputs of a user-specified, vector-valued measure function. Contemporary QD algorithms are typically limited to low-dimensional measures because high-dimensional measures are prone to distortion, where many solutions found by the QD algorithm map to similar measures. For example, the state-of-the-art CMA-MAE algorithm guides measure space exploration with a histogram in measure space that records so-called discount values. However, CMA-MAE stagnates in domains with high-dimensional measure spaces because solutions with similar measures fall into the same histogram cell and hence receive the same discount value. To address these limitations, we propose Discount Model Search (DMS), which guides exploration with a model that provides a smooth, continuous representation of discount values. In high-dimensional measure spaces, this model enables DMS to distinguish between solutions with similar measures and thus continue exploration. We show that DMS facilitates new capabilities for QD algorithms by introducing two new domains where the measure space is the high-dimensional space of images, which enables users to specify their desired measures by providing a dataset of images rather than hand-designing the measure function. Results in these domains and on high-dimensional benchmarks show that DMS outperforms CMA-MAE and other existing black-box QD algorithms.

[503] When Should We Introduce Safety Interventions During Pretraining?

Dylan Sam, Sachin Goyal, Pratyush Maini, Alexander Robey, J. Zico Kolter

Main category: cs.LG

TL;DR: Safety interventions during pretraining should be timed strategically - starting after 20-60% of pretraining yields the best robustness with standard decoding, while starting from the beginning improves steerability with safety-aware inference.

DetailsMotivation: Prior work shows safety interventions during pretraining improve model robustness, but overlooks the crucial question of when during pretraining these interventions should be introduced.

Method: Kept data sources and pretraining interventions fixed while varying intervention start time (after 0%, 20%, or 60% of pretraining tokens), then evaluated robustness, steerability, and internal representations.

Result: Optimal timing depends on the use case: for standard decoding, interventions introduced after 20-60% of pretraining yield the strongest robustness; for safety-aware inference, starting from the beginning improves steerability; earlier interventions create cleaner separation in internal representations.

Conclusion: Intervention timing is a key curriculum design choice for safety, with different optimal timings for different deployment scenarios, establishing timing as a critical factor in safety intervention design.

Abstract: Prior work has shown that safety interventions applied during pretraining, such as removing and rephrasing harmful content, can substantially improve the robustness of the resulting models. In this paper, we study the fundamental question that prior work has overlooked: “When during pretraining should safety interventions be introduced?” We keep the underlying data sources and pretraining interventions fixed, varying the intervention start time (after 0%, 20%, or 60% of pretraining tokens). We find that the optimal start time is not one-size-fits-all: with standard top-k decoding, introducing interventions after a short initial phase of safe-only pretraining (20%-60%) often yields the strongest robustness, with the clearest benefits emerging after downstream, benign finetuning. In contrast, for safety-aware inference, interventions starting from the beginning improve steerability towards safer generations. Finally, we observe that earlier interventions reshape internal representations: linear probes more cleanly separate safe vs harmful examples. Our results are the first to establish intervention timing as a key curriculum design choice for safety.

[504] Patch-Level Tokenization with CNN Encoders and Attention for Improved Transformer Time-Series Forecasting

Saurish Nagrath, Saroj Kumar Panigrahy

Main category: cs.LG

TL;DR: Two-stage framework for multivariate time-series forecasting: CNN extracts local temporal patterns from patches, then Transformer models global dependencies between patches.

DetailsMotivation: Transformer models need quality input representations for time-series forecasting, especially with long sequences and large datasets. Current approaches may not optimally separate local temporal learning from global dependency modeling.

Method: Two-stage approach: 1) CNN extracts short-range temporal dynamics from fixed-length patches, producing patch-level token embeddings with self-attention refinement. 2) Transformer encoder models inter-patch temporal dependencies for forecasting.
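
A minimal PyTorch sketch of the two stages with assumed dimensions (patch length, model width, pooling choice, and the last-token forecasting head are all illustrative, and the token-level self-attention refinement is folded into the encoder):

```python
import torch
import torch.nn as nn

class PatchCNNTransformer(nn.Module):
    """Stage 1: a CNN turns each fixed-length patch into a token.
    Stage 2: a Transformer encoder models inter-patch dependencies."""
    def __init__(self, n_vars=4, patch=16, d_model=64, horizon=8):
        super().__init__()
        self.patch, self.horizon = patch, horizon
        self.tokenizer = nn.Sequential(              # local patterns
            nn.Conv1d(n_vars, d_model, kernel_size=3, padding=1),
            nn.GELU(),
            nn.AdaptiveAvgPool1d(1),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # global
        self.head = nn.Linear(d_model, horizon * n_vars)

    def forward(self, x):                            # x: (B, T, n_vars)
        B, T, V = x.shape
        patches = x.unfold(1, self.patch, self.patch)  # (B, N, V, patch)
        N = patches.shape[1]
        tokens = self.tokenizer(patches.reshape(B * N, V, self.patch))
        tokens = tokens.squeeze(-1).reshape(B, N, -1)  # (B, N, d_model)
        h = self.encoder(tokens)[:, -1]                # last patch token
        return self.head(h).reshape(B, self.horizon, V)

model = PatchCNNTransformer()
print(model(torch.randn(2, 64, 4)).shape)  # torch.Size([2, 8, 4])
```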

Result: Outperforms convolutional baseline with increased temporal context, remains competitive with patch-based Transformer. Shows structured patch-level tokenization is scalable and effective for long input sequences.

Conclusion: Separating local representation learning from global dependency modeling provides effective, scalable approach for multivariate time-series forecasting with long sequences.

Abstract: Transformer-based models have shown strong performance in time-series forecasting by leveraging self-attention to model long-range temporal dependencies. However, their effectiveness depends critically on the quality and structure of input representations derived from raw multivariate time-series data, particularly as sequence length and data scale increase. This paper proposes a two-stage forecasting framework that explicitly separates local temporal representation learning from global dependency modelling. In the proposed approach, a convolutional neural network operates on fixed-length temporal patches to extract short-range temporal dynamics and non-linear feature interactions, producing compact patch-level token embeddings. Token-level self-attention is applied during representation learning to refine these embeddings, after which a Transformer encoder models inter-patch temporal dependencies to generate forecasts. The method is evaluated on a synthetic multivariate time-series dataset with controlled static and dynamic factors, using an extended sequence length and a larger number of samples. Experimental results demonstrate that the proposed framework consistently outperforms a convolutional baseline under increased temporal context and remains competitive with a strong patch-based Transformer model. These findings indicate that structured patch-level tokenization provides a scalable and effective representation for multivariate time-series forecasting, particularly when longer input sequences are considered.

[505] Detecting and Mitigating Memorization in Diffusion Models through Anisotropy of the Log-Probability

Rohan Asthana, Vasileios Belagiannis

Main category: cs.LG

TL;DR: Proposes a new memorization detection method for diffusion models that combines isotropic norm and anisotropic alignment metrics, enabling faster detection without full denoising steps.

DetailsMotivation: Current memorization detection methods for diffusion models rely on norm-based metrics that work well only under isotropic assumptions, which hold at high and medium noise levels, but fail in the anisotropic low-noise regime, where memorized samples are instead marked by strong angular alignment between the guidance vector and the unconditional score.

Method: Develops a detection metric integrating both isotropic norm and anisotropic alignment (angular alignment between guidance vector and unconditional scores). The method computes metrics directly on pure noise inputs via two forward passes (conditional and unconditional), avoiding costly denoising steps.
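
A hedged sketch of such a metric from the two forward passes: the guidance vector's norm (isotropic cue) plus its angular alignment with the unconditional score (anisotropic cue). The combination rule and weight are assumptions, not the paper's exact metric:

```python
import torch
import torch.nn.functional as F

def memorization_score(eps_cond, eps_uncond, alpha=1.0):
    """Combine the isotropic cue (norm of the text-guidance vector) with
    the anisotropic cue (angular alignment between that vector and the
    unconditional score). Both come from two forward passes on the same
    pure-noise input; alpha is an assumed combination weight."""
    guidance = eps_cond - eps_uncond
    norm_term = guidance.flatten(1).norm(dim=1)
    align = F.cosine_similarity(guidance.flatten(1),
                                eps_uncond.flatten(1), dim=1)
    return norm_term + alpha * align.abs()

# Toy stand-ins for the two UNet outputs on a batch of noise images.
eps_c = torch.randn(4, 3, 8, 8)
eps_u = torch.randn(4, 3, 8, 8)
print(memorization_score(eps_c, eps_u))
```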

Result: Outperforms existing denoising-free detection methods on Stable Diffusion v1.4 and v2, while being at least 5x faster than previous best approaches. Also demonstrates effective mitigation by adapting memorized prompts based on the detection metric.

Conclusion: The proposed memorization detection approach effectively addresses limitations of existing methods by considering both isotropic and anisotropic regimes, enabling faster and more accurate detection without requiring full denoising processes.

Abstract: Diffusion-based image generative models produce high-fidelity images through iterative denoising but remain vulnerable to memorization, where they unintentionally reproduce exact copies or parts of training images. Recent memorization detection methods are primarily based on the norm of score difference as indicators of memorization. We prove that such norm-based metrics are mainly effective under the assumption of isotropic log-probability distributions, which generally holds at high or medium noise levels. In contrast, analyzing the anisotropic regime reveals that memorized samples exhibit strong angular alignment between the guidance vector and unconditional scores in the low-noise setting. Through these insights, we develop a memorization detection metric by integrating isotropic norm and anisotropic alignment. Our detection metric can be computed directly on pure noise inputs via two conditional and unconditional forward passes, eliminating the need for costly denoising steps. Detection experiments on Stable Diffusion v1.4 and v2 show that our metric outperforms existing denoising-free detection methods while being at least approximately 5x faster than the previous best approach. Finally, we demonstrate the effectiveness of our approach by utilizing a mitigation strategy that adapts memorized prompts based on our developed metric. The code is available at https://github.com/rohanasthana/memorization-anisotropy .

[506] Differentiable Logic Synthesis: Spectral Coefficient Selection via Sinkhorn-Constrained Composition

Gorgi Pavlov

Main category: cs.LG

TL;DR: Hierarchical Spectral Composition: A differentiable architecture for learning precise Boolean logic via spectral synthesis with Sinkhorn-constrained routing and column-sign modulation, achieving hardware-efficient neuro-symbolic logic synthesis.

DetailsMotivation: Learning precise Boolean logic via gradient descent is challenging as neural networks typically converge to "fuzzy" approximations that degrade under quantization. There's a need for differentiable architectures that can synthesize exact Boolean functions suitable for hardware implementation.

Method: Hierarchical Spectral Composition selects spectral coefficients from a frozen Boolean Fourier basis and composes them via Sinkhorn-constrained routing with column-sign modulation. The approach adapts Manifold-Constrained Hyper-Connections framework to logic synthesis, adding column-sign modulation to enable Boolean negation. Validation progresses through four complexity phases with different synthesis methods including gradient descent, exhaustive enumeration, and MCMC refinement with parallel tempering.
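
Sinkhorn-constrained routing with column-sign modulation reduces to alternating row/column normalization plus a ±1 per column; a minimal NumPy sketch follows (the iteration count, matrix size, and the coefficient composition at the end are assumptions):

```python
import numpy as np

def sinkhorn(logits, n_iters=20):
    """Approximate projection onto doubly stochastic matrices (the
    Birkhoff polytope) by alternately normalizing rows and columns
    of exp(logits)."""
    M = np.exp(logits)
    for _ in range(n_iters):
        M /= M.sum(axis=1, keepdims=True)   # rows sum to 1
        M /= M.sum(axis=0, keepdims=True)   # columns sum to 1
    return M

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 4))
signs = np.sign(rng.normal(size=4))         # column-sign modulation (+-1),
route = sinkhorn(logits) * signs[None, :]   # enabling Boolean negation

# Stand-in spectral coefficients composed through the signed routing.
basis_coeffs = rng.normal(size=4)
print("composed coefficients:", route.T @ basis_coeffs)
```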

Result: Achieved 100% accuracy for n=2 Boolean operations with zero routing drift and zero-loss quantization to ternary masks. For n=3, gradient descent achieved 76% accuracy but exhaustive enumeration proved optimal ternary masks exist for all operations (100% accuracy, 39% sparsity). For n=4, spectral synthesis achieved 100% accuracy on all operations. All operations enable single-cycle combinational logic inference at 10,959 MOps/s on GPU.

Conclusion: Ternary polynomial threshold representations exist for all tested Boolean functions, but finding them requires methods beyond pure gradient descent as dimensionality grows. The approach demonstrates viability for hardware-efficient neuro-symbolic logic synthesis with efficient GPU implementation.

Abstract: Learning precise Boolean logic via gradient descent remains challenging: neural networks typically converge to “fuzzy” approximations that degrade under quantization. We introduce Hierarchical Spectral Composition, a differentiable architecture that selects spectral coefficients from a frozen Boolean Fourier basis and composes them via Sinkhorn-constrained routing with column-sign modulation. Our approach draws on recent insights from Manifold-Constrained Hyper-Connections (mHC), which demonstrated that projecting routing matrices onto the Birkhoff polytope preserves identity mappings and stabilizes large-scale training. We adapt this framework to logic synthesis, adding column-sign modulation to enable Boolean negation – a capability absent in standard doubly stochastic routing. We validate our approach across four phases of increasing complexity: (1) For n=2 (16 Boolean operations over 4-dim basis), gradient descent achieves 100% accuracy with zero routing drift and zero-loss quantization to ternary masks. (2) For n=3 (10 three-variable operations), gradient descent achieves 76% accuracy, but exhaustive enumeration over 3^8 = 6561 configurations proves that optimal ternary masks exist for all operations (100% accuracy, 39% sparsity). (3) For n=4 (10 four-variable operations over 16-dim basis), spectral synthesis – combining exact Walsh-Hadamard coefficients, ternary quantization, and MCMC refinement with parallel tempering – achieves 100% accuracy on all operations. This progression establishes (a) that ternary polynomial threshold representations exist for all tested functions, and (b) that finding them requires methods beyond pure gradient descent as dimensionality grows. All operations enable single-cycle combinational logic inference at 10,959 MOps/s on GPU, demonstrating viability for hardware-efficient neuro-symbolic logic synthesis.

[507] AGZO: Activation-Guided Zeroth-Order Optimization for LLM Fine-Tuning

Wei Lin, Yining Jiang, Qingyu Song, Qiao Xiang, Hong Xu

Main category: cs.LG

TL;DR: AGZO is a zeroth-order optimization method for fine-tuning LLMs that uses activation structure to guide perturbations, reducing memory usage while improving performance compared to isotropic ZO methods.

DetailsMotivation: Existing ZO optimization methods for LLM fine-tuning use isotropic perturbations that ignore valuable structural information from activations during forward passes, leading to suboptimal performance despite memory efficiency.

Method: AGZO extracts activation-informed subspaces during forward passes and restricts perturbations to these low-rank subspaces, leveraging the insight that gradients of linear layers are confined to input activation subspaces.
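
As a rough illustration of the core insight (the gradient of a linear layer lies in the span of its input activations), the sketch below restricts a two-point zeroth-order probe to an SVD basis of the captured activations. The rank r, step sizes, and loss closure are assumptions, not the paper's algorithm.

```python
import torch

def agzo_step(W, X, loss_fn, r=8, eps=1e-3, lr=1e-4):
    """One subspace-restricted ZO update for a weight W (d_out x d_in).

    X: input activations (batch x d_in) captured during the forward pass.
    loss_fn: closure evaluating the training loss at a candidate weight.
    """
    with torch.no_grad():
        # Orthonormal basis of the activation subspace (top-r right singular vectors)
        U = torch.linalg.svd(X, full_matrices=False).Vh[:r]      # (r, d_in)
        Z = torch.randn(W.shape[0], r, device=W.device) @ U      # perturbation in span(X)
        g = (loss_fn(W + eps * Z) - loss_fn(W - eps * Z)) / (2 * eps)
        W -= lr * g * Z                                          # two-point estimate
    return W
```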

Result: AGZO outperforms state-of-the-art ZO baselines on Qwen3 and Pangu models across various benchmarks, significantly narrowing the performance gap with first-order fine-tuning while maintaining similar memory footprint.

Conclusion: Activation-guided ZO optimization provides a more effective approach for memory-constrained LLM fine-tuning by incorporating structural information from forward passes, achieving better performance than isotropic methods.

Abstract: Zeroth-Order (ZO) optimization has emerged as a promising solution for fine-tuning LLMs under strict memory constraints, as it avoids the prohibitive memory cost of storing activations for backpropagation. However, existing ZO methods typically employ isotropic perturbations, neglecting the rich structural information available during the forward pass. In this paper, we identify a crucial link between gradient formation and activation structure: the gradient of a linear layer is confined to the subspace spanned by its input activations. Leveraging this insight, we propose Activation-Guided Zeroth-Order optimization (AGZO). Unlike prior methods, AGZO extracts a compact, activation-informed subspace on the fly during the forward pass and restricts perturbations to this low-rank subspace. We provide a theoretical framework showing that AGZO optimizes a subspace-smoothed objective and provably yields update directions with higher cosine similarity to the true gradient than isotropic baselines. Empirically, we evaluate AGZO on Qwen3 and Pangu models across various benchmarks. AGZO consistently outperforms state-of-the-art ZO baselines and significantly narrows the performance gap with first-order fine-tuning, while maintaining almost the same peak memory footprint as other ZO methods.

[508] TwinWeaver: An LLM-Based Foundation Model Framework for Pan-Cancer Digital Twins

Nikita Makarov, Maria Bordukova, Lena Voith von Voithenberg, Estrella Pivel-Villanueva, Sabrina Mielke, Jonathan Wickes, Hanchen Wang, Mingyu Derek Ma, Keunwoo Choi, Kyunghyun Cho, Stephen Ra, Raul Rodriguez-Esteban, Fabian Schmich, Michael Menden

Main category: cs.LG

TL;DR: TwinWeaver framework serializes patient histories into text for LLM-based clinical event prediction, achieving state-of-the-art performance in cancer patient forecasting.

DetailsMotivation: Precision oncology needs better forecasting of clinical events from sparse, multi-modal time series data, which current methods struggle with.

Method: TwinWeaver serializes longitudinal patient histories into text, enabling unified event prediction with large language models; it is used to build Genie Digital Twin (GDT) on 93,054 cancer patients across 20 cancer types.
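
The serialization step is easy to picture: a longitudinal record becomes a plain-text event log that an LLM can continue. The field names and template below are illustrative assumptions, not the paper's actual schema.

```python
# Hypothetical event records; real clinical schemas are far richer.
events = [
    {"day": 0,  "kind": "diagnosis", "value": "NSCLC stage IV"},
    {"day": 14, "kind": "lab",       "value": "hemoglobin 11.2 g/dL"},
    {"day": 30, "kind": "therapy",   "value": "started chemotherapy"},
]

def serialize(events, horizon_day=60):
    lines = [f"day {e['day']}: {e['kind']} = {e['value']}" for e in events]
    return "\n".join(lines) + f"\nday {horizon_day}: "  # LLM completes the next event

print(serialize(events))
```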

Result: Significantly reduces forecasting error (MASE 0.87 vs 0.97 baseline), improves risk stratification (C-index 0.703 vs 0.662), and generalizes well to out-of-distribution clinical trials.

Conclusion: TwinWeaver provides a scalable, interpretable framework for clinical modeling that outperforms traditional time-series methods and enables transparent clinical reasoning.

Abstract: Precision oncology requires forecasting clinical events and trajectories, yet modeling sparse, multi-modal clinical time series remains a critical challenge. We introduce TwinWeaver, an open-source framework that serializes longitudinal patient histories into text, enabling unified event prediction as well as forecasting with large language models, and use it to build Genie Digital Twin (GDT) on 93,054 patients across 20 cancer types. In benchmarks, GDT significantly reduces forecasting error, achieving a median Mean Absolute Scaled Error (MASE) of 0.87 compared to 0.97 for the strongest time-series baseline (p<0.001). Furthermore, GDT improves risk stratification, achieving an average concordance index (C-index) of 0.703 across survival, progression, and therapy switching tasks, surpassing the best baseline of 0.662. GDT also generalizes to out-of-distribution clinical trials, matching trained baselines at zero-shot and surpassing them with fine-tuning, achieving a median MASE of 0.75-0.88 and outperforming the strongest baseline in event prediction with an average C-index of 0.672 versus 0.648. Finally, TwinWeaver enables an interpretable clinical reasoning extension, providing a scalable and transparent foundation for longitudinal clinical modeling.

[509] Toward Ultra-Long-Horizon Sequential Model Editing

Mingda Liu, Zhenghan Zhu, Ze’an Miao, Katsuki Fujisawa

Main category: cs.LG

TL;DR: Norm-Anchor Scaling (NAS) prevents catastrophic model collapse in sequential model editing by constraining MLP weight norm growth, extending editing capacity 4x with minimal overhead.

DetailsMotivation: Sequential model editing using Locate-and-Edit methods suffers from catastrophic model collapse beyond a critical number of edits, limiting practical deployment for correcting factual errors in LLMs.

Method: Proposes Norm-Anchor Scaling (NAS), a plug-and-play norm-constrained strategy that identifies correlation between collapse and explosive MLP weight norm growth, then applies explicit norm control during editing updates.
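
The paper reports that the fix amounts to a single extra line in the edit loop. A hedged guess at what such a norm anchor can look like, rescaling the edited weight back toward a reference Frobenius norm (the exact anchoring rule is an assumption):

```python
import torch

def edit_with_norm_anchor(W, delta, anchor_norm):
    W = W + delta                          # the usual locate-and-edit update
    W = W * (anchor_norm / W.norm())       # the extra line: re-anchor the weight norm
    return W

W = torch.randn(64, 64)
anchor = W.norm()                          # e.g., the pre-editing norm
for _ in range(100):                       # repeated edits no longer inflate ||W||
    W = edit_with_norm_anchor(W, 0.1 * torch.randn_like(W), anchor)
print(W.norm().item(), anchor.item())
```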

Result: NAS delays collapse point by more than 4 times, achieves 72.2% average relative gain in editing performance, requires only one additional line of code, and has negligible computational overhead.

Conclusion: Explicit norm control is essential for stable sequential model editing, and NAS provides a simple yet effective solution to prevent catastrophic collapse in Locate-and-Edit frameworks.

Abstract: Model editing has emerged as a practical approach for mitigating factual errors and outdated knowledge in large language models (LLMs). Among existing methods, the Locate-and-Edit (L&E) paradigm is the dominant framework: it locates MLP parameters implicated in expressing a target fact, and then performs a localized update to rewrite that fact. However, long sequences of edits often trigger abrupt model collapse in L&E beyond a critical point. We empirically identify a strong correlation between collapse and explosive growth of edited MLP weight norms, and formally prove that commonly used L&E update rules can induce exponential norm growth across sequential edits in the absence of explicit norm control. To address this issue, we propose Norm-Anchor Scaling (NAS), a plug-and-play norm-constrained strategy. Across extensive experiments, NAS delays the collapse point of representative L&E algorithms by more than 4 times and yields a 72.2% average relative gain in editing performance, requiring only a single additional line of code and incurring negligible computational overhead.

[510] Non-Intrusive Graph-Based Bot Detection for E-Commerce Using Inductive Graph Neural Networks

Sichen Zhao, Zhiming Xue, Yalun Qi, Xianling Zeng, Zihan Yu

Main category: cs.LG

TL;DR: Graph-based bot detection framework for e-commerce using inductive graph neural networks to identify automated behavior without intrusive methods.

DetailsMotivation: Traditional bot mitigation techniques (IP blacklists, CAPTCHAs) are increasingly ineffective against modern bots using proxies, botnets, and AI-assisted evasion strategies, requiring more sophisticated detection methods.

Method: Non-intrusive graph-based framework that models user session behavior through graph representation and applies inductive graph neural networks for classification, capturing both relational structure and behavioral semantics.
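
A minimal inductive classifier in the spirit described above, using GraphSAGE-style neighborhood aggregation (assuming torch_geometric is available). The session-graph construction and feature set are domain-specific and omitted; this is a sketch, not the paper's model.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv   # inductive: aggregates neighbor features

class BotDetector(torch.nn.Module):
    def __init__(self, in_dim, hidden=64):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden)
        self.conv2 = SAGEConv(hidden, 2)   # bot / human logits per session node

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        return self.conv2(h, edge_index)   # generalizes to unseen sessions/URLs
```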

Result: Outperforms session-level multilayer perceptron baseline in AUC and F1 score on real-world e-commerce traffic; remains robust under adversarial perturbations and generalizes effectively to unseen sessions and URLs.

Conclusion: The framework is deployment-friendly, integrates with existing systems without client-side instrumentation, supports real-time inference and incremental updates, making it suitable for practical e-commerce security deployments.

Abstract: Malicious bots pose a growing threat to e-commerce platforms by scraping data, hoarding inventory, and perpetrating fraud. Traditional bot mitigation techniques, including IP blacklists and CAPTCHA-based challenges, are increasingly ineffective or intrusive, as modern bots leverage proxies, botnets, and AI-assisted evasion strategies. This work proposes a non-intrusive graph-based bot detection framework for e-commerce that models user session behavior through a graph representation and applies an inductive graph neural network for classification. The approach captures both relational structure and behavioral semantics, enabling accurate identification of subtle automated activity that evades feature-based methods. Experiments on real-world e-commerce traffic demonstrate that the proposed inductive graph model outperforms a strong session-level multilayer perceptron baseline in terms of AUC and F1 score. Additional adversarial perturbation and cold-start simulations show that the model remains robust under moderate graph modifications and generalizes effectively to previously unseen sessions and URLs. The proposed framework is deployment-friendly, integrates with existing systems without client-side instrumentation, and supports real-time inference and incremental updates, making it suitable for practical e-commerce security deployments.

[511] RAP: KV-Cache Compression via RoPE-Aligned Pruning

Jihao Xin, Tian Lyu, David Keyes, Hatem Ltaief, Marco Canini

Main category: cs.LG

TL;DR: RAP (RoPE-Aligned Pruning) compresses KV-Cache in LLMs by pruning entire RoPE-aligned column pairs, enabling joint reduction of KV-Cache, attention parameters, and FLOPs by 20-30% while maintaining accuracy.

DetailsMotivation: Long-context inference in LLMs is bottlenecked by KV-Cache memory and compute costs. Low-rank factorization approaches fail in RoPE-based LLMs because RoPE forces latent KV states to be reconstructed to full dimension, reintroducing overhead.

Method: Proposes RoPE-Aligned Pruning (RAP), which prunes entire RoPE-aligned column pairs to preserve RoPE’s 2x2 rotation structure, restore B absorption from low-rank factorization, and eliminate reconstruction overhead.
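
The structural constraint is concrete: RoPE rotates head dimensions in 2-D pairs, so pruning must remove both members of a pair to keep the rotation well-formed. The sketch below assumes the common rotate_half layout (pairing dimension i with i + d/2) and uses column norms as an illustrative importance score; the paper's scoring may differ.

```python
import torch

def rope_aligned_prune(Wk, keep_pairs):
    """Wk: (d, hidden) key projection; keep the `keep_pairs` most important pairs."""
    half = Wk.shape[0] // 2
    scores = Wk[:half].norm(dim=1) + Wk[half:].norm(dim=1)   # score each pair jointly
    keep = torch.topk(scores, keep_pairs).indices
    idx = torch.cat([keep, keep + half]).sort().values       # keep both pair members
    return Wk[idx]

Wk = torch.randn(128, 4096)
print(rope_aligned_prune(Wk, keep_pairs=48).shape)           # torch.Size([96, 4096])
```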

Result: Evaluation on LLaMA-3-8B and Mistral-7B shows RAP enables joint reduction of KV-Cache, attention parameters, and FLOPs by 20-30% while maintaining strong accuracy. Reduces attention latency to 83% (prefill) and 77% (decode) of baseline.

Conclusion: RAP effectively addresses KV-Cache bottlenecks in RoPE-based LLMs through structured pruning that preserves RoPE’s rotational properties, enabling significant efficiency gains without sacrificing accuracy.

Abstract: Long-context inference in large language models is increasingly bottlenecked by the memory and compute cost of the KV-Cache. Low-rank factorization compresses KV projections by writing $W \approx A * B$, where A produces latent KV states and B can be absorbed into downstream weights. In modern RoPE-based LLMs, this absorption fails: RoPE forces latent KV states to be reconstructed to full dimension, reintroducing substantial memory and compute overhead. We propose RoPE-Aligned Pruning (RAP), which prunes entire RoPE-aligned column pairs to preserve RoPE’s 2x2 rotation structure, restore B absorption, and eliminate reconstruction. Our evaluation on LLaMA-3-8B and Mistral-7B shows that RAP enables joint reduction of KV-Cache, attention parameters, and FLOPs by 20-30%, all at once, while maintaining strong accuracy. Notably, RAP reduces attention latency to 83% (prefill) and 77% (decode) of baseline.

[512] ECHO-2: A Large-Scale Distributed Rollout Framework for Cost-Efficient Reinforcement Learning

Jie Xiao, Meng Chen, Qingnan Ren, Jingwei Song, Jiaqi Huang, Yangshen Deng, Chris Tong, Wanyi Chen, Suli Wang, Ziqian Bi, Shuo Lu, Yiqun Duan, Xu Wang, Rymon Yu, Ween Yang, Lynn Ai, Eric Yang, Bill Shi

Main category: cs.LG

TL;DR: ECHO-2 is a distributed reinforcement learning framework for post-training LLMs that enables efficient wide-area coordination between centralized learning and distributed rollout generation, addressing policy dissemination latency through overlapping operations and peer-assisted broadcast.

DetailsMotivation: Post-training RL for LLMs requires repeated interaction between rollout generation, reward evaluation, and centralized learning. While distributing rollout execution to cost-efficient inference resources is beneficial, it introduces challenges in wide-area coordination and policy dissemination latency.

Method: ECHO-2 combines centralized learning with distributed rollouts, treating bounded policy staleness as a user-controlled parameter to enable overlapping rollout generation, dissemination, and training. It uses an overlap-based capacity model for provisioning, peer-assisted pipelined broadcast to mitigate dissemination bottlenecks, and cost-aware activation of heterogeneous workers.

Result: Experiments on GRPO post-training of 4B and 8B models under real wide-area bandwidth regimes show that ECHO-2 significantly improves cost efficiency while preserving RL reward comparable to strong baselines.

Conclusion: ECHO-2 provides an effective distributed RL framework for post-training LLMs that addresses wide-area coordination challenges, enabling cost-efficient scaling while maintaining training quality.

Abstract: Reinforcement learning (RL) is a critical stage in post-training large language models (LLMs), involving repeated interaction between rollout generation, reward evaluation, and centralized learning. Distributing rollout execution offers opportunities to leverage more cost-efficient inference resources, but introduces challenges in wide-area coordination and policy dissemination. We present ECHO-2, a distributed RL framework for post-training with remote inference workers and non-negligible dissemination latency. ECHO-2 combines centralized learning with distributed rollouts and treats bounded policy staleness as a user-controlled parameter, enabling rollout generation, dissemination, and training to overlap. We introduce an overlap-based capacity model that relates training time, dissemination latency, and rollout throughput, yielding a practical provisioning rule for sustaining learner utilization. To mitigate dissemination bottlenecks and lower cost, ECHO-2 employs peer-assisted pipelined broadcast and cost-aware activation of heterogeneous workers. Experiments on GRPO post-training of 4B and 8B models under real wide-area bandwidth regimes show that ECHO-2 significantly improves cost efficiency while preserving RL reward comparable to strong baselines.

[513] FlashSinkhorn: IO-Aware Entropic Optimal Transport

Felix X. -F. Ye, Xingjie Li, An Yu, Ming-Ching Chang, Linsong Chu, Davis Wertheimer

Main category: cs.LG

TL;DR: FlashSinkhorn: An IO-aware GPU solver for entropic optimal transport using FlashAttention-style fused kernels for efficient Sinkhorn iterations.

DetailsMotivation: Existing GPU solvers for entropic optimal transport (EOT) are inefficient at scale due to quadratic HBM traffic from dense interactions, and existing online backends use generic tiled map-reduce kernels with limited fusion.

Method: Rewrites stabilized log-domain Sinkhorn updates as row-wise LogSumExp reductions of biased dot-product scores (same normalization as transformer attention), enabling FlashAttention-style fusion and tiling with fused Triton kernels that stream tiles through on-chip SRAM and update dual potentials in a single pass.
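
For reference, the (non-fused) math being accelerated: each stabilized log-domain Sinkhorn half-iteration is a row-wise LogSumExp over biased scores, the same normalization pattern as attention. The fused tiling is the paper's contribution; this sketch only shows the update it computes.

```python
import math
import torch

def log_sinkhorn(x, y, eps=0.05, iters=100):
    """Entropic OT dual potentials for squared Euclidean cost between point clouds."""
    n, m = x.shape[0], y.shape[0]
    C = torch.cdist(x, y) ** 2                      # (n, m) cost matrix
    log_mu, log_nu = -math.log(n), -math.log(m)     # uniform marginals
    f, g = torch.zeros(n), torch.zeros(m)
    for _ in range(iters):
        # f_i = eps * (log mu_i - LSE_j((g_j - C_ij) / eps)), symmetrically for g
        f = eps * (log_mu - torch.logsumexp((g[None, :] - C) / eps, dim=1))
        g = eps * (log_nu - torch.logsumexp((f[:, None] - C) / eps, dim=0))
    return f, g

f, g = log_sinkhorn(torch.randn(256, 3), torch.randn(128, 3))
```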

Result: Achieves up to 32× forward-pass and 161× end-to-end speedups over state-of-the-art online baselines on point-cloud OT on A100 GPUs, with improved scalability on OT-based downstream tasks.

Conclusion: FlashSinkhorn provides an efficient IO-aware EOT solver that substantially reduces HBM IO per iteration while retaining linear-memory operations, enabling scalable optimization.

Abstract: Entropic optimal transport (EOT) via Sinkhorn iterations is widely used in modern machine learning, yet GPU solvers remain inefficient at scale. Tensorized implementations suffer quadratic HBM traffic from dense $n\times m$ interactions, while existing online backends avoid storing dense matrices but still rely on generic tiled map-reduce reduction kernels with limited fusion. We present \textbf{FlashSinkhorn}, an IO-aware EOT solver for squared Euclidean cost that rewrites stabilized log-domain Sinkhorn updates as row-wise LogSumExp reductions of biased dot-product scores, the same normalization as transformer attention. This enables FlashAttention-style fusion and tiling: fused Triton kernels stream tiles through on-chip SRAM and update dual potentials in a single pass, substantially reducing HBM IO per iteration while retaining linear-memory operations. We further provide streaming kernels for transport application, enabling scalable first- and second-order optimization. On A100 GPUs, FlashSinkhorn achieves up to $32\times$ forward-pass and $161\times$ end-to-end speedups over state-of-the-art online baselines on point-cloud OT, improves scalability on OT-based downstream tasks. For reproducibility, we release an open-source implementation at https://github.com/ot-triton-lab/ot_triton.

[514] Variational Sparse Paired Autoencoders (vsPAIR) for Inverse Problems and Uncertainty Quantification

Jack Michael Solomon, Rishi Leburu, Matthias Chung

Main category: cs.LG

TL;DR: vsPAIR: Variational Sparse Paired Autoencoder for inverse problems with interpretable uncertainty estimation through paired VAEs and sparse encodings.

DetailsMotivation: Inverse problems require not just point estimates but interpretable uncertainty quantification. Many applications need fast inference with uncertainty estimates, which remains challenging.

Method: Proposes Variational Sparse Paired Autoencoder (vsPAIR) that pairs a standard VAE encoding observations with a sparse VAE encoding quantities of interest, connected through learned latent mapping. Uses variational structure for uncertainty, paired architecture for interpretability, and sparse encodings for structured representations.

Result: Experiments on blind inpainting and computed tomography demonstrate vsPAIR is a capable inverse problem solver that provides interpretable and structured uncertainty estimates.

Conclusion: vsPAIR effectively addresses the challenge of providing both fast inference and interpretable uncertainty estimation for inverse problems through its paired variational architecture with sparse encodings.

Abstract: Inverse problems are fundamental to many scientific and engineering disciplines; they arise when one seeks to reconstruct hidden, underlying quantities from noisy measurements. Many applications demand not just point estimates but interpretable uncertainty. Providing fast inference alongside uncertainty estimates remains challenging yet desirable in numerous applications. We propose the Variational Sparse Paired Autoencoder (vsPAIR) to address this challenge. The architecture pairs a standard VAE encoding observations with a sparse VAE encoding quantities of interest, connected through a learned latent mapping. The variational structure enables uncertainty estimation, the paired architecture encourages interpretability by anchoring QoI representations to clean data, and sparse encodings provide structure by concentrating information into identifiable factors rather than diffusing across all dimensions. To validate the effectiveness of our proposed architecture, we conduct experiments on blind inpainting and computed tomography, demonstrating that vsPAIR is a capable inverse problem solver that can provide interpretable and structured uncertainty estimates.

[515] Privileged Information Distillation for Language Models

Emiliano Penaloza, Dheeraj Vattikonda, Nicolas Gontier, Alexandre Lacoste, Laurent Charlin, Massimo Caccia

Main category: cs.LG

TL;DR: π-Distill and OPSD enable distillation of frontier models for multi-turn agentic environments using only action trajectories as privileged information, outperforming standard supervised finetuning + RL approaches.

DetailsMotivation: Transferring capabilities learned with privileged information (PI) to policies that must act without it at inference time is challenging, especially in multi-turn agentic environments where only action trajectories are observable but internal reasoning is hidden.

Method: Two approaches: 1) π-Distill - joint teacher-student objective training PI-conditioned teacher and unconditioned student simultaneously using same model; 2) On-Policy Self-Distillation (OPSD) - RL with reverse KL-penalty between student and PI-conditioned teacher.
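
The OPSD penalty is straightforward to write down: a reverse KL between the student's next-token distribution and the same model conditioned on privileged information. Tensor shapes and the overall weighting below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def reverse_kl_penalty(student_logits, teacher_logits):
    """Token-wise KL(pi_student || pi_teacher), averaged over the sequence.

    student_logits: (T, V) from the plain context; teacher_logits: (T, V)
    from the PI-conditioned context (same underlying model).
    """
    logp_s = F.log_softmax(student_logits, dim=-1)
    logp_t = F.log_softmax(teacher_logits, dim=-1)
    return (logp_s.exp() * (logp_s - logp_t)).sum(-1).mean()

# Schematic total objective: loss = rl_loss + beta * reverse_kl_penalty(...)
```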

Result: Both algorithms effectively distill frontier agents using action-only PI, outperforming industry standard practices (supervised finetuning + RL) that assume access to full Chain-of-Thought supervision across multiple agentic benchmarks, models, and forms of PI.

Conclusion: π-Distill and OPSD provide effective methods for knowledge distillation in agentic environments where only action trajectories are available as privileged information, enabling transfer of capabilities from frontier models to inference-time policies.

Abstract: Training-time privileged information (PI) can enable language models to succeed on tasks they would otherwise fail, making it a powerful tool for reinforcement learning in hard, long-horizon settings. However, transferring capabilities learned with PI to policies that must act without it at inference time remains a fundamental challenge. We study this problem in the context of distilling frontier models for multi-turn agentic environments, which typically hide their internal reasoning and expose only action trajectories. This breaks standard distillation pipelines, since successful behavior is observable, but the reasoning process is not. For this, we introduce π-Distill, a joint teacher-student objective that trains a PI-conditioned teacher and an unconditioned student simultaneously using the same model. Additionally, we also introduce On-Policy Self-Distillation (OPSD), an alternative approach that trains using Reinforcement Learning (RL) with a reverse KL-penalty between the student and the PI-conditioned teacher. We show that both of these algorithms effectively distill frontier agents using action-only PI. Specifically, we find that π-Distill and, in some cases, OPSD, outperform industry standard practices (Supervised finetuning followed by RL) that assume access to full Chain-of-Thought supervision across multiple agentic benchmarks, models, and forms of PI. We complement our results with extensive analysis that characterizes the factors enabling effective learning with PI, focusing primarily on π-Distill and characterizing when OPSD is competitive.

[516] Shaping Landscapes with Optimistic Potential Estimates (SLOPE)

Yao-Hui Li, Zeyu Wang, Xin Li, Wei Pang, Yingfang Yuan, Zhengkun Chen, Boya Zhang, Riashat Islam, Alex Lamb, Yonggang Zhang

Main category: cs.LG

TL;DR: SLOPE is a model-based RL framework that replaces scalar reward regression with optimistic potential landscapes to address sparse reward challenges.

DetailsMotivation: Standard MBRL struggles in sparse reward settings because regressing ground-truth scalar rewards creates flat, gradient-free landscapes that provide no directional guidance for planning.

Method: Shifts reward modeling from predicting scalars to constructing informative potential landscapes using optimistic distributional regression to estimate high-confidence upper bounds, amplifying rare success signals and ensuring exploration gradients.
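
One way to read "optimistic distributional regression" is quantile regression at a high quantile: the pinball loss below makes rare successes lift the predicted potential rather than being averaged away. The quantile level is an illustrative assumption, not the paper's setting.

```python
import torch

def pinball_loss(pred, target, q=0.95):
    """Quantile (pinball) loss; minimizing it fits the q-th quantile of target."""
    err = target - pred
    return torch.maximum(q * err, (q - 1) * err).mean()

pred = torch.zeros(8)
target = torch.tensor([0., 0., 0., 0., 0., 0., 0., 1.])   # one rare success
print(pinball_loss(pred, target))   # gradient pushes pred toward the upper quantile
```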

Result: Outperforms leading baselines on 30+ tasks across 5 benchmarks in fully sparse, semi-sparse, and dense reward settings.

Conclusion: SLOPE effectively addresses sparse reward challenges in MBRL by transforming reward modeling from scalar prediction to landscape construction, enabling better exploration and planning.

Abstract: Model-based reinforcement learning (MBRL) achieves high sample efficiency by simulating future trajectories with learned dynamics and reward models. However, its effectiveness is severely compromised in sparse reward settings. The core limitation lies in the standard paradigm of regressing ground-truth scalar rewards: in sparse environments, this yields a flat, gradient-free landscape that fails to provide directional guidance for planning. To address this challenge, we propose Shaping Landscapes with Optimistic Potential Estimates (SLOPE), a novel framework that shifts reward modeling from predicting scalars to constructing informative potential landscapes. SLOPE employs optimistic distributional regression to estimate high-confidence upper bounds, which amplifies rare success signals and ensures sufficient exploration gradients. Evaluations on 30+ tasks across 5 benchmarks demonstrate that SLOPE consistently outperforms leading baselines in fully sparse, semi-sparse, and dense rewards.

[517] Learning Where It Matters: Geometric Anchoring for Robust Preference Alignment

Youngjae Cho, Jongsuk Kim, Ji-Hoon Kim

Main category: cs.LG

TL;DR: GAPO improves LLM alignment robustness by replacing fixed reference policies with dynamic, geometry-aware adversarial anchors that adaptively reweight preference pairs based on local sensitivity.

DetailsMotivation: Current preference optimization methods like DPO use static reference policies that become miscalibrated as the policy drifts, causing distributional mismatch and amplifying noise. Reference-free variants avoid mismatch but suffer from unconstrained reward drift.

Method: Proposes Geometric Anchor Preference Optimization (GAPO) which replaces fixed references with dynamic, geometry-aware anchors - adversarial local perturbations of the current policy within a small radius that serve as pessimistic baselines. Introduces Anchor Gap (reward discrepancy between policy and anchor) and optimizes a logistic objective weighted by this gap to downweight brittle instances and emphasize robust preference signals.
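
One plausible reading of the reweighting, shown schematically: compute each pair's implicit reward margin under the policy and under its adversarially perturbed anchor, then downweight pairs whose margin collapses under the perturbation. The weighting function, temperature, and detachment below are all assumptions.

```python
import torch
import torch.nn.functional as F

def gapo_loss(margins_policy, margins_anchor, beta=0.1, temp=1.0):
    """margins_*: (B,) implicit reward margins r(y_w) - r(y_l) per preference pair."""
    gap = margins_policy - margins_anchor          # large gap => brittle under perturbation
    w = torch.sigmoid(-gap / temp).detach()        # downweight geometrically brittle pairs
    return -(w * F.logsigmoid(beta * margins_policy)).mean()
```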

Result: GAPO consistently improves robustness across diverse noise settings while matching or improving performance on standard LLM alignment and reasoning benchmarks.

Conclusion: GAPO provides a more robust approach to LLM alignment by using dynamic geometry-aware anchors to adaptively handle noisy supervision and prevent distributional mismatch issues present in static reference methods.

Abstract: Direct Preference Optimization (DPO) and related methods align large language models from pairwise preferences by regularizing updates against a fixed reference policy. As the policy drifts, a static reference, however, can become increasingly miscalibrated, leading to distributional mismatch and amplifying spurious preference signals under noisy supervision. Conversely, reference-free variants avoid mismatch but often suffer from unconstrained reward drift. We propose Geometric Anchor Preference Optimization (GAPO), which replaces the fixed reference with a dynamic, geometry-aware anchor: an adversarial local perturbation of the current policy within a small radius that serves as a pessimistic baseline. This anchor enables an adaptive reweighting mechanism, modulating the importance of each preference pair based on its local sensitivity. We further introduce the Anchor Gap, the reward discrepancy between the policy and its anchor, and show under smoothness conditions that it approximates worst-case local margin degradation. Optimizing a logistic objective weighted by this gap downweights geometrically brittle instances while emphasizing robust preference signals. Across diverse noise settings, GAPO consistently improves robustness while matching or improving performance on standard LLM alignment and reasoning benchmarks.

[518] CSRv2: Unlocking Ultra-Sparse Embeddings

Lixuan Guo, Yifei Wang, Tiansheng Wen, Yifan Wang, Aosong Feng, Bo Chen, Stefanie Jegelka, Chenyu You

Main category: cs.LG

TL;DR: CSRv2 improves ultra-sparse embeddings via progressive annealing and supervised contrastive learning, achieving 7x speedup over compact dense embeddings while maintaining performance.

DetailsMotivation: Current dense embeddings are high-dimensional and costly, while sparse embeddings suffer severe degradation in ultra-sparse regimes with over 80% inactive neurons, limiting efficiency gains.

Method: CSRv2 uses progressive k-annealing to stabilize sparsity learning, supervised contrastive objectives to enhance representation quality, and full backbone finetuning for end-to-end adaptability.
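
The stabilization trick is easy to sketch: a TopK bottleneck whose k is annealed from a generous start toward the ultra-sparse target over training. The linear schedule below is an illustrative assumption.

```python
import torch

def topk_activation(z, k):
    """Keep the k largest activations per row, zero the rest."""
    vals, idx = torch.topk(z, k, dim=-1)
    return torch.zeros_like(z).scatter_(-1, idx, vals)

def annealed_k(step, total_steps, k_start=64, k_end=2):
    frac = min(step / total_steps, 1.0)            # linear anneal; real schedule may differ
    return max(k_end, round(k_start + frac * (k_end - k_start)))

z = torch.randn(4, 1024)
for step in (0, 5_000, 10_000):
    k = annealed_k(step, 10_000)
    print(k, topk_activation(z, k).count_nonzero(dim=-1))
```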

Result: Reduces dead neurons from 80% to 20%, achieves 14% accuracy gain at k=2, matches performance of CSR at k=8 and MRL at 32 dimensions with only two active features, delivers 7x speedup over MRL and up to 300x efficiency improvements over dense embeddings.

Conclusion: CSRv2 makes ultra-sparse embeddings practical without performance compromise, enabling real-time and edge-deployable AI systems where both embedding quality and efficiency are critical.

Abstract: In the era of large foundation models, the quality of embeddings has become a central determinant of downstream task performance and overall system capability. Yet widely used dense embeddings are often extremely high-dimensional, incurring substantial costs in storage, memory, and inference latency. To address these, Contrastive Sparse Representation (CSR) is recently proposed as a promising direction, mapping dense embeddings into high-dimensional but k-sparse vectors, in contrast to compact dense embeddings such as Matryoshka Representation Learning (MRL). Despite its promise, CSR suffers severe degradation in the ultra-sparse regime, where over 80% of neurons remain inactive, leaving much of its efficiency potential unrealized. In this paper, we introduce CSRv2, a principled training approach designed to make ultra-sparse embeddings viable. CSRv2 stabilizes sparsity learning through progressive k-annealing, enhances representational quality via supervised contrastive objectives, and ensures end-to-end adaptability with full backbone finetuning. CSRv2 reduces dead neurons from 80% to 20% and delivers a 14% accuracy gain at k=2, bringing ultra-sparse embeddings on par with CSR at k=8 and MRL at 32 dimensions, all with only two active features. While maintaining comparable performance, CSRv2 delivers a 7x speedup over MRL, and yields up to 300x improvements in compute and memory efficiency relative to dense embeddings in text representation. Extensive experiments across text and vision demonstrate that CSRv2 makes ultra-sparse embeddings practical without compromising performance, where CSRv2 achieves 7%/4% improvement over CSR when k=4 and further increases this gap to 14%/6% when k=2 in text/vision representation. By making extreme sparsity viable, CSRv2 broadens the design space for real-time and edge-deployable AI systems where both embedding quality and efficiency are critical.

[519] ContextBench: A Benchmark for Context Retrieval in Coding Agents

Han Li, Letian Zhu, Bohan Zhang, Rili Feng, Jiaming Wang, Yue Pan, Earl T. Barr, Federica Sarro, Zhaoyang Chu, He Ye

Main category: cs.LG

TL;DR: ContextBench is a process-oriented evaluation framework for coding agents that measures how they retrieve and use code context during issue resolution, revealing that sophisticated agent scaffolding provides minimal gains and LLMs prioritize recall over precision.

DetailsMotivation: Existing evaluations of LLM-based coding agents focus primarily on final task success, offering limited insight into how agents retrieve and use code context during problem solving. There's a need for process-oriented evaluation to understand context retrieval behavior.

Method: Developed ContextBench with 1,136 issue-resolution tasks from 66 repositories across 8 programming languages, each with human-annotated gold contexts. Created automated evaluation framework tracking agent trajectories and measuring context recall, precision, and efficiency throughout issue resolution. Evaluated 4 frontier LLMs and 5 coding agents.
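
The two headline metrics reduce to set overlap between what the agent retrieved and the annotated gold context. Scoring at file granularity is an assumption here; the benchmark may use finer-grained spans.

```python
def context_metrics(retrieved: set, gold: set):
    hit = len(retrieved & gold)
    recall = hit / len(gold) if gold else 1.0              # did we find what was needed?
    precision = hit / len(retrieved) if retrieved else 0.0  # how much retrieval was wasted?
    return recall, precision

print(context_metrics({"a.py", "b.py", "c.py"}, {"a.py", "d.py"}))  # (0.5, 0.333...)
```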

Result: Sophisticated agent scaffolding yields only marginal gains in context retrieval (“The Bitter Lesson” of coding agents). LLMs consistently favor recall over precision. Substantial gaps exist between explored and utilized context. ContextBench provides intermediate gold-context metrics that complement end-to-end benchmarks.

Conclusion: ContextBench offers valuable intermediate signals for guiding LLM reasoning in software tasks by unboxing the issue-resolution process, revealing important patterns in how coding agents retrieve and use context that aren’t captured by traditional success metrics.

Abstract: LLM-based coding agents have shown strong performance on automated issue resolution benchmarks, yet existing evaluations largely focus on final task success, providing limited insight into how agents retrieve and use code context during problem solving. We introduce ContextBench, a process-oriented evaluation of context retrieval in coding agents. ContextBench consists of 1,136 issue-resolution tasks from 66 repositories across eight programming languages, each augmented with human-annotated gold contexts. We further implement an automated evaluation framework that tracks agent trajectories and measures context recall, precision, and efficiency throughout issue resolution. Using ContextBench, we evaluate four frontier LLMs and five coding agents. Our results show that sophisticated agent scaffolding yields only marginal gains in context retrieval (“The Bitter Lesson” of coding agents), LLMs consistently favor recall over precision, and substantial gaps exist between explored and utilized context. ContextBench augments existing end-to-end benchmarks with intermediate gold-context metrics that unbox the issue-resolution process. These contexts offer valuable intermediate signals for guiding LLM reasoning in software tasks.

[520] The hidden risks of temporal resampling in clinical reinforcement learning

Thomas Frost, Hrisheekesh Vaidya, Steve Harris

Main category: cs.LG

TL;DR: Temporal resampling in offline RL for healthcare degrades live deployment performance due to counterfactual trajectories, distorted temporal expectations, and compounding generalization errors, with standard evaluation metrics failing to detect these issues.

DetailsMotivation: Current offline RL research in healthcare aggregates patient data into fixed time intervals to fit standard frameworks, but the impact of these temporal manipulations on model safety and efficacy is poorly understood.

Method: Used both a gridworld navigation task and the UVA/Padova clinical diabetes simulator to demonstrate how temporal resampling affects offline RL algorithms during live deployment.
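
The failure mode is easy to reproduce in miniature: binning irregular events into fixed intervals merges and separates decisions in ways the original trajectory never exhibited. Bin width and the aggregation rule below are illustrative.

```python
import pandas as pd

events = pd.DataFrame({
    "t_hours": [0.5, 3.9, 4.1, 9.0],
    "dose":    [10,  0,   20,  5],
})
events["bin"] = (events["t_hours"] // 4).astype(int)   # aggregate into 4-hour bins
resampled = events.groupby("bin")["dose"].sum()
print(resampled)
# Decisions at 0.5h and 3.9h merge into one bin, while 3.9h and 4.1h, nearly
# simultaneous in reality, land in different bins: a counterfactual trajectory.
```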

Result: Temporal resampling significantly degrades offline RL performance during live deployment, with three failure mechanisms identified: generation of counterfactual trajectories, distortion of temporal expectations, and compounding of generalization errors.

Conclusion: Reveals fundamental risk in current healthcare ORL pipelines and emphasizes need for methods that explicitly handle irregular timing of clinical decision-making, as standard off-policy evaluation metrics fail to detect performance drops.

Abstract: Offline reinforcement learning (ORL) has shown potential for improving decision-making in healthcare. However, contemporary research typically aggregates patient data into fixed time intervals, simplifying their mapping to standard ORL frameworks. The impact of these temporal manipulations on model safety and efficacy remains poorly understood. In this work, using both a gridworld navigation task and the UVA/Padova clinical diabetes simulator, we demonstrate that temporal resampling significantly degrades the performance of offline reinforcement learning algorithms during live deployment. We propose three mechanisms that drive this failure: (i) the generation of counterfactual trajectories, (ii) the distortion of temporal expectations, and (iii) the compounding of generalisation errors. Crucially, we find that standard off-policy evaluation metrics can fail to detect these drops in performance. Our findings reveal a fundamental risk in current healthcare ORL pipelines and emphasise the need for methods that explicitly handle the irregular timing of clinical decision-making.

[521] UTOPIA: Unlearnable Tabular Data via Decoupled Shortcut Embedding

Jiaming He, Fuming Luo, Hongwei Li, Wenbo Jiang, Wenshu Fan, Zhenbo Shi, Xudong Jiang, Yi Yu

Main category: cs.LG

TL;DR: UTOPIA creates unlearnable tabular data by decoupling optimization into semantic obfuscation on high-saliency features and embedding hyper-correlated shortcuts on low-saliency redundant features, preventing unauthorized model training while preserving tabular validity.

DetailsMotivation: Tabular data in sensitive domains like finance and healthcare needs protection from unauthorized model training, but existing unlearnable example methods designed for vision data transfer poorly to tabular data due to mixed numerical/categorical constraints and saliency sparsity.

Method: UTOPIA exploits feature redundancy to decouple optimization into two channels: (1) high saliency features for semantic obfuscation, and (2) low saliency redundant features for embedding a hyper-correlated shortcut. This creates constraint-aware dominant shortcuts while preserving tabular validity under a Spectral Dominance condition.

Result: Extensive experiments show UTOPIA drives unauthorized training toward near-random performance, outperforming strong unlearnable example baselines and transferring well across different architectures on tabular datasets.

Conclusion: UTOPIA provides an effective mechanism for protecting sensitive tabular data from unauthorized model training by creating certified unlearnable examples that work under realistic tabular data constraints.

Abstract: Unlearnable examples (UE) have emerged as a practical mechanism to prevent unauthorized model training on private vision data, while extending this protection to tabular data is nontrivial. Tabular data in finance and healthcare is highly sensitive, yet existing UE methods transfer poorly because tabular features mix numerical and categorical constraints and exhibit saliency sparsity, with learning dominated by a few dimensions. Under a Spectral Dominance condition, we show certified unlearnability is feasible when the poison spectrum overwhelms the clean semantic spectrum. Guided by this, we propose Unlearnable Tabular Data via DecOuPled Shortcut EmbeddIng (UTOPIA), which exploits feature redundancy to decouple optimization into two channels: high saliency features for semantic obfuscation and low saliency redundant features for embedding a hyper correlated shortcut, yielding constraint-aware dominant shortcuts while preserving tabular validity. Extensive experiments across tabular datasets and models show UTOPIA drives unauthorized training toward near random performance, outperforming strong UE baselines and transferring well across architectures.

[522] PALMS: Pavlovian Associative Learning Models Simulator

Martin Fixman, Alessandro Abati, Julián Jiménez Nimmo, Sean Lim, Esther Mondragón

Main category: cs.LG

TL;DR: PALMS is a Python simulator for Pavlovian conditioning experiments that implements multiple associative learning models with a graphical interface for experimental design input and result visualization.

DetailsMotivation: To provide researchers with a comprehensive simulation environment for Pavlovian conditioning experiments that can handle complex experimental designs, multiple learning models, and large stimulus sets while enabling rapid model comparison.

Method: Developed a Python-based simulator with a graphical interface that implements the canonical Rescorla-Wagner model plus several attentional learning approaches (Pearce-Kaye-Hall, Mackintosh Extended, Le Pelley’s Hybrid) and a novel extension with a unified variable learning rate. Supports configural cues and compounds.
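
For concreteness, the canonical Rescorla-Wagner trial update the simulator implements: each present cue's associative strength moves by alpha * beta * (lambda - total strength of the present cues). Parameter values below are illustrative defaults, not PALMS settings.

```python
def rescorla_wagner_trial(V, present, reinforced, alpha,
                          beta_plus=0.1, beta_minus=0.05, lam=1.0):
    """One conditioning trial: update strengths V (dict cue -> value) in place."""
    total = sum(V[c] for c in present)               # summed prediction of the compound
    beta, target = (beta_plus, lam) if reinforced else (beta_minus, 0.0)
    for c in present:
        V[c] += alpha[c] * beta * (target - total)   # shared prediction error
    return V

V = {"light": 0.0, "tone": 0.0}
alpha = {"light": 0.3, "tone": 0.3}
for _ in range(50):                                  # reinforced compound trials
    rescorla_wagner_trial(V, ["light", "tone"], True, alpha)
print(V)   # cues compete for the same asymptote lambda, ending near 0.5 each
```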

Result: Created PALMS simulator that efficiently handles experiments with hundreds of stimuli, provides instant visualization, enables model comparisons, supports data export, and reproduces published experiments from associative learning literature.

Conclusion: PALMS offers a powerful, open-source tool for simulating Pavlovian conditioning experiments that expands predictive capabilities of existing models and facilitates rapid comparison of different learning theories.

Abstract: Simulations are an indispensable step in the cycle of theory development and refinement, helping researchers formulate precise definitions, generate models, and make accurate predictions. This paper introduces the Pavlovian Associative Learning Models Simulator (PALMS), a Python environment to simulate Pavlovian conditioning experiments. In addition to the canonical Rescorla-Wagner model, PALMS incorporates several attentional learning approaches, including Pearce-Kaye-Hall, Mackintosh Extended, Le Pelley’s Hybrid, and a novel extension of the Rescorla-Wagner model with a unified variable learning rate that integrates Mackintosh’s and Pearce and Hall’s opposing conceptualisations. The simulator’s graphical interface allows for the input of entire experimental designs in an alphanumeric format, akin to that used by experimental neuroscientists. Moreover, it uniquely enables the simulation of experiments involving hundreds of stimuli, as well as the computation of configural cues and configural-cue compounds across all models, thereby considerably expanding their predictive capabilities. PALMS operates efficiently, providing instant visualisation of results, supporting rapid, precise comparisons of various models’ predictions within a single architecture and environment. Furthermore, graphic displays can be easily saved, and simulated data can be exported to spreadsheets. To illustrate the simulator’s capabilities and functionalities, we provide a detailed description of the software and examples of use, reproducing published experiments in the associative learning literature. PALMS is licensed under the open-source GNU Lesser General Public License 3.0. The simulator source code and the latest multiplatform release build are accessible as a GitHub repository at https://github.com/cal-r/PALMS-Simulator

[523] MedVerse: Efficient and Reliable Medical Reasoning via DAG-Structured Parallel Execution

Jianwen Chen, Xinyu Yang, Peng Xia, Arian Azarang, Yueh Z Lee, Gang Li, Hongtu Zhu, Yun Li, Beidi Chen, Huaxiu Yao

Main category: cs.LG

TL;DR: MedVerse: A parallel medical reasoning framework using Petri net-based DAG structure to overcome LLMs’ sequential limitations for complex clinical tasks like differential diagnosis.

DetailsMotivation: LLMs' sequential autoregressive decoding forces inherently parallel clinical reasoning (like differential diagnosis) into linear paths, limiting efficiency and reliability for complex medical problems.

Method: Proposes MedVerse framework with three components: 1) MedVerse Curator for automated synthesis of knowledge-grounded medical reasoning paths into Petri net-structured representations, 2) topology-aware attention mechanism with adaptive position indices for parallel reasoning, and 3) customized inference engine for parallel execution without overhead.
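
The execution idea reduces to level-wise scheduling of a DAG: every reasoning step whose prerequisites are complete can be decoded in parallel. A Kahn-style sketch with an illustrative graph (the real framework's Petri net semantics are richer):

```python
from collections import defaultdict

def parallel_levels(edges, nodes):
    """Group DAG nodes into levels; each level can be expanded in parallel."""
    indeg = {n: 0 for n in nodes}
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    frontier = [n for n in nodes if indeg[n] == 0]
    levels = []
    while frontier:
        levels.append(frontier)                  # decode this whole level at once
        nxt = []
        for u in frontier:
            for v in succ[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    nxt.append(v)
        frontier = nxt
    return levels

edges = [("history", "ddx1"), ("history", "ddx2"), ("ddx1", "final"), ("ddx2", "final")]
print(parallel_levels(edges, ["history", "ddx1", "ddx2", "final"]))
# -> [['history'], ['ddx1', 'ddx2'], ['final']]
```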

Result: Improves strong general-purpose LLMs by up to 8.9%, achieves comparable performance to specialized medical LLMs while delivering 1.3x reduction in inference latency and 1.7x increase in generation throughput through parallel decoding.

Conclusion: MedVerse successfully reformulates medical reasoning as parallelizable DAG process, enabling more efficient and reliable complex medical inference while maintaining logical consistency.

Abstract: Large language models (LLMs) have demonstrated strong performance and rapid progress in a wide range of medical reasoning tasks. However, their sequential autoregressive decoding forces inherently parallel clinical reasoning, such as differential diagnosis, into a single linear reasoning path, limiting both efficiency and reliability for complex medical problems. To address this, we propose MedVerse, a reasoning framework for complex medical inference that reformulates medical reasoning as a parallelizable directed acyclic graph (DAG) process based on Petri net theory. The framework adopts a full-stack design across data, model architecture, and system execution. For data creation, we introduce the MedVerse Curator, an automated pipeline that synthesizes knowledge-grounded medical reasoning paths and transforms them into Petri net-structured representations. At the architectural level, we propose a topology-aware attention mechanism with adaptive position indices that supports parallel reasoning while preserving logical consistency. Systematically, we develop a customized inference engine that supports parallel execution without additional overhead. Empirical evaluations show that MedVerse improves strong general-purpose LLMs by up to 8.9%. Compared to specialized medical LLMs, MedVerse achieves comparable performance while delivering a 1.3x reduction in inference latency and a 1.7x increase in generation throughput, enabled by its parallel decoding capability. Code is available at https://github.com/aiming-lab/MedVerse.

[524] Compiler-Assisted Speculative Sampling for Accelerated LLM Inference on Heterogeneous Edge Devices

Alejandro Ruiz y Mesa, Guilherme Korol, Moritz Riesterer, João Paulo Cardoso de Lima, Jeronimo Castrillon

Main category: cs.LG

TL;DR: EdgeSpec: A framework for efficient speculative decoding of LLMs on edge devices using analytical cost modeling and heterogeneous hardware partitioning.

DetailsMotivation: LLM deployment on resource-constrained edge devices faces severe latency constraints, especially for real-time applications where delayed responses compromise safety/usability. Speculative decoding shows promise but faces challenges: integration into compiler workflows and exploitation of heterogeneous compute resources on modern SoCs.

Method: Uses an analytical cost model to explore heterogeneous hardware configurations and guide coarse-grained partitioning of LLM subgraphs. The model predicts when speculative sampling and heterogeneous execution are beneficial, particularly for edge-typical short input sequences.
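
For orientation, the speculative-sampling loop the cost model reasons about: a draft model proposes tokens and the target model verifies them in one batched pass, accepting each with probability min(1, p_target/p_draft). The stub interfaces and toy distributions below are assumptions.

```python
import numpy as np

def speculative_step(draft_probs, target_probs, proposed, rng):
    """Accept a prefix of `proposed` drafted tokens under the standard rule.

    draft_probs / target_probs: per-position distributions over the vocab.
    """
    accepted = []
    for i, t in enumerate(proposed):
        if rng.random() < min(1.0, target_probs[i][t] / draft_probs[i][t]):
            accepted.append(t)
        else:
            break   # first rejection: resample from the residual and stop
    return accepted

rng = np.random.default_rng(0)
draft  = [np.array([0.6, 0.4]), np.array([0.5, 0.5])]
target = [np.array([0.7, 0.3]), np.array([0.2, 0.8])]
print(speculative_step(draft, target, proposed=[0, 0], rng=rng))
```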

Result: Validated on edge device with hexacore Cortex-A CPU and Mali GPU, achieving up to 1.68× speedup for translation tasks, closely matching analytic expectations.

Conclusion: The analytical cost model effectively enables efficient speculative decoding on edge devices by guiding hardware partitioning strategies, addressing key deployment challenges for real-time LLM applications.

Abstract: LLM deployment on resource-constrained edge devices faces severe latency constraints, particularly in real-time applications where delayed responses can compromise safety or usability. Among many approaches to mitigate the inefficiencies of sequential token-by-token generation, Speculative Decoding (SD) has emerged as a promising technique. However, SD at the edge is hindered by two major challenges: (1) integrating SD into a compiler-based workflow without sacrificing performance or programmability, and (2) exploiting the heterogeneous compute resources of modern SoCs through carefully designed partitioning strategies. This work addresses these challenges by using an analytical cost model that explores heterogeneous hardware configurations and guides coarse-grained partitioning of LLM subgraphs, particularly with edge-typical short input sequence lengths. The cost model predicts when speculative sampling and heterogeneous execution are jointly beneficial and is validated on an edge device featuring a hexacore Cortex-A CPU and a Mali GPU, revealing up to 1.68$\times$ speedup for translation tasks, closely matching analytic expectations.

[525] LLaDA2.1: Speeding Up Text Diffusion via Token Editing

Tiwei Bie, Maosong Cao, Xiang Cao, Bingsen Chen, Fuyuan Chen, Kun Chen, Lun Du, Daozhuo Feng, Haibo Feng, Mingliang Gong, Zhuocheng Gong, Yanmei Gu, Jian Guan, Kaiyuan Guan, Hongliang He, Zenan Huang, Juyong Jiang, Zhonghui Jiang, Zhenzhong Lan, Chengxi Li, Jianguo Li, Zehuan Li, Huabin Liu, Lin Liu, Guoshan Lu, Yuan Lu, Yuxin Ma, Xingyu Mou, Zhenxuan Pan, Kaida Qiu, Yuji Ren, Jianfeng Tan, Yiding Tian, Zian Wang, Lanning Wei, Tao Wu, Yipeng Xing, Wentao Ye, Liangyu Zha, Tianze Zhang, Xiaolu Zhang, Junbo Zhao, Da Zheng, Hao Zhong, Wanli Zhong, Jun Zhou, Junlin Zhou, Liwang Zhu, Muzhi Zhu, Yihong Zhuang

Main category: cs.LG

TL;DR: LLaDA2.1 introduces a joint threshold-decoding scheme combining Token-to-Token editing with Mask-to-Token diffusion, offering Speedy and Quality modes for balancing generation speed and quality, plus RL alignment for improved reasoning and instruction-following.

DetailsMotivation: The paper addresses the trade-off between decoding speed and generation quality in large diffusion language models (dLLMs). While previous work showed scaling potential, maintaining both high speed and quality remained challenging.

Method: 1) Joint configurable threshold-decoding scheme integrating Token-to-Token (T2T) editing with conventional Mask-to-Token (M2T) diffusion. 2) Two operational modes: Speedy Mode (lower M2T thresholds + T2T refinement) and Quality Mode (conservative thresholds). 3) Large-scale RL framework for dLLMs with stable gradient estimation techniques for alignment.
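
Schematically, one decoding step under the joint thresholds might look like the sketch below: masked slots are committed once Mask-to-Token confidence clears tau_m2t, and already-committed tokens may be rewritten when an alternative clears the stricter Token-to-Token threshold tau_t2t. All names, tensors, and thresholds are assumptions, not the released implementation.

```python
import torch

def joint_threshold_step(tokens, mask, probs, tau_m2t=0.7, tau_t2t=0.9):
    """tokens: (L,) ids; mask: (L,) bool, True where still masked; probs: (L, V)."""
    conf, best = probs.max(dim=-1)
    commit = mask & (conf >= tau_m2t)                     # M2T: fill confident slots
    edit = ~mask & (conf >= tau_t2t) & (best != tokens)   # T2T: revise committed ones
    tokens = torch.where(commit | edit, best, tokens)
    return tokens, mask & ~commit
```

Lowering tau_m2t commits more tokens per step (the Speedy Mode trade-off), while the T2T pass gets a chance to repair the resulting mistakes.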

Result: LLaDA2.1 achieves strong performance across 33 benchmarks with lightning-fast decoding: 892 TPS on HumanEval+, 801 TPS on BigCodeBench, and 663 TPS on LiveCodeBench despite 100B parameters. Two model variants released: LLaDA2.1-Mini (16B) and LLaDA2.1-Flash (100B).

Conclusion: LLaDA2.1 successfully balances speed and quality in diffusion language models through innovative threshold-decoding and RL alignment, achieving unprecedented decoding speeds while maintaining strong task performance.

Abstract: While LLaDA2.0 showcased the scaling potential of 100B-level block-diffusion models and their inherent parallelization, the delicate equilibrium between decoding speed and generation quality has remained an elusive frontier. Today, we unveil LLaDA2.1, a paradigm shift designed to transcend this trade-off. By seamlessly weaving Token-to-Token (T2T) editing into the conventional Mask-to-Token (M2T) scheme, we introduce a joint, configurable threshold-decoding scheme. This structural innovation gives rise to two distinct personas: the Speedy Mode (S Mode), which audaciously lowers the M2T threshold to bypass traditional constraints while relying on T2T to refine the output; and the Quality Mode (Q Mode), which leans into conservative thresholds to secure superior benchmark performances with manageable efficiency degrade. Furthering this evolution, underpinned by an expansive context window, we implement the first large-scale Reinforcement Learning (RL) framework specifically tailored for dLLMs, anchored by specialized techniques for stable gradient estimation. This alignment not only sharpens reasoning precision but also elevates instruction-following fidelity, bridging the chasm between diffusion dynamics and complex human intent. We culminate this work by releasing LLaDA2.1-Mini (16B) and LLaDA2.1-Flash (100B). Across 33 rigorous benchmarks, LLaDA2.1 delivers strong task performance and lightning-fast decoding speed. Despite its 100B volume, on coding tasks it attains an astounding 892 TPS on HumanEval+, 801 TPS on BigCodeBench, and 663 TPS on LiveCodeBench.

[526] Breaking the Simplification Bottleneck in Amortized Neural Symbolic Regression

Paul Saegert, Ullrich Köthe

Main category: cs.LG

TL;DR: SimpliPy is a fast rule-based simplification engine for symbolic regression that achieves 100x speed-up over SymPy, enabling Flash-ANSR framework to scale better and achieve state-of-the-art performance on symbolic regression benchmarks.

DetailsMotivation: Amortized symbolic regression struggles with scaling to realistic scientific complexity due to slow reduction of equivalent expressions to normalized forms using general-purpose Computer Algebra Systems like SymPy, which severely limits training and inference speed.

Method: Proposes SimpliPy, a rule-based simplification engine that achieves 100x speed-up over SymPy. Uses this in Flash-ANSR framework for amortized symbolic regression, enabling larger training sets, more efficient token budget use, and systematic training set decontamination.
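
The rule-rewriting idea behind a fast simplifier fits in a few lines: apply local rewrites bottom-up until nothing changes. Real engines index and order hundreds of rules; the three below are a toy illustration, not SimpliPy's rule set.

```python
def simplify(expr):
    """expr: nested tuples like ('+', a, b) with strings/numbers as leaves."""
    if not isinstance(expr, tuple):
        return expr
    op, a, b = expr[0], simplify(expr[1]), simplify(expr[2])
    if op == '+' and b == 0:
        return a                        # x + 0 -> x
    if op == '*' and b == 1:
        return a                        # x * 1 -> x
    if op == '*' and (a == 0 or b == 0):
        return 0                        # 0 * x -> 0
    return (op, a, b)

print(simplify(('+', ('*', 'x', 1), ('*', ('+', 'y', 0), 0))))  # -> 'x'
```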

Result: Flash-ANSR achieves much better accuracy than amortized baselines (NeSymReS, E2E) on FastSRB benchmark. Performs on par with state-of-the-art direct optimization (PySR) while recovering more concise expressions with increasing inference budget.

Conclusion: Fast simplification via SimpliPy enables substantial improvements in amortized symbolic regression, making it competitive with direct optimization methods while maintaining interpretability and conciseness.

Abstract: Symbolic regression (SR) aims to discover interpretable analytical expressions that accurately describe observed data. Amortized SR promises to be much more efficient than the predominant genetic programming SR methods, but currently struggles to scale to realistic scientific complexity. We find that a key obstacle is the lack of a fast reduction of equivalent expressions to a concise normalized form. Amortized SR has addressed this by general-purpose Computer Algebra Systems (CAS) like SymPy, but the high computational cost severely limits training and inference speed. We propose SimpliPy, a rule-based simplification engine achieving a 100-fold speed-up over SymPy at comparable quality. This enables substantial improvements in amortized SR, including scalability to much larger training sets, more efficient use of the per-expression token budget, and systematic training set decontamination with respect to equivalent test expressions. We demonstrate these advantages in our Flash-ANSR framework, which achieves much better accuracy than amortized baselines (NeSymReS, E2E) on the FastSRB benchmark. Moreover, it performs on par with state-of-the-art direct optimization (PySR) while recovering more concise instead of more complex expressions with increasing inference budget.

[527] The Theory and Practice of MAP Inference over Non-Convex Constraints

Leander Kurscheidt, Gabriele Masina, Roberto Sebastiani, Antonio Vergari

Main category: cs.LG

TL;DR: A method for constrained maximum a posteriori (MAP) inference in probabilistic ML systems with non-convex algebraic constraints, using exact message-passing for tractable cases and domain partitioning with numerical optimization for general cases.

DetailsMotivation: Probabilistic ML systems in safety-critical settings need to make predictions subject to algebraic constraints (e.g., obstacle avoidance), but real-world constraints are rarely convex and densities are not log-concave, making constrained MAP prediction challenging.

Method: Two approaches: 1) Exact message-passing algorithm for tractable cases where constrained MAP inference can be performed efficiently, 2) General strategy that partitions the domain into convex feasible regions combined with numerical constrained optimization.

Result: The methods outperform constraint-agnostic baselines on synthetic and real-world benchmarks, and scale to complex densities that are intractable for state-of-the-art exact solvers.

Conclusion: The paper provides effective approaches for constrained MAP inference in probabilistic ML systems with non-convex constraints, addressing a critical need in safety-critical applications.

Abstract: In many safety-critical settings, probabilistic ML systems have to make predictions subject to algebraic constraints, e.g., predicting the most likely trajectory that does not cross obstacles. These real-world constraints are rarely convex, nor the densities considered are (log-)concave. This makes computing this constrained maximum a posteriori (MAP) prediction efficiently and reliably extremely challenging. In this paper, we first investigate under which conditions we can perform constrained MAP inference over continuous variables exactly and efficiently and devise a scalable message-passing algorithm for this tractable fragment. Then, we devise a general constrained MAP strategy that interleaves partitioning the domain into convex feasible regions with numerical constrained optimization. We evaluate both methods on synthetic and real-world benchmarks, showing our approaches outperform constraint-agnostic baselines, and scale to complex densities intractable for SoTA exact solvers.

[528] Redundancy-Free View Alignment for Multimodal Human Activity Recognition with Arbitrarily Missing Views

Duc-Anh Nguyen, Nhien-An Le-Khac

Main category: cs.LG

TL;DR: RALIS is a multimodal multiview learning framework for human activity recognition that handles arbitrary view combinations and missing views through contrastive learning and mixture-of-experts.

DetailsMotivation: Existing multimodal multiview learning approaches struggle with flexible view configurations including arbitrary view combinations, varying numbers of views, and heterogeneous modalities, especially when dealing with missing views during training and inference.

Method: Combines multiview contrastive learning with mixture-of-experts module; uses adjusted center contrastive loss for self-supervised representation learning and view alignment (instead of reconstructing missing views); integrates view weights for quality; reduces computational complexity from O(V²) to O(V); employs specialized load balancing strategy for mixture-of-experts to adapt to arbitrary view combinations.

Result: Validated on four datasets with inertial and human pose modalities, with number of views ranging from three to nine, demonstrating performance and flexibility in handling arbitrary view availability.

Conclusion: RALIS effectively addresses flexible view configurations in multimodal multiview learning for human activity recognition, handling missing views and arbitrary view combinations through contrastive learning and mixture-of-experts fusion.

Abstract: Multimodal multiview learning seeks to integrate information from diverse sources to enhance task performance. Existing approaches often struggle with flexible view configurations, including arbitrary view combinations, numbers of views, and heterogeneous modalities. Focusing on the context of human activity recognition, we propose RALIS, a model that combines multiview contrastive learning with a mixture-of-experts module to support arbitrary view availability during both training and inference. Instead of trying to reconstruct missing views, an adjusted center contrastive loss is used for self-supervised representation learning and view alignment, mitigating the impact of missing views on multiview fusion. This loss formulation allows for the integration of view weights to account for view quality. Additionally, it reduces computational complexity from $O(V^2)$ to $O(V)$, where $V$ is the number of views. To address residual discrepancies not captured by contrastive learning, we employ a mixture-of-experts module with a specialized load balancing strategy, tasked with adapting to arbitrary view combinations. We highlight the geometric relationship among components in our model and how they combine well in the latent space. RALIS is validated on four datasets encompassing inertial and human pose modalities, with the number of views ranging from three to nine, demonstrating its performance and flexibility.
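
The center-based loss is the piece that brings the pairwise cost down to O(V). Below is a sketch of one plausible reading, with per-view quality weights and a presence mask for missing views; RALIS's "adjusted" formulation may differ in detail.

```python
# One plausible reading of a weighted center contrastive loss: each present
# view is pulled toward its sample's center and pushed from other samples'
# centers, giving O(V) similarity terms per sample instead of O(V^2).
import torch
import torch.nn.functional as F

def center_contrastive_loss(views, mask, view_weights, tau=0.1):
    """views: (B, V, D); mask: (B, V), 1 = view present; view_weights: (V,)."""
    B, V, _ = views.shape
    views = F.normalize(views, dim=-1)
    w = view_weights * mask                            # quality-weighted, masked
    centers = (w.unsqueeze(-1) * views).sum(1) / w.sum(1, keepdim=True).clamp(min=1e-6)
    centers = F.normalize(centers, dim=-1)
    logits = torch.einsum('bvd,cd->bvc', views, centers) / tau   # (B, V, B)
    labels = torch.arange(B).repeat_interleave(V)      # positive = own center
    loss = F.cross_entropy(logits.flatten(0, 1), labels, reduction='none')
    return (loss * mask.flatten()).sum() / mask.sum()  # ignore missing views

B, V, D = 8, 4, 32
views, mask = torch.randn(B, V, D), (torch.rand(B, V) > 0.3).float()
print(center_contrastive_loss(views, mask, torch.ones(V)))
```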

cs.MA

[529] LingxiDiagBench: A Multi-Agent Framework for Benchmarking LLMs in Chinese Psychiatric Consultation and Diagnosis

Shihao Xu, Tiancheng Zhou, Jiatong Ma, Yanli Ding, Yiming Yan, Ming Xiao, Guoyi Li, Haiyang Geng, Yunyun Han, Jianhua Chen, Yafeng Deng

Main category: cs.MA

TL;DR: LingxiDiagBench is a large-scale multi-agent benchmark for evaluating LLMs on psychiatric diagnosis, featuring 16K synthetic consultation dialogues across 12 ICD-10 categories with realistic clinical distributions.

DetailsMotivation: Addressing the global shortage of psychiatrists and subjectivity in interview-based diagnosis by creating a benchmark for AI-assisted psychiatric diagnosis that provides realistic patient simulation, clinician-verified labels, and dynamic multi-turn consultation support.

Method: Created LingxiDiag-16K dataset with 16,000 EMR-aligned synthetic consultation dialogues reproducing real clinical demographic and diagnostic distributions. Developed evaluation framework for both static diagnostic inference and dynamic multi-turn psychiatric consultation using LLMs.

Result: LLMs achieve high accuracy on binary depression-anxiety classification (up to 92.3%) but performance deteriorates for comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%). Dynamic consultation often underperforms static evaluation, and consultation quality shows only moderate correlation with diagnostic accuracy.

Conclusion: The benchmark reveals significant limitations in LLMs for complex psychiatric diagnosis, particularly for comorbidity recognition and differential diagnosis, highlighting the need for improved information-gathering strategies and diagnostic reasoning in AI-assisted mental health assessment.

Abstract: Mental disorders are highly prevalent worldwide, but the shortage of psychiatrists and the inherent subjectivity of interview-based diagnosis create substantial barriers to timely and consistent mental-health assessment. Progress in AI-assisted psychiatric diagnosis is constrained by the absence of benchmarks that simultaneously provide realistic patient simulation, clinician-verified diagnostic labels, and support for dynamic multi-turn consultation. We present LingxiDiagBench, a large-scale multi-agent benchmark that evaluates LLMs on both static diagnostic inference and dynamic multi-turn psychiatric consultation in Chinese. At its core is LingxiDiag-16K, a dataset of 16,000 EMR-aligned synthetic consultation dialogues designed to reproduce real clinical demographic and diagnostic distributions across 12 ICD-10 psychiatric categories. Through extensive experiments across state-of-the-art LLMs, we establish key findings: (1) although LLMs achieve high accuracy on binary depression–anxiety classification (up to 92.3%), performance deteriorates substantially for depression–anxiety comorbidity recognition (43.0%) and 12-way differential diagnosis (28.5%); (2) dynamic consultation often underperforms static evaluation, indicating that ineffective information-gathering strategies significantly impair downstream diagnostic reasoning; (3) consultation quality assessed by LLM-as-a-Judge shows only moderate correlation with diagnostic accuracy, suggesting that well-structured questioning alone does not ensure correct diagnostic decisions. We release LingxiDiag-16K and the full evaluation framework to support reproducible research at https://github.com/Lingxi-mental-health/LingxiDiagBench.

[530] Dieu khien he da tac tu (Control of Multi-Agent Systems)

Minh Hoang Trinh, Hieu Minh Nguyen

Main category: cs.MA

TL;DR: A textbook on multi-agent systems control covering fundamental principles, consensus algorithms, and applications like formation control and distributed optimization.

DetailsMotivation: To provide a systematic treatment of multi-agent system control fundamentals, addressing the scarcity of comprehensive textbooks in this field, particularly for educational purposes.

Method: The book is organized into three parts: introduction to multiagent systems and graph theory, design/analysis of linear consensus algorithms, and applications including formation control, network localization, distributed optimization, opinion dynamics, and matrix-weighted networks.

Result: A comprehensive textbook developed from teaching materials used since 2021, presented in a step-by-step manner to make complex topics accessible while maintaining research-level depth.

Conclusion: This book fills an important gap in educational resources for multi-agent systems control, providing systematic coverage of fundamental principles with practical applications and research directions.

Abstract: Since the early 2000s, control of multiagent systems has attracted significant research interest, with applications ranging from natural collective behaviors and social dynamics to engineered systems such as autonomous vehicles, sensor networks, and smart grids. Although research on multi-agent systems has diversified into numerous specialized directions, textbooks – including those in English – that provide a systematic treatment of the fundamental principles of multi-agent system control remain scarce. The material presented in this book has been developed and used in teaching since 2021, initially as a concise Vietnamese-language reference for the courses Networked Control Systems and Control of Multi-Agent Systems at Hanoi University of Science and Technology. The book focuses on a selection of fundamental topics of broad and continuing interest in the field. The complexity of several topics approaches that of research-level studies; however, the analysis is presented in a step-by-step manner to facilitate access to commonly used methods and tools. The material is divided into three main parts. Part I introduces multiagent systems and basic graph-theoretic concepts. Part II addresses the design and analysis of linear consensus algorithms. Part III covers selected applications and research directions, including formation control, network localization, distributed optimization, opinion dynamics, and matrix-weighted networks. Each chapter concludes with notes on notable researchers in this field, further reading, and exercises. This book could not have been completed without the encouragement, support, and suggestions of families, colleagues, and friends. The authors appreciate feedback from readers to further improve the content of the book.
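
As a flavor of the Part II material, the standard discrete-time linear consensus iteration looks like this (a generic textbook example, not code from the book):

```python
# Classic discrete-time linear consensus on an undirected graph: each agent
# repeatedly moves toward the average of its neighbors. On a connected graph
# with step size eps < 1/max_degree, all states converge to the initial mean.
import numpy as np

A = np.array([[0, 1, 0, 1],          # adjacency matrix of a 4-agent cycle
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A       # graph Laplacian
x = np.array([1.0, 3.0, -2.0, 6.0])  # initial agent states
eps = 0.3                            # step size, below 1/max_degree = 0.5

for _ in range(100):
    x = x - eps * (L @ x)            # x_i += eps * sum_j a_ij (x_j - x_i)

print(x)                             # all entries approach the average 2.0
```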

[531] Tiny Moves: Game-based Hypothesis Refinement

Agnieszka Dobrowolska, Rogier Hintzen, Martin Balla, Karl Gemayel, Sabine Reichert, Thomas Charman, Jen Ning Lim, Lindsay Edwards, Anna Gogleva

Main category: cs.MA

TL;DR: The Hypothesis Game: A symbolic formalism using LLM agents with fixed grammar of reasoning moves for incremental scientific hypothesis refinement, outperforming prompting baselines in error recovery tasks.

DetailsMotivation: Most ML approaches frame scientific discovery as end-to-end predictions, obscuring incremental reasoning structure. Scientific progress often occurs through small, localized revisions grounded in domain context rather than extensive rewrites.

Method: Proposes The Hypothesis Game - a symbolic formalism where LLM agents operate on shared hypothesis state using fixed grammar of reasoning moves. Instantiates minimal game with LLM agents for pathway-level mechanistic refinement tasks, evaluating on corruption recovery (controlled errors) and reconstruction from partial cues.

Result: In corruption recovery tasks, game-based approach consistently removes more errors and achieves higher precision than strong prompting baselines while preserving valid structure through incremental edits. In reconstruction from partial cues, performs comparably to strongest baseline, showing explicit move-based refinement remains competitive even when ground-truth recovery is difficult.

Conclusion: Game-based reasoning offers principled route to more controllable, interpretable, and transferable hypothesis refinement systems for scientific discovery, supporting incremental symbolic reasoning over end-to-end prediction approaches.

Abstract: Most machine learning approaches to scientific discovery frame hypotheses as end-to-end predictions, obscuring the incremental structure of scientific reasoning. We propose The Hypothesis Game, a symbolic formalism for hypothesis refinement in which LLM agents operate on a shared hypothesis state using a fixed grammar of reasoning moves. The framework is motivated by the observation that scientific progress often proceeds through small, localized revisions, grounded in domain context, rather than extensive rewrites. We instantiate a minimal game with LLM agents and evaluate it on pathway-level mechanistic refinement tasks. In the primary setting of corruption recovery, where hypotheses contain controlled errors, the game-based approach consistently removes more errors and achieves higher precision than strong prompting baselines, while preserving valid structure through incremental edits. In a secondary reconstruction setting from partial cues, it performs comparably to the strongest baseline, indicating that explicit move-based refinement remains competitive even when ground-truth recovery is difficult. These findings support game-based reasoning as a principled route to more controllable, interpretable, and transferable hypothesis refinement systems for scientific discovery.
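
To make "a fixed grammar of reasoning moves over a shared hypothesis state" concrete, here is an invented minimal instantiation; the paper defines its own state space, pathway representation, and move set.

```python
# Invented minimal instantiation of a move grammar over a shared hypothesis
# state; moves are small, localized revisions instead of wholesale rewrites.
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    edges: set = field(default_factory=set)    # e.g. ("A", "activates", "B")

@dataclass(frozen=True)
class AddEdge:
    edge: tuple
    def apply(self, h): h.edges.add(self.edge)

@dataclass(frozen=True)
class RemoveEdge:
    edge: tuple
    def apply(self, h): h.edges.discard(self.edge)

def play(h, moves):
    """Apply a sequence of localized moves to the shared hypothesis state."""
    for m in moves:
        m.apply(h)
    return h

h = Hypothesis({("TP53", "inhibits", "MDM2")})        # corrupted hypothesis
play(h, [RemoveEdge(("TP53", "inhibits", "MDM2")),    # one agent's proposed moves
         AddEdge(("MDM2", "inhibits", "TP53"))])
print(h.edges)
```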

[532] Multi-Agent Reinforcement Learning Simulation for Environmental Policy Synthesis

James Rudd-Jones, Mirco Musolesi, María Pérez-Ortiz

Main category: cs.MA

TL;DR: A framework combining climate simulations with Multi-Agent Reinforcement Learning (MARL) for climate policy synthesis, addressing challenges in reward definition, scalability, uncertainty propagation, and policy interpretability.

DetailsMotivation: Climate policy development faces challenges from deep uncertainty, complex system dynamics, and competing stakeholder interests. Traditional climate simulations are used for policy evaluation rather than synthesis, and optimization approaches struggle with non-linear dynamics, heterogeneous agents, and comprehensive uncertainty quantification.

Method: Proposes augmenting climate simulations with Multi-Agent Reinforcement Learning (MARL) to address limitations in policy synthesis. The framework tackles key challenges including reward definition, scalability with increasing agents and state spaces, uncertainty propagation across linked systems, and solution validation.

Result: The framework provides a foundation for more sophisticated climate policy exploration, though the abstract doesn’t present specific empirical results. It acknowledges important limitations and areas for future research while addressing challenges in making MARL-derived solutions interpretable and useful for policy-makers.

Conclusion: MARL-augmented climate simulations offer a promising approach for climate policy synthesis that can handle complex system dynamics, uncertainty, and multiple stakeholders, but significant challenges remain in implementation, validation, and policy interpretability.

Abstract: Climate policy development faces significant challenges due to deep uncertainty, complex system dynamics, and competing stakeholder interests. Climate simulation methods, such as Earth System Models, have become valuable tools for policy exploration. However, their typical use is for evaluating potential policies, rather than directly synthesizing them. The problem can be inverted to optimize for policy pathways, but traditional optimization approaches often struggle with non-linear dynamics, heterogeneous agents, and comprehensive uncertainty quantification. We propose a framework for augmenting climate simulations with Multi-Agent Reinforcement Learning (MARL) to address these limitations. We identify key challenges at the interface between climate simulations and the application of MARL in the context of policy synthesis, including reward definition, scalability with increasing agents and state spaces, uncertainty propagation across linked systems, and solution validation. Additionally, we discuss challenges in making MARL-derived solutions interpretable and useful for policy-makers. Our framework provides a foundation for more sophisticated climate policy exploration while acknowledging important limitations and areas for future research.

[533] Virtual Force-Based Routing of Modular Agents on a Graph

Adam Casselman, Manav Vora, Melkior Ornik

Main category: cs.MA

TL;DR: A routing algorithm for modular vehicles that can connect/disconnect mid-transit to efficiently visit multiple target nodes on graphs, using charge-based attraction modeling.

DetailsMotivation: Modular vehicles offer efficiency and flexibility for urban/aerial transportation tasks, but routing multiple modular agents to visit all target nodes with minimal resource expenditure is challenging due to the trade-off between individual path optimality and cost benefits of joining modules.

Method: Models agents and targets as point charges, where modules follow paths of highest attractive force from target nodes and neighboring agents. Introduces a novel algorithm balancing individual path optimality with cost benefits of module joining.

Result: Validated on real-world transportation routes in Champaign-Urbana road network. Proposed method exceeds available benchmarks and demonstrates benefits of modularity in multi-target planning problems.

Conclusion: The approach effectively solves modular vehicle routing problems, showing practical advantages of modularity in transportation systems through efficient multi-target planning.

Abstract: Modular vehicles present a novel area of academic and industrial interest in the field of multi-agent systems. Modularity allows vehicles to connect and disconnect with each other mid-transit, which provides a balance between efficiency and flexibility when solving complex and large-scale tasks in urban or aerial transportation. This paper details a generalized scheme to route multiple modular agents on a graph to a predetermined set of target nodes. The objective is to visit all target nodes while incurring minimum resource expenditure. Agents that are joined together will incur the equivalent cost of a single agent, which is motivated by the logistical benefits of traffic reduction and increased fuel efficiency. To solve this problem, we introduce a novel algorithm that seeks to balance the optimality of the path that every single module takes and the cost benefit of joining modules. Our approach models the agents and targets as point charges, where each module takes the path of highest attractive force induced by its target nodes and neighboring agents. We validate our approach by simulating multiple modular agents along real-world transportation routes in the road network of Champaign-Urbana, Illinois, USA. The proposed method easily exceeds the available benchmarks and illustrates the benefits of modularity in multi-target planning problems.
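
A toy rendering of the point-charge idea, assuming an inverse-square attraction over graph distance and a single module; the actual algorithm additionally trades off path optimality against the cost benefit of joining modules.

```python
# Toy charge-based routing on a grid graph: unvisited targets attract with an
# inverse-square "force" over shortest-path distance, and the module greedily
# steps to the neighbor with the strongest total pull. Illustration only.
import networkx as nx

def attraction(G, node, targets):
    dist = dict(nx.single_source_shortest_path_length(G, node))
    return sum(1.0 / (dist[t] + 1) ** 2 for t in targets)

G = nx.grid_2d_graph(4, 4)
targets = {(3, 3), (0, 3)}
pos = (0, 0)
while targets:
    pos = max(G.neighbors(pos), key=lambda n: attraction(G, n, targets))
    targets.discard(pos)
    print("moved to", pos)
```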

cs.MM

[534] TAROT: Towards Optimization-Driven Adaptive FEC Parameter Tuning for Video Streaming

Jashanjot Singh Sidhu, Aman Sahu, Abdelhak Bentaleb

Main category: cs.MM

TL;DR: TAROT is an adaptive FEC controller for video streaming that dynamically adjusts redundancy, block size, and symbolization per segment to optimize quality-overhead tradeoffs based on network conditions and buffer levels.

DetailsMotivation: Static FEC configurations in video streaming are inefficient - they create unnecessary redundancy during stable periods and insufficient protection during bursty losses, especially with shallow buffers and oversized blocks that increase stall risk.

Method: TAROT is a cross-layer, optimization-driven FEC controller that selects redundancy, block size, and symbolization per segment. It’s codec-agnostic (supports Reed-Solomon, RaptorQ, XOR codes) and uses a fine-grained scoring model that incorporates transport-layer loss/goodput, application buffer dynamics, and block-level timing constraints.

Result: Across Low-Latency Live and Video-on-Demand streaming modes with diverse network traces and ABR algorithms, TAROT reduces FEC overhead by up to 43% while improving perceptual quality by 10 VMAF units with minimal rebuffering.

Conclusion: TAROT achieves a stronger overhead-quality balance than static FECs by dynamically adapting to network conditions and buffer levels, making it more efficient for video streaming applications.

Abstract: Forward Error Correction (FEC) remains essential for protecting video streaming against packet loss, yet most real deployments still rely on static, coarse-grained configurations that cannot react to rapid shifts in loss rate, goodput, or client buffer levels. These rigid settings often create inefficiencies: unnecessary redundancy that suppresses throughput during stable periods, and insufficient protection during bursty losses, especially when shallow buffers and oversized blocks increase stall risk. To address these challenges, we present TAROT, a cross-layer, optimization-driven FEC controller that selects redundancy, block size, and symbolization on a per-segment basis. TAROT is codec-agnostic, supporting Reed-Solomon, RaptorQ, and XOR-based codes, and evaluates a pre-computed candidate set using a fine-grained scoring model. The scoring function jointly incorporates transport-layer loss and goodput, application-layer buffer dynamics, and block-level timing constraints to penalize insufficient coverage, excessive overhead, and slow block completion. To enable realistic testing, we extend the SABRE simulator with two new modules: a high-fidelity packet-loss generator that replays diverse multi-trace loss patterns, and a modular FEC benchmarking layer supporting arbitrary code/parameter combinations. Across Low-Latency Live (LLL) and Video-on-Demand (VoD) streaming modes, diverse network traces, and multiple ABR algorithms, TAROT reduces FEC overhead by up to 43% while improving perceptual quality by 10 VMAF units with minimal rebuffering, achieving a stronger overhead-quality balance than static FECs.
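
A skeletal version of "evaluate a pre-computed candidate set with a scoring model" might look like the following; every term and weight here is an invented placeholder rather than TAROT's scoring function.

```python
# Invented sketch of per-segment FEC selection by scoring a pre-computed
# candidate set: penalize insufficient coverage, excess redundancy, and
# blocks that complete too slowly relative to the playback buffer.
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    code: str          # 'rs', 'raptorq', 'xor'
    redundancy: float  # repair symbols / source symbols
    block_symbols: int

def score(c, loss_rate, goodput_bps, buffer_s, symbol_bits=12000):
    coverage = c.redundancy - loss_rate
    undercoverage = max(0.0, -coverage) * 100.0     # loss exceeds repair capacity
    overhead = max(0.0, coverage) * 10.0            # redundancy beyond the loss rate
    block_time = c.block_symbols * symbol_bits * (1 + c.redundancy) / goodput_bps
    stall_risk = 50.0 if block_time > buffer_s else block_time
    return undercoverage + overhead + stall_risk    # lower is better

candidates = [Candidate('rs', r, b) for r in (0.05, 0.15, 0.3) for b in (32, 128)]
best = min(candidates, key=lambda c: score(c, loss_rate=0.1,
                                           goodput_bps=4e6, buffer_s=0.5))
print(best)
```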

eess.AS

[535] Soft Clustering Anchors for Self-Supervised Speech Representation Learning in Joint Embedding Prediction Architectures

Georgios Ioannides, Adrian Kieback, Judah Goldfeder, Linsey Pang, Aman Chadha, Aaron Elkins, Yann LeCun, Ravid Shwartz-Ziv

Main category: eess.AS

TL;DR: GMM-Anchored JEPA improves self-supervised speech representation learning by using frozen Gaussian Mixture Model soft posteriors as auxiliary targets to prevent representation collapse, outperforming WavLM-style baselines on multiple speech tasks.

DetailsMotivation: Joint Embedding Predictive Architectures (JEPA) for self-supervised speech representation learning suffer from representation collapse without explicit grounding. Existing methods like HuBERT and WavLM require iterative re-clustering, which is computationally expensive.

Method: Proposes GMM-Anchored JEPA: fits a Gaussian Mixture Model once on log-mel spectrograms, uses frozen soft posteriors as auxiliary targets throughout training. Implements decaying supervision schedule where GMM regularization dominates early training before gradually yielding to JEPA objective.

Result: On ~50k hours of speech, improves ASR (28.68% vs. 33.22% WER), emotion recognition (67.76% vs. 65.46%), and slot filling (64.7% vs. 59.1% F1) compared to WavLM-style baseline. Achieves 98% cluster entropy vs 31% for baseline, indicating more uniform cluster utilization.

Conclusion: GMM anchoring effectively prevents representation collapse in JEPA-based speech models with single-pass clustering, outperforming iterative clustering methods while providing more uniform and informative representations.

Abstract: Joint Embedding Predictive Architectures (JEPA) offer a promising approach to self-supervised speech representation learning, but suffer from representation collapse without explicit grounding. We propose GMM-Anchored JEPA, which fits a Gaussian Mixture Model once on log-mel spectrograms and uses its frozen soft posteriors as auxiliary targets throughout training. A decaying supervision schedule allows GMM regularization to dominate early training before gradually yielding to the JEPA objective. Unlike HuBERT and WavLM, which require iterative re-clustering, our approach clusters input features once with soft rather than hard assignments. On ~50k hours of speech, GMM anchoring improves ASR (28.68% vs. 33.22% WER), emotion recognition (67.76% vs. 65.46%), and slot filling (64.7% vs. 59.1% F1) compared to a WavLM-style baseline with matched compute. Cluster analysis shows GMM-anchored representations achieve up to 98% entropy compared to 31% for WavLM-style, indicating substantially more uniform cluster utilization. Code is made available at https://github.com/gioannides/clustering-anchored-jepa.
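
One plausible shape for the decaying supervision schedule is sketched below; the GMM itself would be fit once (e.g. with sklearn.mixture.GaussianMixture on log-mel frames) and its posteriors cached, in contrast to HuBERT-style iterative re-clustering. Shapes, schedule, and loss weights are assumptions, not the paper's recipe.

```python
# Sketch of the decaying supervision schedule: a KL term against the frozen
# GMM's soft posteriors dominates early, then yields to the JEPA objective.
import torch
import torch.nn.functional as F

def anchored_loss(pred, target, cluster_logits, gmm_posteriors, step, total_steps):
    """pred/target: (B, T, D) predicted and (stop-gradient) target embeddings;
    cluster_logits: (B, T, K) head output; gmm_posteriors: (B, T, K) frozen."""
    w = max(0.0, 1.0 - step / (0.5 * total_steps))    # anchor fades over first half
    jepa = F.mse_loss(pred, target)                    # JEPA prediction loss
    anchor = F.kl_div(F.log_softmax(cluster_logits, dim=-1),
                      gmm_posteriors, reduction='batchmean')
    return w * anchor + (1.0 - w) * jepa
```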

[536] Windowed SummaryMixing: An Efficient Fine-Tuning of Self-Supervised Learning Models for Low-resource Speech Recognition

Aditya Srinivas Menon, Kumud Tripathi, Raj Gohil, Pankaj Wasnik

Main category: eess.AS

TL;DR: Windowed SummaryMixing (WSM) enhances linear-time speech SSL by adding local context to global summaries, with selective fine-tuning reducing VRAM usage by 40% while improving ASR performance.

DetailsMotivation: Self-supervised learning for speech suffers from quadratic complexity due to self-attention. Existing linear-time alternatives like SummaryMixing lack sufficient local context, limiting their effectiveness for speech recognition tasks.

Method: Proposes Windowed SummaryMixing (WSM) that integrates local neighborhood summaries alongside global utterance summaries. Introduces selective fine-tuning approach where self-attention layers in SSL models are replaced with WSM blocks and only these blocks are fine-tuned in low-resource settings.

Result: Improves ASR performance while reducing peak VRAM usage by 40% in SSL models. WSM blocks maintain linear-time complexity while providing enhanced context awareness. Selective replacement of attention layers reduces compute, memory, and latency.

Conclusion: WSM with selective fine-tuning provides an efficient alternative to self-attention for speech SSL, making it particularly suitable for low-resource speech recognition applications.

Abstract: Self-supervised learning (SSL) has advanced speech processing but suffers from quadratic complexity due to self-attention. To address this, SummaryMixing (SM) has been proposed as a linear-time alternative that summarizes entire utterances using mean pooling but lacks sufficient local context. In this work, we introduce Windowed SummaryMixing (WSM), which enhances SM by integrating local neighborhood summaries alongside the global summary, maintaining efficiency while improving temporal dependencies. Additionally, we introduce a selective fine-tuning approach, replacing self-attention layers in SSL models with WSM blocks and fine-tuning only these blocks in low-resource settings. Our approach improves ASR performance while reducing peak VRAM usage by 40% in the SSL models. WSM blocks have linear-time complexity with enhanced context awareness. Selectively replacing some attention layers reduces compute, memory, and latency, making it ideal for low-resource speech recognition.
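
A sketch of the windowed-summary idea in PyTorch, assuming a simple mean over a fixed local window alongside the global mean; layer sizes and the mixing MLP are illustrative, not the paper's exact design.

```python
# Sketch of a Windowed SummaryMixing block: each frame sees a global mean
# summary plus a local windowed mean, both computable in linear time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowedSummaryMixing(nn.Module):
    def __init__(self, dim, window=9):
        super().__init__()
        self.local = nn.Linear(dim, dim)
        self.summary = nn.Linear(dim, dim)
        self.window = window
        self.mix = nn.Sequential(nn.Linear(3 * dim, dim), nn.GELU())

    def forward(self, x):                      # x: (B, T, D)
        s = self.summary(x)
        global_sum = s.mean(dim=1, keepdim=True).expand_as(s)
        # Local neighborhood mean via average pooling along time: O(T).
        local_sum = F.avg_pool1d(s.transpose(1, 2), self.window,
                                 stride=1, padding=self.window // 2).transpose(1, 2)
        return self.mix(torch.cat([self.local(x), local_sum, global_sum], dim=-1))

x = torch.randn(2, 100, 64)
print(WindowedSummaryMixing(64)(x).shape)      # torch.Size([2, 100, 64])
```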

[537] Beyond the Utterance: An Empirical Study of Very Long Context Speech Recognition

Robert Flynn, Anton Ragni

Main category: eess.AS

TL;DR: Training ASR models on sequences up to 1 hour long shows performance improvements with up to 21.8 minutes of context, achieving 14.2% relative improvement over short-context baselines.

DetailsMotivation: Traditional ASR models operate on short utterances (<30s) due to computational constraints and independence assumptions, but long-format audio recordings require segmentation. Recent algorithmic and hardware advances enable training on longer sequences.

Method: Train attention-based ASR models on large-scale data using 10 different sequence lengths from 10 seconds to 1 hour. Analyze architectural components including positional encoding methods and model width/depth. Use synthetic data evaluations to analyze context usage.

Result: Performance improves with longer context up to 21.8 minutes (14.2% relative improvement). Positional encoding methods and model architecture are crucial for long sequences. Both linguistic and acoustic aspects of distant context are utilized.

Conclusion: Long-context ASR training is now feasible and beneficial, with optimal performance around 21.8 minutes of context. Proper architectural choices enable effective use of both linguistic and acoustic information from distant context.

Abstract: Automatic speech recognition (ASR) models are normally trained to operate over single utterances, with a short duration of less than 30 seconds. This choice has been made in part due to computational constraints, but also reflects a common, but often inaccurate, modelling assumption that treats utterances as independent and identically distributed samples. When long-format audio recordings are available, to work with such systems, these recordings must first be segmented into short utterances and processed independently. In this work, we show that due to recent algorithmic and hardware advances, this is no longer necessary, and current attention-based approaches can be used to train ASR systems that operate on sequences of over an hour in length. Therefore, to gain a better understanding of the relationship between the training/evaluation sequence length and performance, we train ASR models on large-scale data using 10 different sequence lengths from 10 seconds up to 1 hour. The results show a benefit from using up to 21.8 minutes of context, with up to a 14.2% relative improvement from a short context baseline in our primary experiments. Through modifying various architectural components, we find that the method of encoding positional information and the model’s width/depth are important factors when working with long sequences. Finally, a series of evaluations using synthetic data are constructed to help analyse the model’s use of context. From these results, it is clear that both linguistic and acoustic aspects of the distant context are being used by the model.

[538] Performance Comparison of CNN and AST Models with Stacked Features for Environmental Sound Classification

Parinaz Binandeh Dehaghani, Danilo Pena, A. Pedro Aguiar

Main category: eess.AS

TL;DR: Feature-stacked CNNs for environmental sound classification using multiple acoustic features, compared with transformer models under different training regimes.

DetailsMotivation: To enhance CNN performance for environmental sound classification by aggregating complementary acoustic descriptors into richer input representations, and to compare feature-stacked CNNs with transformer-based models under varying training data availability.

Method: CNN-based models with stacked feature combinations including Log-Mel Spectrogram, Spectral Contrast, Chroma, Tonnetz, MFCCs, and Gammatone Cepstral Coefficients. Experiments conducted on ESC-50 and UrbanSound8K datasets under different training regimes, including pretraining, fine-tuning, and comparison with Audio Spectrogram Transformer models.

Result: Feature-stacked CNNs offer a more computationally and data-efficient alternative when large-scale pretraining or extensive training data are unavailable, making them suitable for resource-constrained and edge-level sound classification scenarios.

Conclusion: Feature-stacked CNNs provide an efficient alternative to transformer models for environmental sound classification in resource-constrained settings, balancing performance with computational requirements.

Abstract: Environmental sound classification (ESC) has gained significant attention due to its diverse applications in smart city monitoring, fault detection, acoustic surveillance, and manufacturing quality control. To enhance CNN performance, feature stacking techniques have been explored to aggregate complementary acoustic descriptors into richer input representations. In this paper, we investigate CNN-based models employing various stacked feature combinations, including Log-Mel Spectrogram (LM), Spectral Contrast (SPC), Chroma (CH), Tonnetz (TZ), Mel-Frequency Cepstral Coefficients (MFCCs), and Gammatone Cepstral Coefficients (GTCC). Experiments are conducted on the widely used ESC-50 and UrbanSound8K datasets under different training regimes, including pretraining on ESC-50, fine-tuning on UrbanSound8K, and comparison with Audio Spectrogram Transformer (AST) models pretrained on large-scale corpora such as AudioSet. This experimental design enables an analysis of how feature-stacked CNNs compare with transformer-based models under varying levels of training data and pretraining diversity. The results indicate that feature-stacked CNNs offer a more computationally and data-efficient alternative when large-scale pretraining or extensive training data are unavailable, making them particularly well suited for resource-constrained and edge-level sound classification scenarios.
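
For reference, stacking such descriptors with librosa looks roughly like this; hop size and feature parameters are illustrative, and GTCC is omitted since librosa ships no gammatone frontend.

```python
# Stack complementary acoustic descriptors on a shared frame grid and
# concatenate along the feature axis to form a single 2-D CNN input.
import numpy as np
import librosa

sr, hop = 22050, 512
y = librosa.chirp(fmin=200, fmax=4000, sr=sr, duration=3.0)   # stand-in clip

lm = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, hop_length=hop))
spc = librosa.feature.spectral_contrast(y=y, sr=sr, hop_length=hop)
ch = librosa.feature.chroma_stft(y=y, sr=sr, hop_length=hop)
tz = librosa.feature.tonnetz(y=y, sr=sr, hop_length=hop)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, hop_length=hop)

stacked = np.vstack([lm, spc, ch, tz, mfcc])    # (128 + 7 + 12 + 6 + 20, T)
print(stacked.shape)
```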

[539] TVTSyn: Content-Synchronous Time-Varying Timbre for Streaming Voice Conversion and Anonymization

Waris Quamer, Mu-Ruei Tseng, Ghady Nasrallah, Ricardo Gutierrez-Osuna

Main category: eess.AS

TL;DR: Streamable speech synthesizer with time-varying timbre representation for real-time voice conversion and speaker anonymization, achieving <80ms GPU latency with improved naturalness and speaker transfer.

DetailsMotivation: Current real-time voice conversion systems have a representational mismatch: content is time-varying while speaker identity is injected as static global embedding, limiting naturalness and expressiveness in streaming applications.

Method: Introduces content-synchronous time-varying timbre (TVT) representation with Global Timbre Memory that expands global timbre into compact facets. Frame-level content attends to this memory, with gating for variation regulation and spherical interpolation for identity preservation. Uses factorized vector-quantized bottleneck to regularize content and reduce speaker leakage.

Result: System achieves <80ms GPU latency, end-to-end streamability, and shows improvements in naturalness, speaker transfer, and anonymization compared to state-of-the-art streaming baselines.

Conclusion: TVT representation provides scalable approach for privacy-preserving and expressive speech synthesis under strict latency budgets, addressing core representational mismatch in current streaming systems.

Abstract: Real-time voice conversion and speaker anonymization require causal, low-latency synthesis without sacrificing intelligibility or naturalness. Current systems have a core representational mismatch: content is time-varying, while speaker identity is injected as a static global embedding. We introduce a streamable speech synthesizer that aligns the temporal granularity of identity and content via a content-synchronous, time-varying timbre (TVT) representation. A Global Timbre Memory expands a global timbre instance into multiple compact facets; frame-level content attends to this memory, a gate regulates variation, and spherical interpolation preserves identity geometry while enabling smooth local changes. In addition, a factorized vector-quantized bottleneck regularizes content to reduce residual speaker leakage. The resulting system is streamable end-to-end, with <80 ms GPU latency. Experiments show improvements in naturalness, speaker transfer, and anonymization compared to SOTA streaming baselines, establishing TVT as a scalable approach for privacy-preserving and expressive speech synthesis under strict latency budgets.

[540] Evaluation of acoustic Green’s function in rectangular rooms with general surface impedance walls

Matteo Calafà, Yuanxin Xia, Jonas Brunskog, Cheol-Ho Jeong

Main category: eess.AS

TL;DR: Extends acoustic room mode analysis to include soft-wall boundaries and provides efficient semi-analytical Green’s function computation for rectangular rooms.

DetailsMotivation: Existing analytical methods for acoustic room modes only work for perfectly reflecting or nearly rigid walls, failing for general boundary conditions like significant wall absorption. There's a need for methods that accommodate soft-wall boundaries.

Method: Develops first-order asymptotics for soft-wall boundaries and introduces a semi-analytical, efficient method for computing Green’s function in rectangular rooms. Uses mode expansion approach with truncation order control.

Result: Method provides reliable Green’s function computation with negligible error at sufficient truncation orders. Validated through numerical tests and can serve as benchmark for numerical simulations.

Conclusion: Provides comprehensive framework for acoustic analysis in rectangular rooms with general boundary conditions, addressing spectral basis orthogonality and completeness issues.

Abstract: Acoustic room modes and the Green’s function mode expansion are well-known for rectangular rooms with perfectly reflecting walls. First-order approximations also exist for nearly rigid boundaries; however, current analytical methods fail to accommodate more general boundary conditions, e.g., when wall absorption is significant. In this work, we present a comprehensive analysis that extends previous studies by including additional first-order asymptotics that account for soft-wall boundaries. In addition, we introduce a semi-analytical, efficient, and reliable method for computing the Green’s function in rectangular rooms, which is described and validated through numerical tests. With a sufficiently large truncation order, the resulting error becomes negligible, making the method suitable as a benchmark for numerical simulations. Additional aspects regarding the spectral basis orthogonality and completeness are also addressed, providing a general framework for the validity of the proposed approach.
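
For orientation, the rigid-wall baseline that this work generalizes is the textbook mode expansion; a direct numerical sum (up to sign and normalization conventions, with a small damping term to keep resonances finite) looks like:

```python
# Rigid-wall rectangular-room Green's function via mode expansion (the
# baseline extended by this paper). The truncation order N per axis
# controls accuracy, echoing the abstract's convergence remark.
import numpy as np
from itertools import product

def greens_rigid(r, r0, k, L, N=20, damping=0.05):
    V = np.prod(L)
    G = 0j
    for n in product(range(N), repeat=3):
        kn2 = sum((ni * np.pi / Li) ** 2 for ni, Li in zip(n, L))
        eps = np.prod([1.0 if ni == 0 else 2.0 for ni in n])   # mode normalization
        psi = lambda p: np.prod([np.cos(ni * np.pi * pc / Li)
                                 for ni, pc, Li in zip(n, p, L)])
        G += eps * psi(r) * psi(r0) / (V * (kn2 - k**2 - 1j * damping * k))
    return G

k = 2 * np.pi * 100.0 / 343.0                  # 100 Hz tone, c = 343 m/s
print(greens_rigid((1.0, 1.0, 1.0), (2.5, 1.5, 1.2), k, (4.0, 3.0, 2.5)))
```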

[541] BioME: A Resource-Efficient Bioacoustic Foundational Model for IoT Applications

Heitor R. Guimarães, Abhishek Tiwari, Mahsa Abdollahi, Anderson R. Avila, Tiago H. Falk

Main category: eess.AS

TL;DR: BioME: A resource-efficient audio encoder for bioacoustic applications using layer-to-layer distillation from high-capacity models, with modulation-aware features via FiLM conditioning for better generalization on IoT devices.

DetailsMotivation: Passive acoustic monitoring is crucial for biodiversity assessment, but current SSL-based audio encoders (BEATs, AVES) are computationally expensive and lack robustness for deployment on resource-constrained IoT platforms.

Method: BioME uses layer-to-layer distillation from a high-capacity teacher model to reduce parameters by 75%. It’s pretrained on multi-domain data (speech, environmental sounds, animal vocalizations) and incorporates modulation-aware acoustic features via FiLM conditioning for better feature disentanglement.

Result: BioME matches or surpasses the performance of larger models including its teacher across multiple bioacoustic tasks, while being suitable for resource-constrained IoT deployments.

Conclusion: BioME provides an efficient solution for bioacoustic monitoring on IoT devices through distillation and modulation-aware features, enabling strong performance with reduced computational requirements.

Abstract: Passive acoustic monitoring has become a key strategy in biodiversity assessment, conservation, and behavioral ecology, especially as Internet-of-Things (IoT) devices enable continuous in situ audio collection at scale. While recent self-supervised learning (SSL)-based audio encoders, such as BEATs and AVES, have shown strong performance in bioacoustic tasks, their computational cost and limited robustness to unseen environments hinder deployment on resource-constrained platforms. In this work, we introduce BioME, a resource-efficient audio encoder designed for bioacoustic applications. BioME is trained via layer-to-layer distillation from a high-capacity teacher model, enabling strong representational transfer while reducing the parameter count by 75%. To further improve ecological generalization, the model is pretrained on multi-domain data spanning speech, environmental sounds, and animal vocalizations. A key contribution is the integration of modulation-aware acoustic features via FiLM conditioning, injecting a DSP-inspired inductive bias that enhances feature disentanglement in low-capacity regimes. Across multiple bioacoustic tasks, BioME matches or surpasses the performance of larger models, including its teacher, while being suitable for resource-constrained IoT deployments. For reproducibility, code and pretrained checkpoints are publicly available.
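
FiLM conditioning itself is compact; a generic layer (dimensions illustrative, and the conditioning features here are random stand-ins for the modulation-aware features the paper uses) is:

```python
# Sketch of FiLM conditioning: a small network maps conditioning features to
# per-channel scale (gamma) and shift (beta) that modulate hidden activations.
import torch
import torch.nn as nn

class FiLM(nn.Module):
    def __init__(self, cond_dim, num_channels):
        super().__init__()
        self.to_gamma_beta = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, h, cond):
        # h: (B, T, C) hidden activations; cond: (B, T, cond_dim) features.
        gamma, beta = self.to_gamma_beta(cond).chunk(2, dim=-1)
        return (1 + gamma) * h + beta      # near-identity when weights are small

h = torch.randn(4, 50, 256)
cond = torch.randn(4, 50, 32)              # e.g. modulation spectrum features
print(FiLM(32, 256)(h, cond).shape)        # torch.Size([4, 50, 256])
```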

[542] Diffusion-based Signal Refiner for Speech Enhancement and Separation

Masato Hirano, Ryosuke Sawata, Naoki Murata, Shusuke Takahashi, Yuki Mitsufuji

Main category: eess.AS

TL;DR: Diffiner uses diffusion models as a post-processor to enhance perceptual quality of speech processing outputs by learning natural speech priors and replacing unnatural artifacts.

DetailsMotivation: Despite improvements in objective metrics, there remains a gap between speech processing systems and human perceptual quality. Current deterministic approaches fail to address unnatural artifacts introduced during processing.

Method: Diffiner leverages diffusion models’ generative capabilities to learn natural prior distributions of clean speech. It analyzes both original degraded speech and pre-processed speech to identify unnatural artifacts, then uses iterative diffusion sampling to replace degraded portions with perceptually natural speech segments.

Result: Experimental results show Diffiner recovers clearer harmonic structure of speech, improves perceptual quality metrics, and performs well in human listening tests. It effectively enhances existing speech processing pipelines.

Conclusion: Diffiner serves as an effective versatile post-processor that bridges the gap between objective metrics and human perceptual quality in speech processing using diffusion models’ generative capabilities.

Abstract: Although recent speech processing technologies have achieved significant improvements in objective metrics, there still remains a gap in human perceptual quality. This paper proposes Diffiner, a novel solution that utilizes the powerful generative capability of diffusion models’ prior distributions to address this fundamental issue. Diffiner leverages the probabilistic generative framework of diffusion models and learns natural prior distributions of clean speech to convert outputs from existing speech processing systems into perceptually natural high-quality audio. In contrast to conventional deterministic approaches, our method simultaneously analyzes both the original degraded speech and the pre-processed speech to accurately identify unnatural artifacts introduced during processing. Then, through the iterative sampling process of the diffusion model, these degraded portions are replaced with perceptually natural and high-quality speech segments. Experimental results indicate that Diffiner can recover a clearer harmonic structure of speech, which is shown to result in improved perceptual quality w.r.t. several metrics as well as in a human listening test. This highlights Diffiner’s efficacy as a versatile post-processor for enhancing existing speech processing pipelines.

[543] Deep Room Impulse Response Completion

Jackie Lin, Georg Götz, Sebastian J. Schlecht

Main category: eess.AS

TL;DR: DECOR is a deep neural network autoencoder that predicts late reverberation from early RIR portions for fast spatial audio generation in VR/games.

DetailsMotivation: Traditional methods for generating room impulse responses (RIRs) in VR and games are computationally expensive or noisy. The insight that early reflections contain enough room information enables predicting late reverberation from just the first 50ms.

Method: Proposes “RIR completion” task and DECOR (Deep Exponential Completion Of Room impulse responses) - an autoencoder network that predicts multi-exponential decay envelopes of filtered noise sequences from early RIR portions.

Result: DECOR achieves comparable performance to adapted state-of-the-art networks, showing feasibility of RIR completion task for fast late reverberation approximation.

Conclusion: RIR completion can enhance RIR generation tasks requiring fast late reverberation approximation, with DECOR’s interpretable output facilitating integration with diverse rendering techniques.

Abstract: Rendering immersive spatial audio in virtual reality (VR) and video games demands a fast and accurate generation of room impulse responses (RIRs) to recreate auditory environments plausibly. However, the conventional methods for simulating or measuring long RIRs are either computationally intensive or challenged by low signal-to-noise ratios. This study is propelled by the insight that direct sound and early reflections encapsulate sufficient information about room geometry and absorption characteristics. Building upon this premise, we propose a novel task termed “RIR completion,” aimed at synthesizing the late reverberation given only the early portion (50 ms) of the response. To this end, we introduce DECOR, Deep Exponential Completion Of Room impulse responses, a deep neural network structured as an autoencoder designed to predict multi-exponential decay envelopes of filtered noise sequences. The interpretability of DECOR’s output facilitates its integration with diverse rendering techniques. The proposed method is compared against an adapted state-of-the-art network, and comparable performance shows promising results supporting the feasibility of the RIR completion task. The RIR completion can be widely adapted to enhance RIR generation tasks where fast late reverberation approximation is required.
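
The output parameterization, multi-exponential decay envelopes applied to noise, can be synthesized directly; a single-band sketch with invented parameter values:

```python
# Synthesize a late-reverberation tail as noise shaped by a sum of exponential
# decay envelopes, the kind of parameterization DECOR's output implies.
# Per-band filtering is omitted; amplitudes and decay times are invented.
import numpy as np

def late_reverb(amplitudes, decay_times, sr=16000, duration=1.0, seed=0):
    """amplitudes, decay_times: per-exponential pairs (A_k, T60_k in seconds)."""
    t = np.arange(int(sr * duration)) / sr
    envelope = sum(A * np.exp(-6.91 * t / T60)     # 6.91 = ln(1000): 60 dB decay
                   for A, T60 in zip(amplitudes, decay_times))
    noise = np.random.default_rng(seed).standard_normal(t.size)
    return noise * envelope

tail = late_reverb(amplitudes=[1.0, 0.3], decay_times=[0.4, 1.2])
print(tail.shape)
```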

[544] TTA: Transcribe, Translate and Alignment for Cross-lingual Speech Representation

Wei Liu, Jiahong Li, Yiwen Shao, Dong Yu

Main category: eess.AS

TL;DR: TTA: A lightweight speech semantic model trained on 358k hours of multilingual data for better LLM integration, outperforming Whisper on ASR, speech translation, and retrieval tasks.

DetailsMotivation: Current speech-LLM models using Whisper encoder have limitations in input format, model scale, and semantic performance. There's a need for a more effective speech semantic model specifically designed for LLM integration.

Method: Proposes the TTA (Transcribe, Translate and Alignment) model, trained on 358k hours of multilingual speech data across ASR, speech translation, and speech-text alignment tasks, to produce robust cross-lingual speech representations.

Result: TTA demonstrates superiority over Whisper across diverse benchmarks including ASR, speech translation, speech retrieval, and ASR-LLM performance assessments. Validates cross-lingual capabilities and ASR/ST performance interplay.

Conclusion: TTA provides more effective speech semantic representations for LLM integration, with model weights and training recipes to be released as part of the Auden audio understanding toolkit.

Abstract: Speech-LLM models have demonstrated great performance in multi-modal and multi-task speech understanding. A typical speech-LLM paradigm is integrating speech modality with a large language model (LLM). While the Whisper encoder was frequently adopted in previous studies for speech input, it shows limitations regarding input format, model scale, and semantic performance. To this end, we propose a lightweight TTA model specialized in speech semantics for more effective LLM integration. With large-scale training of 358k hours of speech data on multilingual speech recognition (ASR), speech translation (ST) and speech-text alignment tasks, TTA is capable of producing robust cross-lingual speech representations. Extensive evaluations across diverse benchmarks, including ASR/ST, speech retrieval, and ASR-LLM performance assessments, demonstrate TTA’s superiority over Whisper. Furthermore, we rigorously validate the interplay between cross-lingual capabilities and ASR/ST performance. The model weights and training recipes of TTA will be released as part of an audio understanding toolkit Auden.

eess.IV

[545] Smaller is Better: Generative Models Can Power Short Video Preloading

Liming Liu, Jiangkai Wu, Xinggong Zhang

Main category: eess.IV

TL;DR: PromptStream is a computation-powered preloading paradigm for short video platforms that uses semantic prompts instead of pixel-level video chunks, leveraging generative models like Stable Diffusion to reduce bandwidth while maintaining quality.

DetailsMotivation: Existing video preloading strategies face a fundamental tradeoff: aggressive preloading reduces playback stalls but wastes bandwidth, while conservative strategies save data but increase stall risk. There's a need for a solution that breaks this tradeoff.

Method: Three core techniques: (1) gradient-based prompt inversion method that compresses frames into compact token embeddings, (2) computation-aware scheduling strategy that jointly optimizes network and compute resources, and (3) scalable searching algorithm to handle the enlarged scheduling space.

Result: PromptStream reduces both stalls and bandwidth waste by over 31%, and improves Quality of Experience (QoE) by 45% compared to traditional preloading strategies.

Conclusion: The computation-powered preloading paradigm using semantic prompts and generative models successfully breaks the traditional tradeoff between stalls and bandwidth waste in video streaming.

Abstract: Preloading is widely used in short video platforms to minimize playback stalls by downloading future content in advance. However, existing strategies face a tradeoff. Aggressive preloading reduces stalls but wastes bandwidth, while conservative strategies save data but increase the risk of playback stalls. This paper presents PromptStream, a computation-powered preloading paradigm that breaks this tradeoff by using local computation to reduce bandwidth demand. Instead of transmitting pixel-level video chunks, PromptStream sends compact semantic prompts that are decoded into high-quality frames using generative models such as Stable Diffusion. We propose three core techniques to enable this paradigm: (1) a gradient-based prompt inversion method that compresses frames into small sets of compact token embeddings; (2) a computation-aware scheduling strategy that jointly optimizes network and compute resource usage; and (3) a scalable searching algorithm that addresses the enlarged scheduling space introduced by the scheduler. Evaluations show that PromptStream reduces both stalls and bandwidth waste by over 31%, and improves Quality of Experience (QoE) by 45%, compared to traditional strategies.
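
Gradient-based prompt inversion reduces to optimizing a small embedding against a frozen generator. The sketch below uses a stand-in MLP decoder rather than Stable Diffusion; everything here is illustrative.

```python
# Sketch of gradient-based prompt inversion: freeze a generator and optimize
# a compact token embedding so its decoded output matches the target frame.
import torch
import torch.nn as nn

decoder = nn.Sequential(nn.Linear(16, 256), nn.Tanh(),
                        nn.Linear(256, 3 * 32 * 32))    # frozen stand-in generator
for p in decoder.parameters():
    p.requires_grad_(False)

target = torch.rand(3 * 32 * 32)                        # frame to invert
prompt = torch.zeros(16, requires_grad=True)            # compact token embedding
opt = torch.optim.Adam([prompt], lr=0.05)

for step in range(500):
    opt.zero_grad()
    loss = ((decoder(prompt) - target) ** 2).mean()
    loss.backward()
    opt.step()

print(f"reconstruction MSE: {loss.item():.4f}")         # prompt now encodes the frame
```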

[546] Camel: Frame-Level Bandwidth Estimation for Low-Latency Live Streaming under Video Bitrate Undershooting

Liming Liu, Zhidong Jia, Li Jiang, Wei Zhang, Lan Xie, Feng Qian, Leju Yan, Bing Yan, Qiang Ma, Zhou Sha, Wei Yang, Yixuan Ban, Xinggong Zhang

Main category: eess.IV

TL;DR: Camel is a frame-level congestion control algorithm for low-latency live streaming that addresses stalling issues caused by temporal variations in video encoding by using frame-level network feedback to better estimate true network capacity.

DetailsMotivation: Low-latency live streaming suffers from frequent stalling events even when encoded bitrate doesn't fully utilize available bandwidth. This occurs because conventional packet-level congestion control algorithms misestimate bandwidth due to temporal variations in real-time video encoding, causing packet loss or increased queueing delay when high-bitrate frames are suddenly produced.

Method: Camel uses frame-level network feedback to capture true network capacity, immune to irregular sending patterns from encoding. It has three key modules: Bandwidth and Delay Estimator and Congestion Detector (jointly determine average sending rate), and Bursting Length Controller (governs emission pattern to prevent packet loss).

Result: In real-world deployment with 250M users and 2B sessions across 150+ countries: up to 70.8% increase in 1080P resolution ratio, 14.4% increase in media bitrate, and up to 14.1% reduction in stalling ratio. In simulations: up to 19.8% higher bitrate, 93.0% lower stalling ratio, and 23.9% improvement in bandwidth estimation accuracy.

Conclusion: Camel effectively addresses the stalling problem in low-latency live streaming by using frame-level congestion control that better handles temporal encoding variations, significantly improving video quality and reducing playback interruptions.

Abstract: Low-latency live streaming (LLS) has emerged as a popular web application, with many platforms adopting real-time protocols such as WebRTC to minimize end-to-end latency. However, we observe a counter-intuitive phenomenon: even when the actual encoded bitrate does not fully utilize the available bandwidth, stalling events remain frequent. This insufficient bandwidth utilization arises from the intrinsic temporal variations of real-time video encoding, which cause conventional packet-level congestion control algorithms to misestimate available bandwidth. When a high-bitrate frame is suddenly produced, sending at the wrong rate can either trigger packet loss or increase queueing delay, resulting in playback stalls. To address these issues, we present Camel, a novel frame-level congestion control algorithm (CCA) tailored for LLS. Our insight is to use frame-level network feedback to capture the true network capacity, immune to the irregular sending pattern caused by encoding. Camel comprises three key modules: the Bandwidth and Delay Estimator and the Congestion Detector, which jointly determine the average sending rate, and the Bursting Length Controller, which governs the emission pattern to prevent packet loss. We evaluate Camel on both large-scale real-world deployments and controlled simulations. In the real-world platform with 250M users and 2B sessions across 150+ countries, Camel achieves up to a 70.8% increase in 1080P resolution ratio, a 14.4% increase in media bitrate, and up to a 14.1% reduction in stalling ratio. In simulations under undershooting, shallow buffers, and network jitter, Camel outperforms existing congestion control algorithms, with up to 19.8% higher bitrate, 93.0% lower stalling ratio, and 23.9% improvement in bandwidth estimation accuracy.
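
The frame-level feedback idea can be pictured as estimating capacity from each frame's packet-train dispersion and smoothing across frames; the formulas below are invented for illustration, not Camel's estimator.

```python
# Invented illustration of frame-level estimation: infer capacity from the
# time each whole frame takes to drain through the network, which is
# insensitive to the encoder's bursty per-packet sending pattern.
def frame_bandwidth_bps(frame_bytes, first_pkt_ts, last_pkt_ts):
    dispersion = last_pkt_ts - first_pkt_ts      # seconds to receive the frame
    if dispersion <= 0:
        return None                              # single-burst frame: no signal
    return 8 * frame_bytes / dispersion

def update_estimate(prev_bps, sample_bps, alpha=0.125):
    return (1 - alpha) * prev_bps + alpha * sample_bps   # EWMA smoothing

est = 2e6
for frame in [(60_000, 0.000, 0.096), (110_000, 0.133, 0.298)]:
    sample = frame_bandwidth_bps(*frame)
    if sample is not None:
        est = update_estimate(est, sample)
print(f"estimated capacity: {est / 1e6:.2f} Mbps")
```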

[547] Operator-Based Information Theory for Imaging: Entropy, Capacity, and Irreversibility in Physical Measurement Systems

Charles Wood

Main category: eess.IV

TL;DR: Operator-based information theory framework for imaging systems that quantifies information flow through spectral properties of imaging operators, applicable to linear, nonlinear, and stochastic systems.

DetailsMotivation: Traditional imaging metrics (resolution, contrast, SNR) don't fully capture how physical transformations affect information flow. Need a general framework to analyze information redistribution in imaging systems across different modalities.

Method: Models imaging chain as composition of bounded operators acting on functions. Uses spectral properties of operators to characterize information redistribution. Develops three measures: operator entropy (energy distribution across singular spectrum), operator information capacity (recoverable modes above noise threshold), and irreversibility index (information lost through mode suppression).

Result: Provides analytical examples showing how attenuation, blur, and sampling affect entropy, capacity, and irreversibility differently. Framework applies to linear, nonlinear, and stochastic operators independent of specific imaging modality.

Conclusion: Establishes general structure for analyzing physical limits of imaging, forming basis for future work on information geometry, spatiotemporal budgets, nonlinear channels, and reconstruction algorithms.

Abstract: Imaging systems are commonly described using resolution, contrast, and signal-to-noise ratio, but these quantities do not provide a general account of how physical transformations affect the flow of information. This paper introduces an operator-based formulation of information theory for imaging. The approach models the imaging chain as a composition of bounded operators acting on functions, and characterises information redistribution using the spectral properties of these operators. Three measures are developed. Operator entropy quantifies how an operator distributes energy across its singular spectrum. Operator information capacity describes the number of modes that remain recoverable above a noise-dependent threshold. An irreversibility index measures the information lost through suppression or elimination of modes and captures the accumulation of information loss under operator composition. The framework applies to linear, nonlinear, and stochastic operators and does not depend on the specific imaging modality. Analytical examples show how attenuation, blur, and sampling affect entropy, capacity, and irreversibility in different ways. The results provide a general structure for analysing the physical limits of imaging and form the basis for subsequent work on information geometry, spatiotemporal budgets, nonlinear channels, and reconstruction algorithms.
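
The three measures are straightforward to compute for a discrete operator. The sketch below instantiates them for a Gaussian blur matrix, following the abstract's verbal definitions; the paper's exact normalizations may differ.

```python
# Operator entropy, capacity, and an irreversibility index for a 1-D Gaussian
# blur, computed from the operator's singular spectrum.
import numpy as np

n = 64
kernel = np.exp(-0.5 * (np.arange(-3, 4) / 1.5) ** 2)
kernel /= kernel.sum()
H = np.zeros((n, n))                       # convolution as a banded matrix
for i in range(n):
    for j, kv in zip(range(i - 3, i + 4), kernel):
        if 0 <= j < n:
            H[i, j] = kv

s = np.linalg.svd(H, compute_uv=False)
p = s**2 / np.sum(s**2)                    # energy distribution over modes
entropy = -np.sum(p * np.log(p + 1e-30))   # operator entropy (nats)
noise_floor = 0.05
capacity = int(np.sum(s > noise_floor))    # modes recoverable above noise
irreversibility = np.sum(s[s <= noise_floor] ** 2) / np.sum(s**2)

print(f"entropy={entropy:.3f} nats, capacity={capacity}/{n}, "
      f"irreversibility={irreversibility:.3%}")
```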

[548] SAS-Net: Scene-Appearance Separation Network for Robust Spatiotemporal Registration in Bidirectional Photoacoustic Microscopy

Jiahao Qin

Main category: eess.IV

TL;DR: A unified scene-appearance separation framework for bidirectional optical-resolution photoacoustic microscopy that addresses domain shift and geometric distortion to enable robust high-speed functional brain imaging.

Motivation: High-speed bidirectional scanning in OR-PAM introduces severe spatiotemporal misalignment from coupled scan-direction-dependent domain shift and geometric distortion, which conventional registration methods fail to address due to violated brightness constancy assumptions.

Method: Proposes a unified scene-appearance separation framework that separates domain-invariant scene content from domain-specific appearance characteristics, enabling cross-domain reconstruction with geometric preservation. Uses scene consistency loss to promote geometric correspondence in latent space, linking domain shift correction with spatial registration.

Result: Achieves NCC of 0.961 and SSIM of 0.894 for in vivo mouse brain vasculature imaging, substantially outperforming conventional methods. Domain alignment loss is critical (82% NCC reduction without it), while scene and cycle consistency losses provide complementary regularization. Achieves 11.2 ms inference time per frame (86 fps), enabling real-time processing.

Conclusion: The proposed framework enables robust high-speed bidirectional OR-PAM for reliable quantitative and longitudinal functional imaging, with real-time processing capabilities exceeding typical acquisition rates.

Abstract: High-speed optical-resolution photoacoustic microscopy (OR-PAM) with bidirectional scanning enables rapid functional brain imaging but introduces severe spatiotemporal misalignment from coupled scan-direction-dependent domain shift and geometric distortion. Conventional registration methods rely on brightness constancy, an assumption violated under bidirectional scanning, leading to unreliable alignment. A unified scene-appearance separation framework is proposed to jointly address domain shift and spatial misalignment. The proposed architecture separates domain-invariant scene content from domain-specific appearance characteristics, enabling cross-domain reconstruction with geometric preservation. A scene consistency loss promotes geometric correspondence in the latent space, linking domain shift correction with spatial registration within a single framework. For in vivo mouse brain vasculature imaging, the proposed method achieves normalized cross-correlation (NCC) of 0.961 and structural similarity index (SSIM) of 0.894, substantially outperforming conventional methods. Ablation studies demonstrate that domain alignment loss is critical, with its removal causing 82% NCC reduction (0.961 to 0.175), while scene consistency and cycle consistency losses provide complementary regularization for optimal performance. The method achieves 11.2 ms inference time per frame (86 fps), substantially exceeding typical OR-PAM acquisition rates and enabling real-time processing. These results suggest that the proposed framework enables robust high-speed bidirectional OR-PAM for reliable quantitative and longitudinal functional imaging. The code will be publicly available at https://github.com/D-ST-Sword/SAS-Net
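
A hypothetical sketch of the loss structure suggested by the abstract. Here `enc` maps an image to (scene, appearance) codes and `dec` reconstructs an image from such a pair; the pairing of forward/backward scans and the concrete form of each term (especially the domain-alignment proxy) are assumptions, not the paper's definitions.

```python
import torch
import torch.nn.functional as F

def sas_losses(enc, dec, x_fwd, x_bwd, weights=(1.0, 1.0, 1.0)):
    s_f, a_f = enc(x_fwd)    # scene + appearance codes, forward scan
    s_b, a_b = enc(x_bwd)    # scene + appearance codes, backward scan
    # Scene consistency: paired scans should share latent scene geometry.
    scene = F.l1_loss(s_f, s_b)
    # Cycle consistency: each image is reconstructable from its own codes.
    cycle = F.l1_loss(dec(s_f, a_f), x_fwd) + F.l1_loss(dec(s_b, a_b), x_bwd)
    # Domain alignment (crude statistics-matching proxy): scene-code
    # distributions should not depend on scan direction.
    domain = (F.mse_loss(s_f.mean(dim=0), s_b.mean(dim=0))
              + F.mse_loss(s_f.std(dim=0), s_b.std(dim=0)))
    w_s, w_c, w_d = weights
    return w_s * scene + w_c * cycle + w_d * domain
```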

[549] Intensity-based Segmentation of Tissue Images Using a U-Net with a Pretrained ResNet-34 Encoder: Application to Mueller Microscopy

Sooyong Chae, Dani Giammattei, Ajmal Ajmal, Junzhu Pei, Amanda Sanchez, Tananant Boonya-ananta, Andres Rodriguez, Tatiana Novikova, Jessica Ramella-Roman

Main category: eess.IV

TL;DR: U-Net with ResNet-34 encoder trained on Mueller matrix intensity data achieves 89.71% pixel accuracy for automated segmentation of cervical tissue sections using limited biomedical imaging data.

Motivation: Manual annotation of thin tissue sections in Mueller microscopy is time-consuming and limits scalability, creating a need for automated approaches to accelerate biomedical image analysis.

Method: U-Net architecture with pretrained ResNet-34 encoder trained on only the M11 element (total intensity) of Mueller matrix to segment four classes in murine uterine cervix sections: background, internal os, cervical tissue, and vaginal wall.

Result: Achieved 89.71% pixel accuracy and 80.96% mean tissue Dice coefficient on held-out test dataset using only 70 cervical tissue sections, demonstrating effective transfer learning from ImageNet.

Conclusion: Intensity-based framework with minimal preprocessing enables accurate segmentation despite limited training data, is extensible to other imaging modalities and tissue types, and includes publicly available graphical annotation tools for practical deployment.

Abstract: Manual annotation of the images of thin tissue sections remains a time-consuming step in Mueller microscopy and limits its scalability. We present a novel automated approach using only the total intensity M11 element of the Mueller matrix as the input to a U-Net architecture with a pretrained ResNet-34 encoder. The network was trained to distinguish four classes in the images of murine uterine cervix sections: background, internal os, cervical tissue, and vaginal wall. With only 70 cervical tissue sections, the model achieved 89.71% pixel accuracy and 80.96% mean tissue Dice coefficient on the held-out test dataset. Transfer learning from ImageNet enables accurate segmentation despite the limited training-dataset size typical of specialized biomedical imaging. This intensity-based framework requires minimal preprocessing and is readily extensible to other imaging modalities and tissue types, with publicly available graphical annotation tools for practical deployment.
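
For readers who want to reproduce the basic setup: a U-Net with a pretrained ResNet-34 encoder can be instantiated in a few lines, for example with the segmentation_models_pytorch library. Whether the authors used this particular library, and the input resolution, are assumptions; the class layout follows the summary above.

```python
import torch
import segmentation_models_pytorch as smp

model = smp.Unet(
    encoder_name="resnet34",
    encoder_weights="imagenet",   # transfer learning from ImageNet
    in_channels=1,                # single-channel M11 total-intensity image
    classes=4,                    # background, internal os, cervical tissue, vaginal wall
)

x = torch.randn(1, 1, 256, 256)   # dummy M11 intensity image (size assumed)
logits = model(x)                  # (1, 4, 256, 256) per-pixel class scores
```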

[550] A Real-Time DDS-Based Chest X-Ray Decision Support System for Resource-Constrained Clinics

Omar H. Khater, Basem Almadani, Farouq Aliyu, Esam Al-Nahari

Main category: eess.IV

TL;DR: Real-time chest X-ray decision support system using fine-tuned ResNet50 and Fast DDS middleware for remote healthcare in bandwidth-constrained environments.

Motivation: IoT-based healthcare systems can improve healthcare delivery in humanitarian/remote areas, but limited network infrastructure makes reliable communication challenging for traditional IoT systems.

Method: Integrates a fine-tuned ResNet50 deep learning model for disease classification with Fast DDS real-time middleware to ensure reliable, low-latency communication between healthcare practitioners and the inference system.

Result: The model achieves 88.61% accuracy, 88.76% precision, and 88.49% recall; the system attains an average throughput of 3.2 KB/s and an average latency of 65 ms.

Conclusion: The system demonstrates its suitability for deployment in bandwidth-constrained environments and the effectiveness of DDS-based middleware for real-time medical decision support in remote healthcare.

Abstract: Internet of Things (IoT)-based healthcare systems offer significant potential for improving healthcare delivery in humanitarian and resource-constrained environments, providing essential services to underserved populations in remote areas. However, limited network infrastructure in such regions makes reliable communication challenging for traditional IoT systems. This paper presents a real-time chest X-ray decision support system designed for hospitals in remote locations. The proposed system integrates a fine-tuned ResNet50 deep learning model for disease classification with Fast DDS real-time middleware to ensure reliable and low-latency communication between healthcare practitioners and the inference system. Experimental results show that the model achieves an accuracy of 88.61%, precision of 88.76%, and recall of 88.49%. The system attains an average throughput of 3.2 KB/s and an average latency of 65 ms, demonstrating its suitability for deployment in bandwidth-constrained environments. These results highlight the effectiveness of DDS-based middleware in enabling real-time medical decision support for remote healthcare applications.
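
A minimal sketch of the classification side only: fine-tuning a pretrained ResNet50 head for chest X-ray classes with torchvision. The number of classes and the freeze-the-backbone recipe are assumptions, and the Fast DDS transport layer is omitted entirely.

```python
import torch
import torch.nn as nn
from torchvision import models

num_classes = 14  # assumption, e.g. ChestX-ray14-style label set
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = nn.Linear(model.fc.in_features, num_classes)  # replace classifier head

# One common recipe: freeze the backbone and fine-tune only the new head.
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)
```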

[551] Targeted Unlearning Using Perturbed Sign Gradient Methods With Applications On Medical Images

George R. Nahass, Zhu Wang, Homa Rashidisabet, Won Hwa Kim, Sasha Hubschman, Jeffrey C. Peterson, Chad A. Purnell, Pete Setabutr, Ann Q. Tran, Darvin Yi, Sathya N. Ravi

Main category: eess.IV

TL;DR: Machine unlearning as a practical tool for clinical model maintenance, with bilevel optimization for boundary-based unlearning and tunable loss design for forgetting-retention tradeoff.

Motivation: To address the need for post-deployment model revision in clinical contexts where data shifts, device deprecation, and policy changes are common, moving beyond privacy-focused unlearning to general-purpose model maintenance.

Method: Proposes a bilevel optimization formulation of boundary-based unlearning solved with iterative algorithms, featuring tunable loss design for controlling forgetting-retention tradeoff and supporting model composition strategies.

Result: Outperforms baselines on both forgetting and retention metrics across benchmark and real-world clinical imaging datasets, including scenarios with imaging devices and anatomical outliers.

Conclusion: Establishes machine unlearning as a modular, practical alternative to retraining for real-world model maintenance in clinical applications.

Abstract: Machine unlearning aims to remove the influence of specific training samples from a trained model without full retraining. While prior work has largely focused on privacy-motivated settings, we recast unlearning as a general-purpose tool for post-deployment model revision. Specifically, we focus on utilizing unlearning in clinical contexts where data shifts, device deprecation, and policy changes are common. To this end, we propose a bilevel optimization formulation of boundary-based unlearning that can be solved using iterative algorithms. We provide convergence guarantees when first-order algorithms are used to unlearn. Our method introduces tunable loss design for controlling the forgetting-retention tradeoff and supports novel model composition strategies that merge the strengths of distinct unlearning runs. Across benchmark and real-world clinical imaging datasets, our approach outperforms baselines on both forgetting and retention metrics, including scenarios involving imaging devices and anatomical outliers. This work establishes machine unlearning as a modular, practical alternative to retraining for real-world model maintenance in clinical applications.
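
A heavily simplified sketch of one "perturbed sign gradient" unlearning step: ascend the loss on the forget set using the sign of the gradient plus small noise, while a retention term preserves performance on data to keep. The paper's bilevel formulation and convergence guarantees are not reproduced here; the weighting `lam` and noise scale are illustrative.

```python
import torch
import torch.nn.functional as F

def unlearning_step(model, forget_batch, retain_batch, lr=1e-4, lam=1.0, noise=1e-3):
    xf, yf = forget_batch
    xr, yr = retain_batch
    # Minimizing this loss ascends the forget-set loss and descends the retain-set loss.
    loss = -F.cross_entropy(model(xf), yf) + lam * F.cross_entropy(model(xr), yr)
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                step = p.grad.sign() + noise * torch.randn_like(p)  # perturbed sign
                p -= lr * step
```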

[552] ECGFlowCMR: Pretraining with ECG-Generated Cine CMR Helps Cardiac Disease Classification and Phenotype Prediction

Xiaocheng Fang, Zhengyao Ding, Guangkun Nie, Jieyi Cai, Yujie Xiao, Bo Liu, Jiarui Jin, Haoyu Wang, Shun Huang, Ting Chen, Hongyan Li, Shenda Hong

Main category: eess.IV

TL;DR: ECGFlowCMR: A novel framework that generates cardiac MRI sequences from ECG signals using phase-aware masked autoencoders and anatomy-motion disentangled flow models to address cross-modal temporal mismatches and anatomical observability gaps.

Motivation: Cardiac MRI provides comprehensive cardiac assessment but is expensive and requires expert annotations, limiting large-scale datasets. ECGs are inexpensive and widely available, offering potential for conditioning generative synthesis of cine CMR sequences.

Method: Proposes ECGFlowCMR with two key components: 1) Phase-Aware Masked Autoencoder (PA-MAE) to handle cross-modal temporal mismatch between multi-beat ECG recordings and single-cycle CMR sequences, and 2) Anatomy-Motion Disentangled Flow (AMDF) to address anatomical observability gap due to limited structural information in ECGs.

Result: Extensive experiments on UK Biobank and proprietary clinical datasets show ECGFlowCMR can generate realistic cine CMR sequences from ECG inputs, enabling scalable pretraining and improving performance on downstream cardiac disease classification and phenotype prediction tasks.

Conclusion: ECGFlowCMR successfully addresses cross-modal generation challenges between ECG and CMR, providing a framework for scalable pretraining and enhanced downstream cardiac analysis tasks using inexpensive ECG inputs.

Abstract: Cardiac Magnetic Resonance (CMR) imaging provides a comprehensive assessment of cardiac structure and function but remains constrained by high acquisition costs and reliance on expert annotations, limiting the availability of large-scale labeled datasets. In contrast, electrocardiograms (ECGs) are inexpensive, widely accessible, and offer a promising modality for conditioning the generative synthesis of cine CMR. To this end, we propose ECGFlowCMR, a novel ECG-to-CMR generative framework that integrates a Phase-Aware Masked Autoencoder (PA-MAE) and an Anatomy-Motion Disentangled Flow (AMDF) to address two fundamental challenges: (1) the cross-modal temporal mismatch between multi-beat ECG recordings and single-cycle CMR sequences, and (2) the anatomical observability gap due to the limited structural information inherent in ECGs. Extensive experiments on the UK Biobank and a proprietary clinical dataset demonstrate that ECGFlowCMR can generate realistic cine CMR sequences from ECG inputs, enabling scalable pretraining and improving performance on downstream cardiac disease classification and phenotype prediction tasks.
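
The PA-MAE component builds on masked-autoencoder pretraining; below is only the generic random-masking skeleton such models share. The phase-aware part (aligning mask positions with cardiac phases) would require R-peak or phase labels and is not attempted here; everything shown is a standard MAE ingredient, not the paper's code.

```python
import torch

def random_mask(tokens, mask_ratio=0.75):
    """tokens: (batch, seq, dim). Returns visible tokens and kept indices."""
    b, n, d = tokens.shape
    n_keep = int(n * (1 - mask_ratio))
    scores = torch.rand(b, n, device=tokens.device)
    keep = scores.argsort(dim=1)[:, :n_keep]  # random subset per sample
    visible = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, d))
    return visible, keep
```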

[553] EchoJEPA: A Latent Predictive Foundation Model for Echocardiography

Alif Munim, Adibvafa Fallahpour, Teodora Szasz, Ahmadreza Attarpour, River Jiang, Brana Sooriyakanthan, Maala Sooriyakanthan, Heather Whitney, Jeremy Slivnick, Barry Rubin, Wendy Tsang, Bo Wang

Main category: eess.IV

TL;DR: EchoJEPA is a foundation model for echocardiography that uses latent predictive objectives to learn robust anatomical representations while ignoring ultrasound speckle noise, achieving superior performance on cardiac measurements and generalization across patient populations.

Motivation: Current foundation models for echocardiography struggle to separate anatomical signals from inherent ultrasound artifacts like speckle noise and acquisition artifacts, limiting their robustness and generalization capabilities.

Method: Trained on 18 million echocardiograms across 300K patients using a latent predictive objective to learn robust anatomical representations that ignore speckle noise, validated through multi-view probing with frozen backbones.

Result: Outperforms leading baselines by ~20% in LVEF estimation and ~17% in RVSP estimation, achieves 79% view classification with only 1% labeled data, shows only 2% degradation under acoustic perturbations vs 17% for competitors, and demonstrates superior zero-shot performance on pediatric patients.

Conclusion: Latent prediction is a superior paradigm for building robust, generalizable medical AI foundation models that can effectively handle ultrasound artifacts and generalize across diverse patient populations.

Abstract: Foundation models for echocardiography often struggle to disentangle anatomical signal from the stochastic speckle and acquisition artifacts inherent to ultrasound. We present EchoJEPA, a foundation model trained on 18 million echocardiograms across 300K patients, representing the largest pretraining corpus for this modality to date. By leveraging a latent predictive objective, EchoJEPA learns robust anatomical representations that ignore speckle noise. We validate this using a novel multi-view probing framework with frozen backbones, where EchoJEPA outperforms leading baselines by approximately 20% in left ventricular ejection fraction (LVEF) estimation and 17% in right ventricular systolic pressure (RVSP) estimation. The model also exhibits remarkable sample efficiency, reaching 79% view classification accuracy with only 1% of labeled data versus 42% for the best baseline trained on 100%. Crucially, EchoJEPA demonstrates superior generalization, degrading by only 2% under physics-informed acoustic perturbations compared to 17% for competitors. Most remarkably, its zero-shot performance on pediatric patients surpasses fully fine-tuned baselines, establishing latent prediction as a superior paradigm for robust, generalizable medical AI.
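
A sketch of what a JEPA-style latent predictive objective looks like in general: a context encoder embeds visible patches, a predictor regresses the *latent* embeddings of target patches produced by an EMA target encoder, and the loss lives entirely in representation space rather than pixel space, which is what lets such models discard speckle. All module names are hypothetical; EchoJEPA's exact masking and predictor design are not reproduced.

```python
import torch
import torch.nn.functional as F

def jepa_loss(context_encoder, target_encoder, predictor, x_ctx, x_tgt):
    z_ctx = context_encoder(x_ctx)       # embeddings of the visible context
    with torch.no_grad():                # target encoder is an EMA copy, no grads
        z_tgt = target_encoder(x_tgt)
    z_pred = predictor(z_ctx)            # predict target latents from context
    return F.smooth_l1_loss(z_pred, z_tgt)

def ema_update(target_encoder, context_encoder, m=0.996):
    """Slowly track the context encoder to stabilize the latent targets."""
    for pt, pc in zip(target_encoder.parameters(), context_encoder.parameters()):
        pt.data.mul_(m).add_(pc.data, alpha=1 - m)
```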

[554] Mamba-FCS: Joint Spatio-Frequency Feature Fusion, Change-Guided Attention, and SeK Loss for Enhanced Semantic Change Detection in Remote Sensing

Buddhi Wijenayake, Athulya Ratnayake, Praveen Sumanasekara, Roshan Godaliyadda, Parakrama Ekanayake, Vijitha Herath, Nichula Wasalathilaka

Main category: eess.IV

TL;DR: Mamba-FCS: A semantic change detection framework using Visual State Space Model backbone with frequency domain features, change-guided attention, and specialized loss for remote sensing imagery.

Motivation: Need for semantic change detection models that balance spatial context, computational efficiency, and sensitivity to class-imbalanced land-cover transitions in remote sensing imagery.

Method: Uses Visual State Space Model (Mamba) backbone with Joint Spatio-Frequency Fusion block (log-amplitude frequency features), Change-Guided Attention module linking BCD and SCD tasks, and Separated Kappa loss for class imbalance.

Result: State-of-the-art performance on SECOND (88.62% OA, 65.78% F_scd, 25.50% SeK) and Landsat-SCD (96.25% OA, 89.27% F_scd, 60.26% SeK) datasets.

Conclusion: Mamba architectures with proposed enhancements show substantial potential for effective and scalable semantic change detection in remote sensing applications.

Abstract: Semantic Change Detection (SCD) from remote sensing imagery requires models balancing extensive spatial context, computational efficiency, and sensitivity to class-imbalanced land-cover transitions. Convolutional Neural Networks excel at local feature extraction but lack global context, while Transformers provide global modeling at high computational cost. Recent Mamba architectures based on state-space models offer compelling solutions through linear complexity and efficient long-range modeling. In this study, we introduce Mamba-FCS, an SCD framework built upon a Visual State Space Model backbone that incorporates: a Joint Spatio-Frequency Fusion block using log-amplitude frequency-domain features to enhance edge clarity and suppress illumination artifacts; a Change-Guided Attention (CGA) module that explicitly links the naturally intertwined binary change detection (BCD) and SCD tasks; and a Separated Kappa (SeK) loss tailored for class-imbalanced performance optimization. Extensive evaluation on the SECOND and Landsat-SCD datasets shows that Mamba-FCS achieves state-of-the-art metrics: 88.62% Overall Accuracy, 65.78% F_scd, and 25.50% SeK on SECOND, and 96.25% Overall Accuracy, 89.27% F_scd, and 60.26% SeK on Landsat-SCD. Ablation analyses confirm the distinct contribution of each novel component, with qualitative assessments highlighting significant improvements in SCD. Our results underline the substantial potential of Mamba architectures, enhanced by the proposed techniques, setting a new benchmark for effective and scalable semantic change detection in remote sensing applications. The complete source code, configuration files, and pre-trained models will be publicly available upon publication.
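
A minimal sketch of the log-amplitude frequency feature named in the abstract: take the 2-D FFT of a feature map and keep log(1 + |F|), which preserves edge structure while compressing illumination-driven magnitude differences. How these features are fused with the spatial branch inside Mamba-FCS is not reproduced here.

```python
import torch

def log_amplitude(x):
    """x: (batch, channels, H, W) feature map -> same-shape frequency feature."""
    spectrum = torch.fft.fft2(x, norm="ortho")  # complex 2-D spectrum per channel
    return torch.log1p(spectrum.abs())          # log(1 + amplitude)
```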

Last updated: 2026-02-13
Built with Hugo, using a modified Stack theme