Today’s Research Highlights

AI-enhanced summaries of the latest research papers from arXiv.

cs.CL [Total: 77]
cs.CV [Total: 215]
cs.AI [Total: 44]
cs.SD [Total: 11]
cs.LG [Total: 164]
cs.MA [Total: 6]
cs.MM [Total: 3]
eess.AS [Total: 4]
eess.IV [Total: 10]

cs.CL

[1] Democratizing LLM Efficiency: From Hyperscale Optimizations to Universal Deployability

Hen-Hsen Huang

Main category: cs.CL

TL;DR: The paper argues that current LLM efficiency methods (MoE, speculative decoding, RAG) only benefit hyperscale providers and proposes a new research agenda focused on robust simplicity for democratizing LLM deployment.

Details

Motivation: Current efficiency methods collapse into overhead and fragility outside hyperscale contexts, leaving most organizations without viable options and creating inequality in LLM access.

Method: Proposes retrofitting pretrained models without retraining, lightweight fine-tuning, economical reasoning, dynamic knowledge management without heavy RAG, and Overhead-Aware Efficiency benchmarking.

Result: A framework for redefining efficiency to include adoption cost, sustainability, and fairness rather than just performance metrics.

Conclusion: By focusing on robust simplicity and adoption costs, LLM optimization can reduce inequality and carbon waste rather than amplifying them, enabling broader democratization of LLM deployment.

Abstract: Large language models (LLMs) have become indispensable, but the most celebrated efficiency methods – mixture-of-experts (MoE), speculative decoding, and complex retrieval-augmented generation (RAG) – were built for hyperscale providers with vast infrastructure and elite teams. Outside that context, their benefits collapse into overhead, fragility, and wasted carbon. The result is that a handful of Big Tech companies benefit, while thousands of hospitals, schools, governments, and enterprises are left without viable options. We argue that the next frontier is not greater sophistication at scale, but robust simplicity: efficiency that thrives under modest resources and minimal expertise. We propose a new research agenda: retrofitting pretrained models with more efficient architectures without retraining, inventing lightweight fine-tuning that preserves alignment, making reasoning economical despite long chains of thought, enabling dynamic knowledge management without heavy RAG pipelines, and adopting Overhead-Aware Efficiency (OAE) as a standard benchmark. By redefining efficiency to include adoption cost, sustainability, and fairness, we can democratize LLM deployment – ensuring that optimization reduces inequality and carbon waste rather than amplifying them.

[2] Harmonic Token Projection (HTP): A Vocabulary-Free, Training-Free, Deterministic, and Reversible Embedding Methodology

Tcharlies Schmitz

Main category: cs.CL

TL;DR: HTP is a reversible, deterministic framework for generating text embeddings without training, using harmonic trajectories from Unicode representations to create interpretable vector mappings.

Details

Motivation: To create transparent and efficient text embeddings that don't rely on statistical co-occurrence, optimization, or training data, providing a deterministic alternative to neural embeddings.

Method: Encodes each token analytically as a harmonic trajectory derived from its Unicode integer representation, establishing a bijective mapping between symbols and continuous vector space with phase-coherent projections.

Result: Achieves Spearman correlation of ρ = 0.68 on STS-B benchmark, maintains stable performance across 10 languages with sub-millisecond latency and negligible computational cost.

Conclusion: Meaningful semantic relations can emerge from deterministic geometry, offering a transparent and efficient alternative to data-driven embeddings.

Abstract: This paper introduces the Harmonic Token Projection (HTP), a reversible and deterministic framework for generating text embeddings without training, vocabularies, or stochastic parameters. Unlike neural embeddings that rely on statistical co-occurrence or optimization, HTP encodes each token analytically as a harmonic trajectory derived from its Unicode integer representation, establishing a bijective and interpretable mapping between discrete symbols and continuous vector space. The harmonic formulation provides phase-coherent projections that preserve both structure and reversibility, enabling semantic similarity estimation from purely geometric alignment. Experimental evaluation on the Semantic Textual Similarity Benchmark (STS-B) and its multilingual extension shows that HTP achieves a Spearman correlation of \r{ho} = 0.68 in English, maintaining stable performance across ten languages with negligible computational cost and sub-millisecond latency per sentence pair. This demonstrates that meaningful semantic relations can emerge from deterministic geometry, offering a transparent and efficient alternative to data-driven embeddings. Keywords: Harmonic Token Projection, reversible embedding, deterministic encoding, semantic similarity, multilingual representation.

[3] A centroid based framework for text classification in itsm environments

Hossein Mohanna, Ali Ait-Bachir

Main category: cs.CL

TL;DR: Dual-embedding centroid-based framework for hierarchical text classification in ITSM, achieving competitive performance with SVM while offering 5.9x faster training and 152x faster incremental updates.

Details

Motivation: Need for efficient and interpretable hierarchical text classification in IT Service Management systems for categorizing support tickets into tree-structured taxonomies.

Method: Dual-embedding centroid-based classification with separate semantic and lexical centroid representations per category, combined through reciprocal rank fusion at inference.

Result: Competitive performance with SVM (hierarchical F1: 0.731 vs 0.727), 5.9x faster training, up to 152x faster incremental updates, and 8.6-8.8x speedup across batch sizes when excluding embedding computation.

Conclusion: Method is suitable for production ITSM environments prioritizing interpretability and operational efficiency.

Abstract: Text classification with hierarchical taxonomies is a fundamental requirement in IT Service Management (ITSM) systems, where support tickets must be categorized into tree-structured taxonomies. We present a dual-embedding centroid-based classification framework that maintains separate semantic and lexical centroid representations per category, combining them through reciprocal rank fusion at inference time. The framework achieves performance competitive with Support Vector Machines (hierarchical F1: 0.731 vs 0.727) while providing interpretability through centroid representations. Evaluated on 8,968 ITSM tickets across 123 categories, this method achieves 5.9 times faster training and up to 152 times faster incremental updates. With 8.6-8.8 times speedup across batch sizes (100-1000 samples) when excluding embedding computation. These results make the method suitable for production ITSM environments prioritizing interpretability and operational efficiency.

[4] PIRA: Preference-Oriented Instruction-Tuned Reward Models with Dual Aggregation

Yongfu Xue

Main category: cs.CL

TL;DR: PIRA is a training paradigm that improves reward models for LLM alignment by reformulating question-answer pairs into preference instructions, aggregating rewards from diverse tasks, and stabilizing outputs with dropout averaging.

Details

Motivation: Traditional reward models have low data efficiency due to direct concatenation of questions and responses, and are vulnerable to reward overoptimization, limiting their effectiveness in aligning LLMs with human preferences.

Method: Three strategies: (1) Reformulate question-answer pairs into preference-based instructions for clearer task specification, (2) aggregate rewards from diverse preference tasks to reduce bias, (3) average value-head outputs under varying dropout rates to stabilize rewards.

Result: Extensive experiments demonstrated the effectiveness of PIRA in addressing the challenges of traditional reward models.

Conclusion: PIRA successfully addresses key limitations of traditional reward models through its three-component approach, providing a more robust and efficient training paradigm for LLM alignment with human preferences.

Abstract: Reward models are crucial for aligning Large Language Models (LLMs) with human preferences but face two representative challenges. First, traditional discriminative reward models usually concatenate questions and responses directly as input, resulting in low data efficiency. Second, reward models are vulnerable to reward overoptimization. We propose PIRA, a training paradigm addressing these issues through three strategies: (1) Reformulating question-answer pairs into preference-based instructions for clearer and more explicit task specification, (2) aggregating rewards from diverse preference tasks to reduce bias and improve robustness, and (3) averaging value-head outputs under varying dropout rates to stabilize rewards. Extensive experiments have demonstrated the effectiveness of PIRA.

[5] Structured Definitions and Segmentations for Legal Reasoning in LLMs: A Study on Indian Legal Data

Mann Khatri, Mirza Yusuf, Rajiv Ratn Shah, Ponnurangam Kumaraguru

Main category: cs.CL

TL;DR: LLMs struggle with legal tasks due to lack of domain-specific training. This paper shows that structuring legal documents by rhetorical roles and explaining legal terminology improves LLM performance on Indian legal judgment prediction by 1.5-4.36% F1 score.

Details

Motivation: LLMs lack domain-specific pretraining for law and struggle with long, complex legal documents. Previous work used in-context learning to address knowledge gaps, but legal text structure and terminology remain challenging.

Method: Three experiments: (i) reorganizing documents by rhetorical roles to improve long context processing, (ii) defining rhetorical roles to familiarize models with legal terminology, (iii) emulating court step-by-step reasoning about rhetorical roles. Conducted in zero-shot setting on three Indian legal judgment prediction datasets.

Result: Organizing data by rhetorical roles and explaining legal terms significantly improved model performance, with F1 score increases ranging from ~1.5% to 4.36% compared to baseline.

Conclusion: Structuring legal information through rhetorical roles and providing domain-specific terminology explanations effectively enhances LLM performance on legal tasks without requiring full domain retraining.

Abstract: Large Language Models (LLMs), trained on extensive datasets from the web, exhibit remarkable general reasoning skills. Despite this, they often struggle in specialized areas like law, mainly because they lack domain-specific pretraining. The legal field presents unique challenges, as legal documents are generally long and intricate, making it hard for models to process the full text efficiently. Previous studies have examined in-context approaches to address the knowledge gap, boosting model performance in new domains without full domain alignment. In our paper, we analyze model behavior on legal tasks by conducting experiments in three areas: (i) reorganizing documents based on rhetorical roles to assess how structured information affects long context processing and model decisions, (ii) defining rhetorical roles to familiarize the model with legal terminology, and (iii) emulating the step-by-step reasoning of courts regarding rhetorical roles to enhance model reasoning. These experiments are conducted in a zero-shot setting across three Indian legal judgment prediction datasets. Our results reveal that organizing data or explaining key legal terms significantly boosts model performance, with a minimum increase of ~1.5% and a maximum improvement of 4.36% in F1 score compared to the baseline.

Saad Mankarious, Ayah Zirikly, Daniel Wiechmann, Elma Kerz, Edward Kempa, Yu Qiao

Main category: cs.CL

TL;DR: MindSET is a new benchmark dataset for mental health analysis from Reddit with 13M annotated posts across 7 conditions, featuring rigorous preprocessing and outperforming previous benchmarks by up to 18 F1 points.

Details

Motivation: Existing mental health benchmarks are outdated due to limited data, inadequate cleaning, and inability to handle diverse social media content like multilingual and harmful material.

Method: Curated dataset from Reddit using self-reported diagnoses, applied rigorous preprocessing (language filtering, NSFW removal, deduplication), performed linguistic analysis with LIWC, and conducted binary classification experiments with fine-tuned language models and BoW features.

Result: MindSET contains over 13M annotated posts (more than twice previous benchmarks), models trained on it consistently outperformed previous benchmarks, achieving up to 18-point improvement in F1 for Autism detection.

Conclusion: MindSET provides a robust foundation for mental health research using social media data, supporting early risk detection and analysis of emerging psychological trends.

Abstract: Social media data has become a vital resource for studying mental health, offering real-time insights into thoughts, emotions, and behaviors that traditional methods often miss. Progress in this area has been facilitated by benchmark datasets for mental health analysis; however, most existing benchmarks have become outdated due to limited data availability, inadequate cleaning, and the inherently diverse nature of social media content (e.g., multilingual and harmful material). We present a new benchmark dataset, \textbf{MindSET}, curated from Reddit using self-reported diagnoses to address these limitations. The annotated dataset contains over \textbf{13M} annotated posts across seven mental health conditions, more than twice the size of previous benchmarks. To ensure data quality, we applied rigorous preprocessing steps, including language filtering, and removal of Not Safe for Work (NSFW) and duplicate content. We further performed a linguistic analysis using LIWC to examine psychological term frequencies across the eight groups represented in the dataset. To demonstrate the dataset utility, we conducted binary classification experiments for diagnosis detection using both fine-tuned language models and Bag-of-Words (BoW) features. Models trained on MindSET consistently outperformed those trained on previous benchmarks, achieving up to an \textbf{18-point} improvement in F1 for Autism detection. Overall, MindSET provides a robust foundation for researchers exploring the intersection of social media and mental health, supporting both early risk detection and deeper analysis of emerging psychological trends.

[7] Improved Visually Prompted Keyword Localisation in Real Low-Resource Settings

Leanne Nortje, Dan Oneata, Gabriel Pirlogeanu, Herman Kamper

Main category: cs.CL

TL;DR: This paper introduces a few-shot learning method for visually prompted keyword localisation (VPKL) that automatically mines positive and negative pairs without transcriptions, enabling application to low-resource languages like Yoruba.

Details

Motivation: Previous VPKL methods required transcriptions for contrastive loss training and were only tested on English. The goal is to enable VPKL for low-resource languages without written forms by eliminating the need for transcriptions.

Method: Proposes a few-shot learning scheme that automatically mines positive and negative pairs for contrastive loss without using transcriptions, making the approach applicable to unwritten languages.

Result: On English, the method shows only a small performance drop compared to using ground truth pairs. On Yoruba (a real low-resource language), performance is reasonable but shows a bigger drop due to less accurate automatic pair mining.

Conclusion: The proposed transcription-free approach enables VPKL for low-resource languages, though automatic pair mining accuracy varies across languages, with better performance on English than Yoruba.

Abstract: Given an image query, visually prompted keyword localisation (VPKL) aims to find occurrences of the depicted word in a speech collection. This can be useful when transcriptions are not available for a low-resource language (e.g. if it is unwritten). Previous work showed that VPKL can be performed with a visually grounded speech model trained on paired images and unlabelled speech. But all experiments were done on English. Moreover, transcriptions were used to get positive and negative pairs for the contrastive loss. This paper introduces a few-shot learning scheme to mine pairs automatically without transcriptions. On English, this results in only a small drop in performance. We also - for the first time - consider VPKL on a real low-resource language, Yoruba. While scores are reasonable, here we see a bigger drop in performance compared to using ground truth pairs because the mining is less accurate in Yoruba.

[8] Semantics Meet Signals: Dual Codebook Representationl Learning for Generative Recommendation

Zheng Hui, Xiaokai Wei, Reza Shirkavand, Chen Wang, Weizhi Zhang, Alejandro Peláez, Michelle Gong

Main category: cs.CL

TL;DR: FlexCode introduces a popularity-aware framework for generative recommendation that uses dual codebooks (collaborative filtering and semantic) with adaptive token allocation to better handle both popular and long-tail items.

Details

Motivation: Existing generative recommendation approaches use a single uniform codebook, which fails to address the imbalance between popular items with rich collaborative signals and long-tail items that require semantic understanding, limiting representational efficiency and generalization.

Method: FlexCode adaptively allocates a fixed token budget between a collaborative filtering codebook and a semantic codebook using a lightweight Mixture of Experts (MoE), with alignment and smoothness objectives to maintain coherence across the popularity spectrum.

Result: Experiments on public and industrial-scale datasets show that FlexCode consistently outperforms strong baselines, achieving stronger accuracy and better tail robustness.

Conclusion: FlexCode provides a new mechanism for token representation in generative recommenders that effectively balances memorization and generalization, offering improved handling of both popular and long-tail items.

Abstract: Generative recommendation has recently emerged as a powerful paradigm that unifies retrieval and generation, representing items as discrete semantic tokens and enabling flexible sequence modeling with autoregressive models. Despite its success, existing approaches rely on a single, uniform codebook to encode all items, overlooking the inherent imbalance between popular items rich in collaborative signals and long-tail items that depend on semantic understanding. We argue that this uniform treatment limits representational efficiency and hinders generalization. To address this, we introduce FlexCode, a popularity-aware framework that adaptively allocates a fixed token budget between a collaborative filtering (CF) codebook and a semantic codebook. A lightweight MoE dynamically balances CF-specific precision and semantic generalization, while an alignment and smoothness objective maintains coherence across the popularity spectrum. We perform experiments on both public and industrial-scale datasets, showing that FlexCode consistently outperform strong baselines. FlexCode provides a new mechanism for token representation in generative recommenders, achieving stronger accuracy and tail robustness, and offering a new perspective on balancing memorization and generalization in token-based recommendation models.

[9] Prompt Engineering Techniques for Context-dependent Text-to-SQL in Arabic

Saleh Almohaimeed, May Alsofyani, Saad Almohaimeed, Mansour Al Ghanim, Liqiang Wang

Main category: cs.CL

TL;DR: This paper introduces Ar-SParC, the first Arabic cross-domain, context-dependent text-to-SQL dataset, and presents experiments with GPT models using various prompt engineering techniques and a novel GAT corrector approach.

Details

Motivation: Most text-to-SQL research has focused on English and Chinese, leaving Arabic as an unexplored language in this domain. There was a need to address the lack of Arabic datasets and research for cross-domain, context-dependent text-to-SQL tasks.

Method: Created Ar-SParC dataset with 3,450 question sequences (10,225 total questions). Conducted 40 experiments using GPT-3.5-turbo and GPT-4.5-turbo with 10 prompt engineering techniques. Developed a novel GAT corrector approach and performed ablation studies.

Result: The GAT corrector enhanced performance across all experiments: average improvement of 1.9% EX and 1.9% IX in zero-shot settings, and 1.72% EX and 0.92% IX improvement in in-context learning settings.

Conclusion: The paper successfully addresses the gap in Arabic text-to-SQL research by introducing the first Arabic dataset and demonstrating effective methods including the novel GAT corrector approach that outperforms previous techniques.

Abstract: In recent years, the task of cross-domain, context-dependent text-to-SQL has received significant attention. Enables users with no prior knowledge of SQL to have a conversation with databases using natural language. However, most of the available datasets and research have been conducted in English, along with some work in Chinese. To this date, no effort has been made to address this task in the Arabic language. In this paper, we introduce Ar-SParC, the first Arabic cross-domain, context-dependent text-to-SQL dataset. The dataset consists of 3,450 sequences of interrelated questions, each sequence containing an average of approximately three questions, which results in a total of 10225 questions along with their corresponding SQL queries. We conducted 40 experiments on the Ar-SParC dataset using two large language models, GPT-3.5-turbo and GPT-4.5-turbo, applying 10 different prompt engineering techniques, including four question representation methods and six in-context learning techniques. Furthermore, we developed a novel approach named GAT corrector, which enhanced the performance across all 40 experiments, yielding an average improvement of 1.9% in execution accuracy (EX) and 1.9% in interaction accuracy (IX) under zero-shot settings, and an average increase of 1.72% EX and 0.92% IX under in-context learning settings. Finally, we conducted an ablation study with two more experiments to explain why the GAT corrector outperformed the previous GAT verifier technique, particularly for the Arabic language.

[10] Cognitive bias in LLM reasoning compromises interpretation of clinical oncology notes

Matthew W. Kenaston, Umair Ayub, Mihir Parmar, Muhammad Umair Anjum, Syed Arsalan Ahmed Naqvi, Priya Kumar, Samarth Rawal, Aadel A. Chaudhuri, Yousef Zakharia, Elizabeth I. Heath, Tanios S. Bekaii-Saab, Cui Tao, Eliezer M. Van Allen, Ben Zhou, YooJung Choi, Chitta Baral, Irbaz Bin Riaz

Main category: cs.CL

TL;DR: Large language models can reach correct conclusions through faulty reasoning, posing safety risks in oncology decision support. A hierarchical taxonomy of reasoning errors was developed and validated, showing 23% error rate with confirmation and anchoring biases most common, leading to guideline-discordant recommendations.

Details

Motivation: Despite high performance on clinical benchmarks, LLMs may reach correct conclusions through faulty reasoning, creating safety implications for oncology decision support that accuracy-based evaluation doesn't capture.

Method: Developed hierarchical taxonomy of reasoning errors from GPT-4 chain-of-thought responses to real oncology notes. Annotated 600 reasoning traces from breast/pancreatic cancer notes, then validated on 822 responses from prostate cancer consult notes spanning localized to metastatic disease.

Result: Reasoning errors occurred in 23% of interpretations, dominated by confirmation bias and anchoring bias. These failures were associated with guideline-discordant and potentially harmful recommendations, especially in advanced disease management. Automated evaluators could detect error presence but not reliably classify subtypes.

Conclusion: LLMs may provide fluent but clinically unsafe recommendations when reasoning is flawed. The taxonomy provides a generalizable framework for evaluating and improving reasoning fidelity before clinical deployment.

Abstract: Despite high performance on clinical benchmarks, large language models may reach correct conclusions through faulty reasoning, a failure mode with safety implications for oncology decision support that is not captured by accuracy-based evaluation. In this two-cohort retrospective study, we developed a hierarchical taxonomy of reasoning errors from GPT-4 chain-of-thought responses to real oncology notes and tested its clinical relevance. Using breast and pancreatic cancer notes from the CORAL dataset, we annotated 600 reasoning traces to define a three-tier taxonomy mapping computational failures to cognitive bias frameworks. We validated the taxonomy on 822 responses from prostate cancer consult notes spanning localized through metastatic disease, simulating extraction, analysis, and clinical recommendation tasks. Reasoning errors occurred in 23 percent of interpretations and dominated overall errors, with confirmation bias and anchoring bias most common. Reasoning failures were associated with guideline-discordant and potentially harmful recommendations, particularly in advanced disease management. Automated evaluators using state-of-the-art language models detected error presence but could not reliably classify subtypes. These findings show that large language models may provide fluent but clinically unsafe recommendations when reasoning is flawed. The taxonomy provides a generalizable framework for evaluating and improving reasoning fidelity before clinical deployment.

[11] ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration

Hongjin Su, Shizhe Diao, Ximing Lu, Mingjie Liu, Jiacheng Xu, Xin Dong, Yonggan Fu, Peter Belcak, Hanrong Ye, Hongxu Yin, Yi Dong, Evelina Bakhturina, Tao Yu, Yejin Choi, Jan Kautz, Pavlo Molchanov

Main category: cs.CL

TL;DR: ToolOrchestra trains small orchestrator models to coordinate multiple tools, achieving better performance at lower cost than larger models like GPT-5 on complex reasoning tasks.

Details

Motivation: Large language models are computationally expensive for solving deep problems like Humanity's Last Exam, and there's a need for more efficient and effective tool-augmented reasoning systems.

Method: ToolOrchestra uses reinforcement learning with rewards for outcomes, efficiency, and user preferences to train small orchestrators that manage other models and tools.

Result: Orchestrator (8B model) achieves 37.1% on HLE (outperforming GPT-5’s 35.1%) with 2.5x efficiency, and surpasses GPT-5 on tau2-Bench and FRAMES using only 30% of the cost.

Conclusion: Lightweight orchestration models composing diverse tools are more efficient and effective than existing methods, enabling practical and scalable tool-augmented reasoning systems.

Abstract: Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity’s Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.

[12] Dynamic Template Selection for Output Token Generation Optimization: MLP-Based and Transformer Approaches

Bharadwaj Yadavalli

Main category: cs.CL

TL;DR: Dynamic Template Selection (DTS) adaptively matches response templates to query complexity to reduce token costs without compromising quality, achieving 90.5% routing accuracy and 32-34% token reductions across major LLM providers.

Details

Motivation: Current uniform prompting strategies across diverse query types lead to substantial token inefficiency, amplified by the 4-8x higher cost of output tokens compared to input tokens across major providers.

Method: Two routing approaches: a simple MLP using pre-computed embeddings and a fine-tuned RoBERTa transformer, evaluated on 1,000 MMLU questions and validated through 9,000 production API calls across 3 major LLM providers.

Result: MLP router achieved 90.5% routing accuracy (slightly better than RoBERTa’s 89.5%) with 125M fewer parameters. Token reductions varied from 32.6% to 33.9% across providers while maintaining consistent routing accuracy.

Conclusion: DTS provides significant cost reductions through adaptive template selection, with provider-agnostic routing behavior and better performance from simpler MLP approach compared to complex transformer models.

Abstract: Contemporary large language model deployments typically employ uniform prompting strategies across diverse query types, applying verbose response patterns to both complex analytical tasks and straightforward factual questions. This one-size-fits-all methodology leads to substantial token inefficiency, a concern amplified by the significant cost differential between input and output tokens–the latter commanding 4-8x higher prices across major providers. We present Dynamic Template Selection (DTS), which adaptively matches response templates to query complexity, achieving significant cost reductions without compromising response quality. We compared two routing approaches: a simple MLP that uses pre-computed embeddings and a more complex fine-tuned RoBERTa transformer. Through comprehensive evaluation on 1,000 MMLU questions, we find that the MLP router achieves 90.5% routing accuracy on held-out test data, marginally exceeding RoBERTa’s performance (89.5%) despite utilizing 125M fewer parameters. Notably, our empirical analysis reveals provider-agnostic behavior in template selection–routing decisions generalize effectively across 3 major LLM providers (OpenAI GPT-4, Google Gemini, and Anthropic Claude), as validated through 9,000 production API calls. While routing accuracy remains consistent at 90.5% across providers, observed token reductions vary from 32.6% to 33.9%, reflecting provider-specific generation characteristics. This work contributes several key elements: formal problem formulation with theoretical grounding in machine learning, four algorithms with corresponding complexity analyses, and extensive empirical validation across production systems.

[13] ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features

Ye Bhone Lin, Thura Aung, Ye Kyaw Thu, Thazin Myint Oo

Main category: cs.CL

TL;DR: First study on ASR error correction for Burmese using sequence-to-sequence Transformers with IPA and alignment features, achieving significant WER reduction and chrF++ improvement.

Details

Motivation: Address ASR error correction specifically for low-resource Burmese language, which lacks prior research in this area.

Method: Sequence-to-sequence Transformer models with different feature integration strategies including IPA and alignment information, evaluated on five ASR backbones.

Result: AEC model reduced average WER from 51.56 to 39.82 (43.59 after augmentation) and improved chrF++ scores from 0.5864 to 0.627, showing consistent gains over baseline ASR outputs.

Conclusion: AEC is robust and feature design is crucial for improving ASR outputs in low-resource settings, with IPA and alignment features proving effective.

Abstract: This paper investigates sequence-to-sequence Transformer models for automatic speech recognition (ASR) error correction in low-resource Burmese, focusing on different feature integration strategies including IPA and alignment information. To our knowledge, this is the first study addressing ASR error correction specifically for Burmese. We evaluate five ASR backbones and show that our ASR Error Correction (AEC) approaches consistently improve word- and character-level accuracy over baseline outputs. The proposed AEC model, combining IPA and alignment features, reduced the average WER of ASR models from 51.56 to 39.82 before augmentation (and 51.56 to 43.59 after augmentation) and improving chrF++ scores from 0.5864 to 0.627, demonstrating consistent gains over the baseline ASR outputs without AEC. Our results highlight the robustness of AEC and the importance of feature design for improving ASR outputs in low-resource settings.

[14] LLMs-Powered Accurate Extraction, Querying and Intelligent Management of Literature derived 2D Materials Data

Lijun Shang, Yadong Yu, Wenqiang Kang, Jian Zhou, Dongyue Gao, Pan Xiang, Zhe Liu, Mengyan Dai, Zhonglu Guo, Zhimei Sun

Main category: cs.CL

TL;DR: 2D materials have unique properties for energy applications, but information is scattered across research papers, making synthesis data hard to find.

Details

Motivation: To address the challenge of dispersed synthesis information for 2D materials in research literature, which hinders efficient material development for energy storage and conversion applications.

Method: The paper likely involves developing a systematic approach or database to collect and organize synthesis methods and properties of 2D materials from published research papers.

Result: Expected to provide a comprehensive collection of 2D material synthesis methods and properties, enabling easier access to valuable information for researchers.

Conclusion: Organizing scattered 2D material synthesis data from research papers will accelerate development of energy storage and conversion technologies.

Abstract: Two-dimensional (2D) materials have showed widespread applications in energy storage and conversion owning to their unique physicochemical, and electronic properties. Most of the valuable information for the materials, such as their properties and preparation methods, is included in the published research papers. However, due to the dispersion of synthe

[15] Memories Retrieved from Many Paths: A Multi-Prefix Framework for Robust Detection of Training Data Leakage in Large Language Models

Trung Cuong Dang, David Mohaisen

Main category: cs.CL

TL;DR: The paper introduces multi-prefix memorization, a new framework that defines memorization based on the number of distinct prefixes that can elicit a sequence from an LLM, providing a more robust measure than single-path extraction methods.

Details

Motivation: Existing memorization definitions have shortcomings in comprehensively capturing memorization in aligned models, creating privacy and copyright risks due to verbatim memorization of training data in large language models.

Method: Proposed multi-prefix memorization framework where a sequence is considered memorized if an external adversarial search can identify a target count of distinct prefixes that elicit it, focusing on the diversity of retrieval paths rather than single-path extraction.

Result: Experiments on open-source and aligned chat models show that the multi-prefix definition reliably distinguishes memorized from non-memorized data, providing a robust tool for auditing data leakage.

Conclusion: The multi-prefix memorization framework offers a practical and reliable approach to quantify memorization robustness in LLMs, addressing limitations of previous definitions and enabling better assessment of privacy and copyright risks.

Abstract: Large language models, trained on massive corpora, are prone to verbatim memorization of training data, creating significant privacy and copyright risks. While previous works have proposed various definitions for memorization, many exhibit shortcomings in comprehensively capturing this phenomenon, especially in aligned models. To address this, we introduce a novel framework: multi-prefix memorization. Our core insight is that memorized sequences are deeply encoded and thus retrievable via a significantly larger number of distinct prefixes than non-memorized content. We formalize this by defining a sequence as memorized if an external adversarial search can identify a target count of distinct prefixes that elicit it. This framework shifts the focus from single-path extraction to quantifying the robustness of a memory, measured by the diversity of its retrieval paths. Through experiments on open-source and aligned chat models, we demonstrate that our multi-prefix definition reliably distinguishes memorized from non-memorized data, providing a robust and practical tool for auditing data leakage in LLMs.

[16] SAGE: An Agentic Explainer Framework for Interpreting SAE Features in Language Models

Jiaojiao Han, Wujiang Xu, Mingyu Jin, Mengnan Du

Main category: cs.CL

TL;DR: SAGE is an agent-based framework that improves feature interpretation in sparse autoencoders by using an active, iterative process of explanation formulation and testing.

Details

Motivation: LLMs are opaque and hard to interpret, making their safe deployment challenging. While sparse autoencoders help decompose representations, explaining their features remains difficult.

Method: SAGE uses an agent-based approach that systematically formulates multiple explanations per feature, designs targeted experiments to test them, and iteratively refines explanations based on activation feedback.

Result: Experiments show SAGE produces explanations with significantly higher generative and predictive accuracy compared to state-of-the-art baselines across diverse language models.

Conclusion: SAGE successfully transforms feature interpretation from passive generation to an active, explanation-driven process, improving interpretability of LLM representations.

Abstract: Large language models (LLMs) have achieved remarkable progress, yet their internal mechanisms remain largely opaque, posing a significant challenge to their safe and reliable deployment. Sparse autoencoders (SAEs) have emerged as a promising tool for decomposing LLM representations into more interpretable features, but explaining the features captured by SAEs remains a challenging task. In this work, we propose SAGE (SAE AGentic Explainer), an agent-based framework that recasts feature interpretation from a passive, single-pass generation task into an active, explanation-driven process. SAGE implements a rigorous methodology by systematically formulating multiple explanations for each feature, designing targeted experiments to test them, and iteratively refining explanations based on empirical activation feedback. Experiments on features from SAEs of diverse language models demonstrate that SAGE produces explanations with significantly higher generative and predictive accuracy compared to state-of-the-art baselines.an agent-based framework that recasts feature interpretation from a passive, single-pass generation task into an active, explanationdriven process. SAGE implements a rigorous methodology by systematically formulating multiple explanations for each feature, designing targeted experiments to test them, and iteratively refining explanations based on empirical activation feedback. Experiments on features from SAEs of diverse language models demonstrate that SAGE produces explanations with significantly higher generative and predictive accuracy compared to state-of-the-art baselines.

[17] Structured Prompting Enables More Robust, Holistic Evaluation of Language Models

Asad Aali, Muhammad Ahmed Mohsin, Vasiliki Bikia, Arnav Singhvi, Richard Gaus, Suhana Bedi, Hejie Cui, Miguel Fuentes, Alyssa Unell, Yifan Mai, Jordan Cahoon, Michael Pfeffer, Roxana Daneshjou, Sanmi Koyejo, Emily Alsentzer, Percy Liang, Christopher Potts, Nigam H. Shah, Akshay S. Chaudhari

Main category: cs.CL

TL;DR: The paper introduces DSPy+HELM framework that uses structured prompting methods to more accurately estimate language model performance ceilings, revealing that traditional HELM benchmarks underestimate LM capabilities and misrepresent performance rankings.

Details

Motivation: Existing benchmarking frameworks like HELM use fixed prompts that fail to generalize across LMs, leading to unrepresentative performance estimates and potentially underestimating model capabilities.

Method: Developed a reproducible DSPy+HELM framework with four structured prompting methods that elicit reasoning, evaluated across four frontier LMs and seven benchmarks in general and medical domains.

Result: Without structured prompting: HELM underestimates LM performance by 4% average, increases performance variance, flips leaderboard rankings on 3/7 benchmarks, and reasoning methods reduce LM sensitivity to prompt design.

Conclusion: Scalable performance ceiling estimation through structured prompting enables more accurate and decision-useful benchmarks, with open-source tools provided for integration and optimization.

Abstract: As language models (LMs) are increasingly adopted across domains, high-quality benchmarking frameworks that accurately estimate performance are essential for guiding deployment decisions. While frameworks such as Holistic Evaluation of Language Models (HELM) enable broad evaluation across tasks, they often rely on fixed prompts that fail to generalize across LMs, yielding unrepresentative performance estimates. Unless we estimate each LM’s ceiling (maximum achievable via changes to the prompt), we risk underestimating performance. Declarative prompting frameworks, such as DSPy, offer a scalable alternative to manual prompt engineering by crafting structured prompts that can be optimized per task. However, such frameworks have not been systematically evaluated across established benchmarks. We present a reproducible DSPy+HELM framework that introduces structured prompting methods which elicit reasoning, enabling more accurate LM benchmarking. Using four prompting methods, we evaluate four frontier LMs across seven benchmarks (general/medical domain) against existing HELM baseline scores. We find that without structured prompting: (i) HELM underestimates LM performance (by 4% average), (ii) performance estimates vary more across benchmarks (+2% standard deviation), (iii) performance gaps are misrepresented (leaderboard rankings flip on 3/7 benchmarks), and (iv) introducing reasoning (chain-of-thought) reduces LM sensitivity to prompt design (smaller Δ across prompts). To our knowledge, this is the first large-scale benchmarking study to empirically characterize LM behavior across benchmarks and prompting methods, showing that scalable performance ceiling estimation enables more decision-useful benchmarks. We open-source (i) DSPy+HELM Integration (https://github.com/stanford-crfm/helm/pull/3893) and (ii) Prompt Optimization Pipeline (https://github.com/StanfordMIMI/dspy-helm).

[18] LightMem: Lightweight and Efficient Memory-Augmented Generation

Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Huajun Chen, Ningyu Zhang

Main category: cs.CL

TL;DR: LightMem is an efficient memory system for LLMs that organizes memory into three stages (sensory, short-term, long-term) inspired by human memory models, achieving significant performance improvements while drastically reducing computational overhead.

Details

Motivation: LLMs struggle to effectively leverage historical interaction information in dynamic environments, and existing memory systems introduce substantial time and computational overhead.

Method: Three-stage memory system: 1) Cognition-inspired sensory memory filters irrelevant info via lightweight compression and topic grouping, 2) Topic-aware short-term memory consolidates topic groups with structured access, 3) Long-term memory with sleep-time update uses offline procedures decoupled from online inference.

Result: On LongMemEval and LoCoMo benchmarks using GPT and Qwen backbones: improves QA accuracy by up to 7.7%/29.3%, reduces total token usage by up to 38x/20.9x, reduces API calls by up to 30x/55.5x, with online test-time costs achieving up to 106x/117x token reduction and 159x/310x fewer API calls.

Conclusion: LightMem effectively balances performance and efficiency in memory systems for LLMs, enabling more effective utilization of historical interaction information while minimizing computational overhead.

Abstract: Despite their remarkable capabilities, Large Language Models (LLMs) struggle to effectively leverage historical interaction information in dynamic and complex environments. Memory systems enable LLMs to move beyond stateless interactions by introducing persistent information storage, retrieval, and utilization mechanisms. However, existing memory systems often introduce substantial time and computational overhead. To this end, we introduce a new memory system called LightMem, which strikes a balance between the performance and efficiency of memory systems. Inspired by the Atkinson-Shiffrin model of human memory, LightMem organizes memory into three complementary stages. First, cognition-inspired sensory memory rapidly filters irrelevant information through lightweight compression and groups information according to their topics. Next, topic-aware short-term memory consolidates these topic-based groups, organizing and summarizing content for more structured access. Finally, long-term memory with sleep-time update employs an offline procedure that decouples consolidation from online inference. On LongMemEval and LoCoMo, using GPT and Qwen backbones, LightMem consistently surpasses strong baselines, improving QA accuracy by up to 7.7% / 29.3%, reducing total token usage by up to 38x / 20.9x and API calls by up to 30x / 55.5x, while purely online test-time costs are even lower, achieving up to 106x / 117x token reduction and 159x / 310x fewer API calls. The code is available at https://github.com/zjunlp/LightMem.

[19] Length-MAX Tokenizer for Language Models

Dong Dong, Weijie Su

Main category: cs.CL

TL;DR: Length-MAX tokenizer reduces tokens per character by treating vocabulary selection as graph partitioning, achieving 14-18% fewer tokens than BPE and faster training/inference while improving downstream performance.

Details

Motivation: Current tokenizers like BPE don't optimize for token length efficiency, leading to unnecessary computational overhead in training and inference. The goal is to minimize average tokens per character to improve language model efficiency.

Method: Casts length-weighted objective maximization as graph partitioning problem and develops greedy approximation algorithm to obtain vocabulary that minimizes tokens per character.

Result: 14-18% fewer tokens than BPE across various vocabulary sizes, 18.5% fewer training steps to reach validation loss, 13.7% lower inference latency, 16% throughput gain, improved downstream performance (11.7% lower LAMBADA perplexity, 4.3% higher HellaSwag accuracy), 99.62% vocabulary coverage.

Conclusion: Optimizing for average token length rather than frequency alone provides more efficient language modeling without sacrificing downstream performance, with practical benefits including 18% memory reduction for embeddings and KV-cache.

Abstract: We introduce a new tokenizer for language models that minimizes the average tokens per character, thereby reducing the number of tokens needed to represent text during training and to generate text during inference. Our method, which we refer to as the Length-MAX tokenizer, obtains its vocabulary by casting a length-weighted objective maximization as a graph partitioning problem and developing a greedy approximation algorithm. On FineWeb and diverse domains, it yields 14–18% fewer tokens than Byte Pair Encoding (BPE) across vocabulary sizes from 10K to 50K, and the reduction is 13.0% when the size is 64K. Training GPT-2 models at 124M, 355M, and 1.3B parameters from scratch with five runs each shows 18.5%, 17.2%, and 18.5% fewer steps, respectively, to reach a fixed validation loss, and 13.7%, 12.7%, and 13.7% lower inference latency, together with a 16% throughput gain at 124M, while consistently improving on downstream tasks including reducing LAMBADA perplexity by 11.7% and enhancing HellaSwag accuracy by 4.3%. Moreover, the Length-MAX tokenizer achieves 99.62% vocabulary coverage and the out-of-vocabulary rate remains low at 0.12% on test sets. These results demonstrate that optimizing for average token length, rather than frequency alone, offers an effective approach to more efficient language modeling without sacrificing – and often improving – downstream performance. The tokenizer is compatible with production systems and reduces embedding and KV-cache memory by 18% at inference.

[20] Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, Chi Wang, Shuo Chen, Fernando Pereira, Wang-Cheng Kang, Derek Zhiyuan Cheng

Main category: cs.CL

TL;DR: Evo-Memory is a benchmark and framework for evaluating self-evolving memory in LLM agents, addressing the gap in dynamic memory evolution across continuous task streams.

Details

Motivation: Current LLM evaluations focus on static conversational settings, overlooking the need for dynamic memory accumulation and reuse in real-world environments where LLMs handle continuous task streams and fail to learn from accumulated interactions.

Method: Evo-Memory structures datasets into sequential task streams, implements over ten memory modules, provides baseline ExpRAG for experience retrieval, and proposes ReMem pipeline integrating reasoning, actions, and memory updates.

Result: The framework enables evaluation across 10 diverse multi-turn goal-oriented and single-turn reasoning datasets, demonstrating the ability of LLMs to search, adapt, and evolve memory after each interaction.

Conclusion: Evo-Memory bridges the gap in evaluating self-evolving memory capabilities in LLM agents, providing a comprehensive benchmark for continual improvement through memory evolution in dynamic environments.

Abstract: Statefulness is essential for large language model (LLM) agents to perform long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet often fail to learn from accumulated interactions, losing valuable contextual insights, a limitation that calls for test-time evolution, where LLMs retrieve, integrate, and update memory continuously during deployment. To bridge this gap, we introduce Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. Evo-Memory structures datasets into sequential task streams, requiring LLMs to search, adapt, and evolve memory after each interaction. We unify and implement over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets. To better benchmark experience reuse, we provide a baseline method, ExpRAG, for retrieving and utilizing prior experience, and further propose ReMem, an action-think-memory refine pipeline that tightly integrates reasoning, task actions, and memory updates to achieve continual improvement.

[21] Winning with Less for Low Resource Languages: Advantage of Cross-Lingual English_Persian Argument Mining Model over LLM Augmentation

Ali Jahan, Masood Ghayoomi, Annette Hautli-Janisz

Main category: cs.CL

TL;DR: Cross-lingual argument mining approach for low-resource languages using three training scenarios: zero-shot transfer, LLM-based augmentation, and cross-lingual training, showing that cross-lingual models outperform others for Persian.

Details

Motivation: To address data scarcity in argument mining for low-resource languages by leveraging cross-lingual approaches and comparing different training strategies.

Method: Three training scenarios: (i) zero-shot transfer from English to Persian, (ii) English training enhanced with LLM-generated synthetic examples, (iii) cross-lingual model combining original English and manually translated Persian data.

Result: Zero-shot: 50.2% F1 (English), 50.7% (Persian); LLM-augmented: 59.2% (English), 69.3% (Persian); Cross-lingual: 74.8% F1 (Persian only). Cross-lingual model outperforms others.

Conclusion: Lightweight cross-lingual approach is more effective than resource-intensive LLM augmentation for argument mining in low-resource languages, providing practical solution for data scarcity.

Abstract: Argument mining is a subfield of natural language processing to identify and extract the argument components, like premises and conclusions, within a text and to recognize the relations between them. It reveals the logical structure of texts to be used in tasks like knowledge extraction. This paper aims at utilizing a cross-lingual approach to argument mining for low-resource languages, by constructing three training scenarios. We examine the models on English, as a high-resource language, and Persian, as a low-resource language. To this end, we evaluate the models based on the English Microtext corpus \citep{PeldszusStede2015}, and its parallel Persian translation. The learning scenarios are as follow: (i) zero-shot transfer, where the model is trained solely with the English data, (ii) English-only training enhanced by synthetic examples generated by Large Language Models (LLMs), and (iii) a cross-lingual model that combines the original English data with manually translated Persian sentences. The zero-shot transfer model attains F1 scores of 50.2% on the English test set and 50.7% on the Persian test set. LLM-based augmentation model improves the performance up to 59.2% on English and 69.3% on Persian. The cross-lingual model, trained on both languages but evaluated solely on the Persian test set, surpasses the LLM-based variant, by achieving a F1 of 74.8%. Results indicate that a lightweight cross-lingual blend can outperform considerably the more resource-intensive augmentation pipelines, and it offers a practical pathway for the argument mining task to overcome data resource shortage on low-resource languages.

[22] Emergence and Localisation of Semantic Role Circuits in LLMs

Nura Aljaafari, Danilo S. Carvalho, André Freitas

Main category: cs.CL

TL;DR: LLMs form compact, causally isolated circuits for semantic roles through gradual refinement, with partial transfer across model scales and architectures.

Details

Motivation: To characterize the internal mechanisms that ground abstract semantic structure in large language models, despite their demonstrated semantic competence.

Method: Integration of role-cross minimal pairs, temporal emergence analysis, and cross-model comparison to study semantic role implementation in LLMs.

Result: Found highly concentrated circuits (89-94% attribution within 28 nodes), gradual structural refinement without phase transitions, and moderate cross-scale conservation (24-59% component overlap) with high spectral similarity.

Conclusion: LLMs develop compact, causally isolated mechanisms for abstract semantic structure that exhibit partial transfer across different scales and architectures.

Abstract: Despite displaying semantic competence, large language models’ internal mechanisms that ground abstract semantic structure remain insufficiently characterised. We propose a method integrating role-cross minimal pairs, temporal emergence analysis, and cross-model comparison to study how LLMs implement semantic roles. Our analysis uncovers: (i) highly concentrated circuits (89-94% attribution within 28 nodes); (ii) gradual structural refinement rather than phase transitions, with larger models sometimes bypassing localised circuits; and (iii) moderate cross-scale conservation (24-59% component overlap) alongside high spectral similarity. These findings suggest that LLMs form compact, causally isolated mechanisms for abstract semantic structure, and these mechanisms exhibit partial transfer across scales and architectures.

[23] Chatty-KG: A Multi-Agent AI System for On-Demand Conversational Question Answering over Knowledge Graphs

Reham Omar, Abdelghny Orogat, Ibrahim Abdelaziz, Omij Mangukiya, Panos Kalnis, Essam Mansour

Main category: cs.CL

TL;DR: Chatty-KG is a modular multi-agent system for conversational QA over knowledge graphs that combines RAG-style retrieval with structured SPARQL query execution through specialized LLM agents.

Details

Motivation: To address limitations of existing KGQA systems that struggle with multi-turn conversations, coreference resolution, and context tracking while maintaining low latency and preserving KG structure.

Method: Uses task-specialized LLM agents for contextual interpretation, dialogue tracking, entity/relation linking, and query planning to translate natural questions into executable SPARQL queries.

Result: Significantly outperforms state-of-the-art baselines in both single-turn and multi-turn settings with higher F1 and P@1 scores on large diverse KGs.

Conclusion: Chatty-KG unifies conversational flexibility with structured KG grounding, offering scalable and extensible multi-turn KGQA without fine-tuning or pre-processing.

Abstract: Conversational Question Answering over Knowledge Graphs (KGs) combines the factual grounding of KG-based QA with the interactive nature of dialogue systems. KGs are widely used in enterprise and domain applications to provide structured, evolving, and reliable knowledge. Large language models (LLMs) enable natural and context-aware conversations, but lack direct access to private and dynamic KGs. Retrieval-augmented generation (RAG) systems can retrieve graph content but often serialize structure, struggle with multi-turn context, and require heavy indexing. Traditional KGQA systems preserve structure but typically support only single-turn QA, incur high latency, and struggle with coreference and context tracking. To address these limitations, we propose Chatty-KG, a modular multi-agent system for conversational QA over KGs. Chatty-KG combines RAG-style retrieval with structured execution by generating SPARQL queries through task-specialized LLM agents. These agents collaborate for contextual interpretation, dialogue tracking, entity and relation linking, and efficient query planning, enabling accurate and low-latency translation of natural questions into executable queries. Experiments on large and diverse KGs show that Chatty-KG significantly outperforms state-of-the-art baselines in both single-turn and multi-turn settings, achieving higher F1 and P@1 scores. Its modular design preserves dialogue coherence and supports evolving KGs without fine-tuning or pre-processing. Evaluations with commercial (e.g., GPT-4o, Gemini-2.0) and open-weight (e.g., Phi-4, Gemma 3) LLMs confirm broad compatibility and stable performance. Overall, Chatty-KG unifies conversational flexibility with structured KG grounding, offering a scalable and extensible approach for reliable multi-turn KGQA.

[24] TrackList: Tracing Back Query Linguistic Diversity for Head and Tail Knowledge in Open Large Language Models

Ioana Buhnila, Aman Sinha, Mathieu Constant

Main category: cs.CL

TL;DR: LLMs perform best on definition-type queries but struggle with exemplification tasks, with performance varying based on concept frequency in pre-training data.

Details

Motivation: To investigate why LLMs excel at definition-type answers but perform poorly on other answer types like examples and paraphrases, and to examine how pre-training data frequency affects performance.

Method: Used TrackList analysis pipeline and RefoMed-EN dataset (6170 medical terms with human annotations) to evaluate LLM performance across different query types, using syntactic/semantic similarity metrics, statistical correlations, and embeddings.

Result: LLMs showed highest performance on definition-type questions and lowest on exemplification tasks. For definitions, models paraphrase more on frequent knowledge and less on technical/tail knowledge, especially in expert texts.

Conclusion: LLM performance varies significantly by answer type and concept frequency, with definition tasks being strongest and exemplification weakest, highlighting limitations in handling diverse linguistic queries beyond definitions.

Abstract: Large Language Models (LLMs) have proven efficient in giving definition-type answers to user input queries. While for humans giving various types of answers, such as examples and paraphrases, is an easy task, LLMs struggle to provide correct answers for other than definition-type queries. In this study, we evaluated this drop in performance using TrackList, a fine-grained linguistic and statistical analysis pipeline to investigate the impact of the pre-training data on LLMs answers to diverse linguistic queries. We also introduce RefoMed-EN, an English dataset consisting of 6170 human-annotated medical terms alongside their corresponding definitions, denominations, exemplifications, explanations, or paraphrases. We studied whether the high frequency of a concept (head) or low frequency (tail) impacts the language model’s performance. We evaluated the quality of the LLM’s output using syntactic and semantic similarity metrics, statistical correlations and embeddings. Results showed that the LLM’s task performance for definition type questions is the highest, while for the exemplification type it is the lowest. Additionally, we showed that for definition-type questions, large language models are prone to paraphrase more on popular and frequent knowledge and less on tail and technical knowledge, especially in the expert texts.

[25] Semantic Anchors in In-Context Learning: Why Small LLMs Cannot Flip Their Labels

Anantha Padmanaban Krishna Kumar

Main category: cs.CL

TL;DR: LLMs cannot override pre-trained label semantics through in-context learning - they primarily refine existing semantic directions rather than remapping label meanings.

Details

Motivation: To determine whether in-context learning can override pre-trained label semantics or merely refines existing semantic backbones.

Method: Treat LLMs as prompt-induced classifiers and contrast behavior under natural demonstrations (correct labels) vs inverted demonstrations (flipped label meanings), using three alignment metrics and semantic override rate.

Result: Models cannot learn coherent anti-semantic classifiers - semantic override rates remain exactly zero, and prompt alignment increases only by sacrificing accuracy.

Conclusion: ICL primarily adjusts how inputs project onto stable semantic directions learned during pre-training, suggesting fundamental limits of few-shot prompting for overriding label semantics.

Abstract: Can in-context learning (ICL) override pre-trained label semantics, or does it merely refine an existing semantic backbone? We address this question by treating LLMs as prompt-induced classifiers and contrasting their behavior under \emph{natural} demonstrations (with correct labels) and \emph{inverted} demonstrations (systematically flipping label meanings). We decompose ICL behavior into three alignment metrics (truth, prior, and prompt alignment) and introduce a semantic override rate, defined as correctness under flipped semantics. Across eight classification tasks and eight open-source LLMs (1–12B parameters), we find consistent evidence for a semantic anchor view. With natural demonstrations, ICL improves accuracy while maintaining strong prior alignment; most correct predictions coincide with zero-shot behavior, even when the prior is weak. With inverted demonstrations, models cannot learn coherent anti-semantic classifiers: prompt alignment increases only by sacrificing accuracy, and semantic override rates remain exactly zero in our few-shot 1–12B setting. Rather than flexibly remapping label meanings, ICL primarily adjusts how inputs project onto stable semantic directions learned during pre-training, clarifying fundamental limits of few-shot prompting and suggesting that overriding label semantics at these scales requires interventions beyond ICL. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/semantic-anchors-icl.

[26] Context-Aware Pragmatic Metacognitive Prompting for Sarcasm Detection

Michael Iskandardinata, William Christian, Derwin Suhartono

Main category: cs.CL

TL;DR: A retrieval-aware approach using contextual information improves sarcasm detection in LLMs, with web-based retrieval boosting performance by 9.87% on culturally specific datasets and self-knowledge retrieval improving results on standard benchmarks.

Details

Motivation: Sarcasm detection remains challenging for PLMs and LLMs due to linguistic diversity, cultural variations, and unreliable detection of words requiring extra grounding. Current models struggle with context-dependent sarcasm.

Method: Builds on Pragmatic Metacognitive Prompting (PMP) by adding two retrieval strategies: web-based retrieval for external context when models lack background knowledge, and self-knowledge retrieval to elicit the model’s internal knowledge.

Result: Non-parametric retrieval improved macro-F1 by 9.87% on Twitter Indonesia Sarcastic dataset. Self-knowledge retrieval improved macro-F1 by 3.29% on SemEval and 4.08% on MUStARD compared to original PMP method.

Conclusion: Contextual information is crucial for enhancing LLM performance in sarcasm detection, especially for culturally specific slang and unknown terms. Future work will optimize retrieval quality and relevance assessment.

Abstract: Detecting sarcasm remains a challenging task in the areas of Natural Language Processing (NLP) despite recent advances in neural network approaches. Currently, Pre-trained Language Models (PLMs) and Large Language Models (LLMs) are the preferred approach for sarcasm detection. However, the complexity of sarcastic text, combined with linguistic diversity and cultural variation across communities, has made the task more difficult even for PLMs and LLMs. Beyond that, those models also exhibit unreliable detection of words or tokens that require extra grounding for analysis. Building on a state-of-the-art prompting method in LLMs for sarcasm detection called Pragmatic Metacognitive Prompting (PMP), we introduce a retrieval-aware approach that incorporates retrieved contextual information for each target text. Our pipeline explores two complementary ways to provide context: adding non-parametric knowledge using web-based retrieval when the model lacks necessary background, and eliciting the model’s own internal knowledge for a self-knowledge awareness strategy. We evaluated our approach with three datasets, such as Twitter Indonesia Sarcastic, SemEval-2018 Task 3, and MUStARD. Non-parametric retrieval resulted in a significant 9.87% macro-F1 improvement on Twitter Indonesia Sarcastic compared to the original PMP method. Self-knowledge retrieval improves macro-F1 by 3.29% on Semeval and by 4.08% on MUStARD. These findings highlight the importance of context in enhancing LLMs performance in sarcasm detection task, particularly the involvement of culturally specific slang, references, or unknown terms to the LLMs. Future work will focus on optimizing the retrieval of relevant contextual information and examining how retrieval quality affects performance. The experiment code is available at: https://github.com/wllchrst/sarcasm-detection_pmp_knowledge-base.

[27] Enhancing Burmese News Classification with Kolmogorov-Arnold Network Head Fine-tuning

Thura Aung, Eaint Kay Khaing Kyaw, Ye Kyaw Thu, Thazin Myint Oo, Thepchai Supnithi

Main category: cs.CL

TL;DR: KANs (Kolmogorov-Arnold Networks) outperform or match MLPs as classification heads for low-resource Burmese language tasks across various embeddings, with EfficientKAN achieving the best F1-score.

Details

Motivation: In low-resource languages like Burmese, fine-tuning typically freezes pre-trained encoder weights and only trains the final classification layer. MLPs are commonly used but have limitations in expressiveness and computational efficiency.

Method: Evaluated three KAN variants (FourierKAN, EfficientKAN, FasterKAN) as classification heads across diverse embeddings including TF-IDF, fastText, and multilingual transformers (mBERT, Distil-mBERT), comparing them against traditional MLPs.

Result: KAN-based heads were competitive with or superior to MLPs. EfficientKAN with fastText achieved highest F1-score (0.928), FasterKAN offered best speed-accuracy trade-off, and EfficientKAN matched/slightly outperformed MLPs with mBERT (0.917 F1).

Conclusion: KANs serve as expressive, efficient alternatives to MLPs for low-resource language classification, demonstrating their potential for improving performance in resource-constrained settings.

Abstract: In low-resource languages like Burmese, classification tasks often fine-tune only the final classification layer, keeping pre-trained encoder weights frozen. While Multi-Layer Perceptrons (MLPs) are commonly used, their fixed non-linearity can limit expressiveness and increase computational cost. This work explores Kolmogorov-Arnold Networks (KANs) as alternative classification heads, evaluating Fourier-based FourierKAN, Spline-based EfficientKAN, and Grid-based FasterKAN-across diverse embeddings including TF-IDF, fastText, and multilingual transformers (mBERT, Distil-mBERT). Experimental results show that KAN-based heads are competitive with or superior to MLPs. EfficientKAN with fastText achieved the highest F1-score (0.928), while FasterKAN offered the best trade-off between speed and accuracy. On transformer embeddings, EfficientKAN matched or slightly outperformed MLPs with mBERT (0.917 F1). These findings highlight KANs as expressive, efficient alternatives to MLPs for low-resource language classification.

[28] Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models

Bryan E. Tuck, Rakesh M. Verma

Main category: cs.CL

TL;DR: Cross-architecture evaluation of 28 LLM configurations on character-level constraint tasks reveals architectural differences matter more than parameter scaling, with systematic failures on orthographically atypical words.

Details

Motivation: To systematically evaluate how different LLM architectures handle hard orthographic constraints in controlled text generation, as current evaluation is limited.

Method: Evaluated 28 configurations across three model families (Qwen3, Claude Haiku-4.5, GPT-5-mini) on 58 word puzzles requiring character-level constraint satisfaction, using human difficulty ratings from 10,000 solvers.

Result: Architectural differences produced 2.0-2.2x performance gaps (F1=0.761 vs. 0.343) vs. 83% gain from parameter scaling. Models showed heterogeneous thinking budget sensitivity and systematic failures on orthographically atypical words like “data”, “poop”, “loll” (86-95% human success vs. 89-96% model miss rate).

Conclusion: Constraint satisfaction requires specialized architectural features or training objectives beyond standard scaling, as models over-rely on distributional plausibility and penalize orthographically atypical but valid patterns.

Abstract: Large language models must satisfy hard orthographic constraints during controlled text generation, yet systematic cross-architecture evaluation remains limited. We evaluate 28 configurations spanning three model families (Qwen3, Claude Haiku-4.5, GPT-5-mini) on 58 word puzzles requiring character-level constraint satisfaction. Architectural differences produce substantially larger performance gaps (2.0-2.2x, F1=0.761 vs. 0.343) than parameter scaling within families (83% gain from eightfold scaling), suggesting that constraint satisfaction may require specialized architectural features or training objectives beyond standard language model scaling. Thinking budget sensitivity proves heterogeneous: high-capacity models show strong returns (+0.102 to +0.136 F1), while mid-sized variants saturate or degrade. These patterns are inconsistent with uniform compute benefits. Using difficulty ratings from 10,000 human solvers per puzzle, we establish modest but consistent calibration (r=0.24-0.38) across all families, yet identify systematic failures on common words with unusual orthography (“data”, “poop”, “loll”: 86-95% human success, 89-96% model miss rate). These failures reveal over-reliance on distributional plausibility that penalizes orthographically atypical but constraint-valid patterns, suggesting architectural innovations may be required beyond simply scaling parameters or computational budgets.

[29] MortgageLLM: Domain-Adaptive Pretraining with Residual Instruction Transfer, Alignment Tuning, and Task-Specific Routing

Manish Jain, Satheesh Kumar Ponnambalam, Salman Faroz, Chandrakanth Lns, Vinay Sharma

Main category: cs.CL

TL;DR: MortgageLLM is a dual-expert LLM specialized for mortgage finance that avoids performance trade-offs by training separate models for conversational Q&A and structured tasks, using instruction residual technique and intelligent routing.

Details

Motivation: LLMs lack domain-specific knowledge for specialized sectors like mortgage finance, and single multi-task models suffer from performance degradation when optimizing for different capability types.

Method: Dual-track specialization from LLaMA-3.1-8B base model creating two specialists (conversational Q&A and structured task), instruction residual technique to restore instruction-following, and intelligent task routing using few-shot classification.

Result: Significantly outperforms base model: LLM-as-a-Judge scores of 4.58 (summarization), 4.09 (Q&A), 2.6 (classification) vs 3.99, 4.0, 1.2 respectively; BERTScore improvements across all tasks.

Conclusion: The dual-expert approach effectively addresses domain specialization challenges in mortgage finance while maintaining instruction-following fidelity, demonstrating superior performance over baseline methods.

Abstract: Large Language Models (LLMs) demonstrate exceptional capabilities across general domains, yet their application to specialized sectors such as mortgage finance requires domain-specific knowledge augmentation while preserving instruction-following fidelity. We present MortgageLLM, a novel domain-specific large language model that addresses this dual challenge. It is developed using a dual-track specialization framework from a single base model (LLaMA-3.1-8B). We opted for this dual-expert approach as a single multi-task model suffers from performance trade-offs, where optimizing for structured tasks (via SFT) degrades conversational fidelity (via DPO). Our dual-track method solves this by creating two specialists, allowing each to be optimally trained for its distinct capability. Our approach applies the instruction residual technique to restore instruction-following capabilities post-domain adaptation without supervised fine-tuning. We contribute: (1) application of this residual technique to the highly specialized mortgage finance domain; (2) a dual-expert architecture combining a conversational Q&A model and a structured task model for classification and summarization; and (3) an intelligent task routing mechanism using few-shot classification performed by one of the expert models itself. We validate our approach on domain-specific benchmarks, where our final model (MLM v2) significantly outperforms the base LLaMA-3.1-8B-Instruct, achieving an LLM-as-a-Judge summarization score of 4.58 (vs. 3.99), a Q&A score of 4.09 (vs. 4.0), and a classification score of 2.6 (vs. 1.2). On semantic similarity, our model achieved a BERTScore of 0.77 for summarization (vs. 0.74), 0.68 for Q&A (vs. 0.58), and 0.75 for classification (vs. 0.73), substantially outperforming baseline approaches.

[30] Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines

Yuhang Wang, Yanxu Zhu, Dongyuan Lu, Jitao Sang

Main category: cs.CL

TL;DR: SGASA framework enhances reasoning model safety against adversarial jailbreak prompts through adaptive safety alignment using synthesized guidelines.

Details

Motivation: Adversarial jailbreak prompts can evade safety mechanisms and generate harmful content, requiring adaptive safety alignment to autonomously reinforce defenses.

Method: Two-stage framework: Data Pre-synthesis generates safety guidelines and augmented prompts; Alignment Fine-tuning uses SFT and DPO to embed guidelines into the model.

Result: Extensive experiments show SGASA significantly improves model safety across multiple datasets while minimizing unnecessary refusals of benign requests.

Conclusion: SGASA provides an adaptive and scalable approach to enhance reasoning model robustness against harmful adversarial prompts.

Abstract: Reasoning models have demonstrated remarkable capabilities in complex reasoning tasks. However, ensuring their safety against adversarial jailbreak prompts remains a critical challenge. Due to the covert and deceptive nature of such prompts, they can often evade built-in safety mechanisms and lead to the generation of harmful content. This underscores the need for an adaptive safety alignment approach that enables models to autonomously reinforce their defenses in response to adversarial inputs. This paper introduces the Synthesized Guideline-based Adaptive Safety Alignment (SGASA) framework, which internalizes model-generated safety guidelines to strengthen models’ ability to enhance robustness against harmful adversarial prompts while minimizing unnecessary refusals of benign requests. SGASA consists of two key stages: Data Pre-synthesis, which generates safety guidelines and augmented prompts; and Alignment Fine-tuning, which leverages Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO) to embed these guidelines into the model. Extensive experiments across multiple datasets demonstrate that SGASA significantly improves model safety, validating its adaptive and scalable effectiveness.

[31] Can Finetuing LLMs on Small Human Samples Increase Heterogeneity, Alignment, and Belief-Action Coherence?

Steven Wang, Kyle Hunt, Shaojie Tang, Kenneth Joseph

Main category: cs.CL

TL;DR: Fine-tuning LLMs on small human survey data improves response heterogeneity and alignment with human behavior, but still fails to reproduce original study’s regression coefficients, making LLM-generated data unsuitable for replacing human participants in inferential analyses.

Details

Motivation: To determine if fine-tuning LLMs on small human survey samples can address limitations in simulating human behavior, particularly issues like limited diversity, subgroup misalignment, and belief-action discrepancies.

Method: Used a behavioral experiment on information disclosure to compare human and LLM-generated responses across multiple dimensions: distributional divergence, subgroup alignment, belief-action coherence, and regression coefficient recovery.

Result: Fine-tuning on small human samples substantially improved heterogeneity, alignment, and belief-action coherence compared to base models, but failed to reproduce the original study’s regression coefficients.

Conclusion: While fine-tuning improves LLM performance on some behavioral simulation metrics, LLM-generated data remains unsuitable for replacing human participants in formal inferential analyses due to inability to reproduce key statistical relationships.

Abstract: There is ongoing debate about whether large language models (LLMs) can serve as substitutes for human participants in survey and experimental research. While recent work in fields such as marketing and psychology has explored the potential of LLM-based simulation, a growing body of evidence cautions against this practice: LLMs often fail to align with real human behavior, exhibiting limited diversity, systematic misalignment for minority subgroups, insufficient within-group variance, and discrepancies between stated beliefs and actions. This study examines an important and distinct question in this domain: whether fine-tuning on a small subset of human survey data, such as that obtainable from a pilot study, can mitigate these issues and yield realistic simulated outcomes. Using a behavioral experiment on information disclosure, we compare human and LLM-generated responses across multiple dimensions, including distributional divergence, subgroup alignment, belief-action coherence, and the recovery of regression coefficients. We find that fine-tuning on small human samples substantially improves heterogeneity, alignment, and belief-action coherence relative to the base model. However, even the best-performing fine-tuned models fail to reproduce the regression coefficients of the original study, suggesting that LLM-generated data remain unsuitable for replacing human participants in formal inferential analyses.

[32] Developing an Open Conversational Speech Corpus for the Isan Language

Adisai Na-Thalang, Chanakan Wittayasakpan, Kritsadha Phatcharoen, Supakit Buakaw

Main category: cs.CL

TL;DR: First open conversational speech dataset for Isan language (Thai regional dialect) featuring natural speech with colloquialisms, spontaneous prosody, disfluencies, and code-switching with Thai.

Details

Motivation: Address lack of conversational speech resources for Isan language, which has no standardized orthography, to support inclusive AI development and research on underrepresented languages.

Method: Developed practical transcription protocols balancing representational accuracy with computational processing requirements, overcoming challenges of variable writing practices due to tonal differences between Thai and Isan.

Result: Created first open conversational speech dataset for Isan capturing authentic linguistic phenomena, establishing transcription guidelines for this non-standardized language.

Conclusion: The dataset contributes to inclusive AI, supports underrepresented language research, and provides foundation for modeling conversational speech challenges in linguistically diverse contexts.

Abstract: This paper introduces the development of the first open conversational speech dataset for the Isan language, the most widely spoken regional dialect in Thailand. Unlike existing speech corpora that are primarily based on read or scripted speech, this dataset consists of natural speech, thereby capturing authentic linguistic phenomena such as colloquials, spontaneous prosody, disfluencies, and frequent code-switching with central Thai. A key challenge in building this resource lies in the lack of a standardized orthography for Isan. Current writing practices vary considerably, due to the different lexical tones between Thai and Isan. This variability complicates the design of transcription guidelines and poses questions regarding consistency, usability, and linguistic authenticity. To address these issues, we establish practical transcription protocols that balance the need for representational accuracy with the requirements of computational processing. By releasing this dataset as an open resource, we aim to contribute to inclusive AI development, support research on underrepresented languages, and provide a basis for addressing the linguistic and technical challenges inherent in modeling conversational speech.

[33] PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark

Robert Belanec, Branislav Pecher, Ivan Srba, Maria Bielikova

Main category: cs.CL

TL;DR: PEFT-Bench is a unified benchmark for evaluating parameter-efficient fine-tuning methods on large language models, addressing limitations in current evaluations through comprehensive testing across 27 datasets and 6 PEFT methods with a new metric that considers computational factors.

Details

Motivation: Current evaluations of parameter-efficient fine-tuning (PEFT) methods are limited in scope and difficult to reproduce, despite PEFT's importance in reducing computational and environmental costs of large language models while maintaining performance.

Method: Developed PEFT-Bench, a unified end-to-end benchmark that evaluates diverse PEFT methods on autoregressive LLMs across 27 NLP datasets and 6 PEFT methods, and introduced PEFT Soft Score Penalties (PSCP) metric that accounts for trainable parameters, inference speed, and training memory usage.

Result: The benchmark provides comprehensive evaluation capabilities for PEFT methods, enabling systematic comparison across multiple dimensions including computational efficiency alongside performance metrics.

Conclusion: PEFT-Bench addresses the gap in current PEFT evaluations by providing a reproducible, comprehensive framework that considers both performance and computational factors, facilitating better comparison and selection of parameter-efficient fine-tuning methods.

Abstract: Despite the state-of-the-art performance of Large Language Models (LLMs) achieved on many tasks, their massive scale often leads to high computational and environmental costs, limiting their accessibility. Parameter-efficient fine-tuning (PEFT) methods address this challenge by reducing the number of trainable parameters while maintaining strong downstream performance. Despite the increased development in PEFT methods, current evaluations remain limited (in terms of evaluated models and datasets) and difficult to reproduce. To bridge this gap, we introduce PEFT-Bench, a unified end-to-end benchmark for evaluating diverse PEFT methods on autoregressive LLMs. We demonstrate its usage across 27 NLP datasets and 6 PEFT methods. To account for different PEFT training and inference factors, we also introduce the PEFT Soft Score Penalties (PSCP) metric, which takes trainable parameters, inference speed, and training memory usage into account.

[34] Emergent Lexical Semantics in Neural Language Models: Testing Martin’s Law on LLM-Generated Text

Kai Kugler

Main category: cs.CL

TL;DR: First systematic investigation of Martin’s Law in neural language models reveals non-monotonic development: emerges around checkpoint 100, peaks at 104, then degrades, with semantic collapse in smaller models.

Details

Motivation: To systematically investigate how Martin's Law (relationship between word frequency and polysemy) emerges in neural language models during training and understand the developmental trajectory of linguistic regularities.

Method: Used DBSCAN clustering of contextualized embeddings to operationalize word senses, analyzed four Pythia models (70M-1B parameters) across 30 training checkpoints.

Result: Martin’s Law emerges around checkpoint 100, reaches peak correlation (r > 0.6) at checkpoint 104, then degrades by checkpoint 105. Smaller models experience catastrophic semantic collapse while larger models show graceful degradation. Frequency-specificity trade-off remains stable (r ≈ -0.3).

Conclusion: Compliance with linguistic regularities in LLM-generated text follows a balanced trajectory with an optimal semantic window rather than monotonically increasing with training, establishing a novel methodology for evaluating emergent linguistic structure.

Abstract: We present the first systematic investigation of Martin’s Law - the empirical relationship between word frequency and polysemy - in text generated by neural language models during training. Using DBSCAN clustering of contextualized embeddings as an operationalization of word senses, we analyze four Pythia models (70M-1B parameters) across 30 training checkpoints. Our results reveal a non-monotonic developmental trajectory: Martin’s Law emerges around checkpoint 100, reaches peak correlation (r > 0.6) at checkpoint 104, then degrades by checkpoint 105. Smaller models (70M, 160M) experience catastrophic semantic collapse at late checkpoints, while larger models (410M, 1B) show graceful degradation. The frequency-specificity trade-off remains stable (r $\approx$ -0.3) across all models. These findings suggest that compliance with linguistic regularities in LLM-generated text is not monotonically increasing with training, but instead follows a balanced trajectory with an optimal semantic window. This work establishes a novel methodology for evaluating emergent linguistic structure in neural language models.

[35] Training Introspective Behavior: Fine-Tuning Induces Reliable Internal State Detection in a 7B Model

Joshua Fonseca Rivera

Main category: cs.CL

TL;DR: Fine-tuning enables reliable detection of injected thoughts in language models, transforming near-zero accuracy to 85% with no false positives.

Details

Motivation: To determine if introspective awareness in language models can be directly trained rather than waiting for emergence, addressing Lindsey's open question about training for introspection.

Method: Fine-tuning on transient single-token injections to train a 7B parameter model to detect and report injected activation patterns.

Result: Model transformed from 0.4% accuracy to 85% accuracy on held-out concepts with 0% false positives, satisfying three of Lindsey’s criteria for introspective awareness.

Conclusion: At least one component of introspective behavior can be directly induced through training, offering a pathway to built-in AI transparency.

Abstract: Lindsey (2025) investigates introspective awareness in language models through four experiments, finding that models can sometimes detect and identify injected activation patterns – but unreliably (~20% success in the best model). We focus on the first of these experiments – self-report of injected “thoughts” – and ask whether this capability can be directly trained rather than waiting for emergence. Through fine-tuning on transient single-token injections, we transform a 7B parameter model from near-complete failure (0.4% accuracy, 6.7% false positive rate) to reliable detection (85% accuracy on held-out concepts at α=40, 0% false positives). Our model detects fleeting “thoughts” injected at a single token position, retains that information, and reports the semantic content across subsequent generation steps. On this task, our trained model satisfies three of Lindsey’s criteria: accuracy (correct identification), grounding (0/60 false positives), and internality (detection precedes verbalization). Generalization to unseen concept vectors (7.5pp gap) demonstrates the model learns a transferable skill rather than memorizing specific vectors, though this does not establish metacognitive representation in Lindsey’s sense. These results address an open question raised by Lindsey: whether “training for introspection would help eliminate cross-model differences.” We show that at least one component of introspective behavior can be directly induced, offering a pathway to built-in AI transparency.

[36] Can LLMs extract human-like fine-grained evidence for evidence-based fact-checking?

Antonín Jarolím, Martin Fajčík, Lucia Makaiová

Main category: cs.CL

TL;DR: This paper creates a new dataset for fine-grained evidence extraction in Czech and Slovak claims, evaluates LLMs on this task, and finds that LLMs often fail to copy evidence verbatim, with varying performance across different model sizes.

Details

Motivation: Misinformation spreads in online news comments, requiring methods to detect incorrect information by identifying relevant documents and exact text spans that support or refute claims.

Method: Created a new dataset with two-way annotated fine-grained evidence by paid annotators, then evaluated various large language models (LLMs) on this dataset to assess their alignment with human annotations.

Result: LLMs often fail to copy evidence verbatim from source text, leading to invalid outputs. Llama3.1:8b achieved high proportion of correct outputs despite small size, while gpt-oss-120b underperformed despite more parameters. Qwen3:14b, deepseek-r1:32b, and gpt-oss:20b showed effective balance between model size and alignment.

Conclusion: Fine-grained evidence extraction for Czech and Slovak claims reveals LLMs’ limitations in verbatim evidence copying, with smaller models sometimes outperforming larger ones, suggesting model size alone doesn’t guarantee better performance on this task.

Abstract: Misinformation frequently spreads in user comments under online news articles, highlighting the need for effective methods to detect factually incorrect information. To strongly support or refute claims extracted from such comments, it is necessary to identify relevant documents and pinpoint the exact text spans that justify or contradict each claim. This paper focuses on the latter task – fine-grained evidence extraction for Czech and Slovak claims. We create new dataset, containing two-way annotated fine-grained evidence created by paid annotators. We evaluate large language models (LLMs) on this dataset to assess their alignment with human annotations. The results reveal that LLMs often fail to copy evidence verbatim from the source text, leading to invalid outputs. Error-rate analysis shows that the {llama3.1:8b model achieves a high proportion of correct outputs despite its relatively small size, while the gpt-oss-120b model underperforms despite having many more parameters. Furthermore, the models qwen3:14b, deepseek-r1:32b, and gpt-oss:20b demonstrate an effective balance between model size and alignment with human annotations.

[37] Text-to-SQL as Dual-State Reasoning: Integrating Adaptive Context and Progressive Generation

Zhifeng Hao, Qibin Song, Ruichu Cai, Boyan Xu

Main category: cs.CL

TL;DR: DSR-SQL introduces a dual-state reasoning framework for Text-to-SQL that overcomes limitations of Chain-of-Thought approaches in complex enterprise databases through adaptive context state and progressive generation state interaction.

Details

Motivation: Current divide-and-conquer reasoning approaches struggle with complex enterprise databases due to limited context capacity, unreliable schema linking, and weak grounding in database semantics.

Method: Models Text-to-SQL as interaction between adaptive context state (constructs compact semantically faithful environment) and progressive generation state (formalizes SQL synthesis as feedback-guided state transitions for self-correction).

Result: Achieves 35.28% execution accuracy on Spider 2.0-Snow and 68.32% on BIRD development set without post-training or in-context examples.

Conclusion: DSR-SQL provides an effective framework for complex Text-to-SQL tasks through dual-state reasoning that enables coherent reasoning and better alignment with user intent.

Abstract: Recent divide-and-conquer reasoning approaches, particularly those based on Chain-of-Thought (CoT), have substantially improved the Text-to-SQL capabilities of Large Language Models (LLMs). However, when applied to complex enterprise databases, such methods struggle to maintain coherent reasoning due to limited context capacity, unreliable schema linking, and weak grounding in database semantics. To overcome these issues, we introduce DSR-SQL, a \textbf{D}ual-\textbf{S}tate \textbf{R}easoning framework that models Text-to-SQL as an interaction between an adaptive context state and a progressive generation state. The first constructs a compact, semantically faithful environment by refining large schemas and selecting relevant structures, while the second formalizes SQL synthesis as feedback-guided state transitions, enabling the model to self-correct and align with user intent. Without any post-training or in-context examples, DSR-SQL achieves competitive performance, reaching 35.28% execution accuracy on Spider 2.0-Snow and 68.32% on BIRD development set. Our implementation will be open-sourced at: https://github.com/DMIRLAB-Group/DSR-SQL.

[38] Odin: Oriented Dual-module Integration for Text-rich Network Representation Learning

Kaifeng Hong, Yinglong Zhang, Xiaoying Hong, Xuewen Xia, Xing Xu

Main category: cs.CL

TL;DR: Odin is a new architecture that integrates graph structure into Transformers at specific layers through oriented dual-module mechanism, avoiding over-smoothing and hop-dependent diffusion while achieving state-of-the-art performance on text-rich graphs.

Details

Motivation: Existing approaches for text-attributed graphs either rely on GNNs (limited by over-smoothing and hop-dependent diffusion) or Transformers (that overlook graph topology and treat nodes as isolated sequences), creating a need for better structure-text integration.

Method: Odin injects graph structure into Transformers at selected depths through an oriented dual-module mechanism, integrating multi-hop structures at specific layers aligned with semantic hierarchy. It aggregates on global [CLS] representation to avoid over-smoothing and decouple structural abstraction from neighborhood size.

Result: Odin achieves state-of-the-art accuracy on multiple text-rich graph benchmarks, while Light Odin (lightweight variant) delivers competitive performance with significantly reduced computational cost.

Conclusion: Odin and Light Odin form a unified, hop-free framework for principled structure-text integration that strictly contains expressive power of both pure Transformers and GNNs, with released source code available.

Abstract: Text-attributed graphs require models to effectively combine strong textual understanding with structurally informed reasoning. Existing approaches either rely on GNNs–limited by over-smoothing and hop-dependent diffusion–or employ Transformers that overlook graph topology and treat nodes as isolated sequences. We propose Odin (Oriented Dual-module INtegration), a new architecture that injects graph structure into Transformers at selected depths through an oriented dual-module mechanism.Unlike message-passing GNNs, Odin does not rely on multi-hop diffusion; instead, multi-hop structures are integrated at specific Transformer layers, yielding low-, mid-, and high-level structural abstraction aligned with the model’s semantic hierarchy. Because aggregation operates on the global [CLS] representation, Odin fundamentally avoids over-smoothing and decouples structural abstraction from neighborhood size or graph topology. We further establish that Odin’s expressive power strictly contains that of both pure Transformers and GNNs.To make the design efficient in large-scale or low-resource settings, we introduce Light Odin, a lightweight variant that preserves the same layer-aligned structural abstraction for faster training and inference. Experiments on multiple text-rich graph benchmarks show that Odin achieves state-of-the-art accuracy, while Light Odin delivers competitive performance with significantly reduced computational cost. Together, Odin and Light Odin form a unified, hop-free framework for principled structure-text integration. The source code of this model has been released at https://github.com/hongkaifeng/Odin.

[39] A Systematic Study of Model Merging Techniques in Large Language Models

Oğuz Kağan Hitit, Leander Girrbach, Zeynep Akata

Main category: cs.CL

TL;DR: Model merging for LLMs: Task Arithmetic is the only reliable method among six evaluated approaches, while other methods cause performance drops, indicating current techniques don’t transfer well to modern LLMs.

Details

Motivation: To determine if model merging advantages from smaller models generalize to LLMs, and to systematically evaluate merging methods for efficient model reuse and performance improvement.

Method: Large-scale evaluation of six state-of-the-art merging methods across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard LLM benchmarks, measuring probability of outperforming base model and relative gains.

Result: Task Arithmetic (oldest and simplest method) is the only approach that reliably yields performance gains on LLMs. Other interference-aware and subspace merging methods typically result in significant performance drops.

Conclusion: Current merging techniques do not directly transfer to modern LLMs, motivating the need for LLM-specific merging algorithms and merging-aware fine-tuning methods.

Abstract: Model merging combines multiple fine-tuned checkpoints into a single model without additional training, offering an attractive approach to reusing models and efficiently improving performance. However, it remains unclear whether the advantages reported for smaller models and classifiers generalize to LLMs. We present a large-scale, systematic evaluation of six state-of-the-art merging methods, including recent subspace methods, across four open-weight LLMs, twelve fine-tuned checkpoints per base model, and sixteen standard LLM benchmarks. Evaluating through standardized benchmarks, we measure both the probability that a merged model outperforms the base model and relative gains over the best individual checkpoint. Our results show that the oldest and simplest method, Task Arithmetic, is the only approach that reliably yields performance gains on LLMs. Other interference-aware and subspace merging methods typically result in significant performance drops. Our findings indicate that current merging techniques do not directly transfer to modern LLMs. This motivates the design of LLM-specific merging algorithms and merging-aware fine-tuning methods. Code will be released upon acceptance of this paper.

[40] Hierarchical Ranking Neural Network for Long Document Readability Assessment

Yurui Zheng, Yijun Chen, Shaohong Zhang

Main category: cs.CL

TL;DR: Proposes a bidirectional readability assessment model that uses sentence-level analysis and pairwise sorting to handle text length and ordinal label relationships, achieving competitive performance on Chinese and English datasets.

Details

Motivation: Most deep learning approaches for readability assessment fail to consider text length or the ordinal relationship between readability labels, limiting their effectiveness.

Method: Uses bidirectional mechanism to capture contextual information and identify semantic-rich regions for sentence-level readability prediction, then aggregates for document-level assessment. Introduces pairwise sorting algorithm to model ordinal relationships through label subtraction.

Result: Experimental results on Chinese and English datasets show the model achieves competitive performance and outperforms other baseline models.

Conclusion: The proposed bidirectional readability assessment mechanism effectively handles text length and ordinal label relationships, demonstrating superior performance compared to existing approaches.

Abstract: Readability assessment aims to evaluate the reading difficulty of a text. In recent years, while deep learning technology has been gradually applied to readability assessment, most approaches fail to consider either the length of the text or the ordinal relationship of readability labels. This paper proposes a bidirectional readability assessment mechanism that captures contextual information to identify regions with rich semantic information in the text, thereby predicting the readability level of individual sentences. These sentence-level labels are then used to assist in predicting the overall readability level of the document. Additionally, a pairwise sorting algorithm is introduced to model the ordinal relationship between readability levels through label subtraction. Experimental results on Chinese and English datasets demonstrate that the proposed model achieves competitive performance and outperforms other baseline models.

[41] Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation

Lina Conti, Dennis Fucci, Marco Gaido, Matteo Negri, Guillaume Wisniewski, Luisa Bentivogli

Main category: cs.CL

TL;DR: Speech translation models use acoustic cues and language patterns to assign gender to speaker-referring terms, with models learning broader masculine prevalence patterns rather than just replicating training data associations.

Details

Motivation: Speech conveys speaker gender through acoustic cues like pitch, creating modality-specific bias concerns in speech translation where vocal characteristics may influence gender assignment in gender-ambiguous terms, risking misgendering speakers.

Method: Investigated gender assignment mechanisms across three language pairs (en-es/fr/it) using contrastive feature attribution on spectrograms to examine training data patterns, internal language model biases, and acoustic information interactions.

Result: Models learn broader patterns of masculine prevalence rather than replicating term-specific gender associations. While ILM exhibits strong masculine bias, models can override these preferences using acoustic input. Higher accuracy models use first-person pronouns to link gendered terms to speakers, accessing gender information distributed across frequency spectrum.

Conclusion: Speech translation models employ complex mechanisms for gender assignment that combine acoustic cues and linguistic patterns, with successful models using distributed frequency information rather than just pitch cues to make accurate gender assignments.

Abstract: Unlike text, speech conveys information about the speaker, such as gender, through acoustic cues like pitch. This gives rise to modality-specific bias concerns. For example, in speech translation (ST), when translating from languages with notional gender, such as English, into languages where gender-ambiguous terms referring to the speaker are assigned grammatical gender, the speaker’s vocal characteristics may play a role in gender assignment. This risks misgendering speakers, whether through masculine defaults or vocal-based assumptions. Yet, how ST models make these decisions remains poorly understood. We investigate the mechanisms ST models use to assign gender to speaker-referring terms across three language pairs (en-es/fr/it), examining how training data patterns, internal language model (ILM) biases, and acoustic information interact. We find that models do not simply replicate term-specific gender associations from training data, but learn broader patterns of masculine prevalence. While the ILM exhibits strong masculine bias, models can override these preferences based on acoustic input. Using contrastive feature attribution on spectrograms, we reveal that the model with higher gender accuracy relies on a previously unknown mechanism: using first-person pronouns to link gendered terms back to the speaker, accessing gender information distributed across the frequency spectrum rather than concentrated in pitch.

[42] Bangla Sign Language Translation: Dataset Creation Challenges, Benchmarking and Prospects

Husne Ara Rubaiyeat, Hasan Mahmud, Md Kamrul Hasan

Main category: cs.CL

TL;DR: The paper presents IsharaKhobor, a dataset for Bangla Sign Language Translation (BdSLT) to address the low-resource nature of the language, along with two subsets for research purposes.

Details

Motivation: To develop AI-based assistive tools for deaf and hard of hearing people in the Bangla speaking community by creating standard sentence-level datasets for Bangla Sign Language Translation, which has been severely constrained due to being a low-resource language.

Method: Created the IsharaKhobor dataset with two subsets (IsharaKhobor_small and IsharaKhobor_canonical_small) through vocabulary restriction and canonicalization. Benchmarked using landmark-based raw and RQE embeddings, and addressed challenges in dataset development.

Result: Successfully developed and made publicly available the IsharaKhobor dataset on Kaggle, along with two refined subsets that resulted from ablation studies on vocabulary restriction and canonicalization.

Conclusion: The IsharaKhobor dataset enables research in Bangla Sign Language Translation and provides a foundation for developing AI assistive tools for the deaf and hard of hearing Bangla-speaking community, with benchmarks established for future work.

Abstract: Bangla Sign Language Translation (BdSLT) has been severely constrained so far as the language itself is very low resource. Standard sentence level dataset creation for BdSLT is of immense importance for developing AI based assistive tools for deaf and hard of hearing people of Bangla speaking community. In this paper, we present a dataset, IsharaKhobor , and two subset of it for enabling research. We also present the challenges towards developing the dataset and present some way forward by benchmarking with landmark based raw and RQE embedding. We do some ablation on vocabulary restriction and canonicalization of the same within the dataset, which resulted in two more datasets, IsharaKhobor_small and IsharaKhobor_canonical_small. The dataset is publicly available at: www.kaggle.com/datasets/hasanssl/isharakhobor [1].

[43] RoParQ: Paraphrase-Aware Alignment of Large Language Models Towards Robustness to Paraphrased Questions

Minjoon Choi

Main category: cs.CL

TL;DR: RoParQ benchmark evaluates LLM consistency on paraphrased questions, XParaCon metric measures robustness, and paraphrase-aware SFT improves model consistency to match larger models.

Details

Motivation: LLMs show inconsistent behavior on paraphrased questions, indicating reliance on surface patterns rather than true semantic understanding.

Method: Created RoParQ benchmark from standard datasets using model-generated paraphrases, proposed XParaCon metric for robustness measurement, and implemented reasoning-based paraphrase-aware SFT.

Result: Fine-tuned lightweight models achieved consistency levels comparable to much larger pre-trained models, with targeted alignment significantly enhancing robustness.

Conclusion: The approach effectively mitigates superficial memorization and fosters more robust, reliable LLMs through semantic invariance alignment.

Abstract: Large Language Models (LLMs) often exhibit inconsistent behavior when answering paraphrased questions, suggesting a reliance on surface-level patterns rather than true semantic understanding. To address this limitation, we introduce RoParQ, a benchmark specifically constructed to evaluate cross-paraphrase consistency in closed-book multiple-choice QA. This benchmark is derived from standard datasets by generating paraphrases via proprietary models and selectively retaining examples that elicit inconsistent confidence from a judge model. We further propose XParaCon, a novel evaluation metric that quantifies a model’s robustness by measuring the standard deviation of accuracies across question variants. Additionally, we implement a reasoning-based, paraphrase-aware Supervised Fine-Tuning (SFT) strategy designed to align models toward semantic invariance. Our experiments demonstrate that this targeted alignment significantly enhances robustness. Notably, fine-tuned lightweight models achieved consistency levels comparable to much larger pre-trained models. These results highlight the efficacy of our approach in mitigating superficial memorization and fostering more robust, reliable LLMs.

[44] Auxiliary Metrics Help Decoding Skill Neurons in the Wild

Yixiu Zhao, Xiaozhi Wang, Zijun Yao, Lei Hou, Juanzi Li

Main category: cs.CL

TL;DR: A lightweight method to identify “skill neurons” in LLMs that encode specific capabilities, extending beyond classification tasks to complex multi-skill scenarios using neuron activation correlations with external metrics.

Details

Motivation: LLMs demonstrate impressive capabilities but their internal workings remain largely black-box. Understanding which neurons encode specific skills could improve interpretability and reveal shortcuts in reasoning.

Method: Correlates neuron activations with auxiliary metrics (external labels, model confidence scores) to identify skill-specific neurons without manual token aggregation. Extends prior work on skill neurons to complex multi-skill scenarios.

Result: Successfully detected neurons that drive known skills and revealed previously unidentified shortcuts in arithmetic reasoning on BigBench. Validated on open-ended text generation and natural language inference tasks.

Conclusion: The method provides a simple, broadly applicable approach to uncover interpretable, task-specific neuron behaviors in LLMs, enhancing model transparency and revealing hidden reasoning patterns.

Abstract: Large language models (LLMs) exhibit remarkable capabilities across a wide range of tasks, yet their internal mechanisms remain largely opaque. In this paper, we introduce a simple, lightweight, and broadly applicable method with a focus on isolating neurons that encode specific skills. Building upon prior work that identified “skill neurons” via soft prompt training on classification tasks, our approach extends the analysis to complex scenarios involving multiple skills. We correlate neuron activations with auxiliary metrics – such as external labels and the model’s own confidence score – thereby uncovering interpretable and task-specific behaviors without the need for manual token aggregation. We empirically validate our method on tasks spanning open-ended text generation and natural language inference, demonstrating its ability to detect neurons that not only drive known skills but also reveal previously unidentified shortcuts in arithmetic reasoning on BigBench.

[45] Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining

Dongyang Fan, Diba Hashemi, Sai Praneeth Karimireddy, Martin Jaggi

Main category: cs.CL

TL;DR: Metadata beyond URLs can accelerate LLM pretraining when prepended or used as auxiliary tasks, with fine-grained quality indicators being particularly effective.

Details

Motivation: Prior work only explored URLs as useful metadata for accelerating LLM pretraining, leaving open whether other metadata types could provide greater benefits.

Method: Investigated various metadata types, introduced metadata appending as auxiliary tasks, used learnable meta-tokens with masked loss, and analyzed latent representations through probing.

Result: Found that fine-grained document quality indicators and other metadata types accelerate pretraining when prepended. Metadata appending and learnable meta-tokens also improve training efficiency.

Conclusion: Multiple metadata types beyond URLs can improve LLM pretraining efficiency and effectiveness, with fine-grained granularity being a key feature of effective metadata.

Abstract: Incorporating metadata in Large Language Models (LLMs) pretraining has recently emerged as a promising approach to accelerate training. However prior work highlighted only one useful signal-URLs, leaving open the question of whether other forms of metadata could yield greater benefits. In this study, we investigate a wider range of metadata types and find other types of metadata, such as fine-grained indicators of document quality that can also accelerate pretraining when prepended. We identify a common feature among effective metadata: they encode information at a finer granularity. We further introduce metadata appending as a means of improving training efficiency, where predicting an appropriate metadata as auxiliary task can help speed up pretraining. In addition, learnable meta-tokens trained with masked loss can recover part of the speedup by inducing quality-aware latent structure. Using probing, we analyze latent representations to understand how metadata shapes learning. Together, these results yield practical guidelines for integrating metadata to improve both the efficiency and effectiveness of LLM pretraining.

[46] The author is dead, but what if they never lived? A reception experiment on Czech AI- and human-authored poetry

Anna Marklová, Ondřej Vinš, Martina Vokáčová, Jiří Milička

Main category: cs.CL

TL;DR: Czech native speakers cannot reliably distinguish AI-generated from human-written poetry, performing at chance level (45.8% accuracy). Aesthetic evaluations show strong authorship bias - poems believed to be AI-generated are rated lower, though AI poems actually receive equal or better ratings.

Details

Motivation: To examine perception of AI-generated poetry in Czech, a morphologically complex Slavic language with limited training data, since most AI poetry research focuses on English.

Method: Conducted study with Czech native speakers who guessed authorship of poems and provided aesthetic evaluations. Used logistic regression to analyze factors affecting recognition accuracy.

Result: Participants performed at chance level identifying authorship. AI poems were rated equally or more favorably than human ones, but poems believed to be AI-generated received lower ratings regardless of actual authorship. Higher liking of poems correlated with lower recognition accuracy.

Conclusion: AI can convincingly produce poetry even in complex, low-resource languages like Czech. Readers’ beliefs about authorship strongly influence aesthetic evaluation, creating an interconnected relationship between perceived source and perceived quality.

Abstract: Large language models are increasingly capable of producing creative texts, yet most studies on AI-generated poetry focus on English – a language that dominates training data. In this paper, we examine the perception of AI- and human-written Czech poetry. We ask if Czech native speakers are able to identify it and how they aesthetically judge it. Participants performed at chance level when guessing authorship (45.8% correct on average), indicating that Czech AI-generated poems were largely indistinguishable from human-written ones. Aesthetic evaluations revealed a strong authorship bias: when participants believed a poem was AI-generated, they rated it as less favorably, even though AI poems were in fact rated equally or more favorably than human ones on average. The logistic regression model uncovered that the more the people liked a poem, the less probable was that they accurately assign the authorship. Familiarity with poetry or literary background had no effect on recognition accuracy. Our findings show that AI can convincingly produce poetry even in a morphologically complex, low-resource (with respect of the training data of AI models) Slavic language such as Czech. The results suggest that readers’ beliefs about authorship and the aesthetic evaluation of the poem are interconnected.

[47] Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework

Dong Wang, Yang Li, Ansong Ni, Ching-Feng Yeh, Youssef Emad, Xinjie Lei, Liam Robbins, Karthik Padthe, Hu Xu, Xian Li, Asli Celikyilmaz, Ramya Raghavendra, Lifei Huang, Carole-Jean Wu, Shang-Wen Li

Main category: cs.CL

TL;DR: Matrix is a decentralized framework for multi-agent synthetic data generation that eliminates central orchestrators, uses message queues for control/data flow, and achieves 2-15x higher throughput than existing approaches.

Details

Motivation: Existing multi-agent synthesis frameworks have scalability bottlenecks due to centralized orchestrators or are hardcoded for specific domains, limiting flexibility and performance.

Method: Decentralized peer-to-peer design with serialized messages through distributed queues, lightweight agents for task progression, and distributed services for compute-intensive operations like LLM inference, built on Ray.

Result: Achieves 2-15x higher data generation throughput under identical hardware resources across diverse scenarios including multi-agent dialogue, web reasoning, and tool-use trajectory generation, without compromising output quality.

Conclusion: Matrix provides a scalable, flexible framework for multi-agent synthetic data generation that significantly outperforms existing approaches while maintaining output quality.

Abstract: Synthetic data has become increasingly important for training large language models, especially when real data is scarce, expensive, or privacy-sensitive. Many such generation tasks require coordinated multi-agent workflows, where specialized agents collaborate to produce data that is higher quality, more diverse, and structurally richer. However, existing frameworks for multi-agent synthesis often depend on a centralized orchestrator, creating scalability bottlenecks, or are hardcoded for specific domains, limiting flexibility. We present \textbf{Matrix}, a decentralized framework that represents both control and data flow as serialized messages passed through distributed queues. This peer-to-peer design eliminates the central orchestrator. Each task progresses independently through lightweight agents, while compute-intensive operations, such as LLM inference or containerized environments, are handled by distributed services. Built on Ray, Matrix scales to tens of thousands of concurrent agentic workflows and provides a modular, configurable design that enables easy adaptation to a wide range of data generation workflows. We evaluate Matrix across diverse synthesis scenarios, such as multi-agent collaborative dialogue, web-based reasoning data extraction, and tool-use trajectory generation in customer service environments. In all cases, Matrix achieves $2$–$15\times$ higher data generation throughput under identical hardware resources, without compromising output quality.

[48] Revisiting Generalization Across Difficulty Levels: It’s Not So Easy

Yeganeh Kordi, Nihal V. Nayak, Max Zuo, Ilana Nguyen, Stephen H. Bach

Main category: cs.CL

TL;DR: LLMs show limited generalization across task difficulties - training on easy or hard data alone doesn’t consistently improve performance across all difficulty levels, highlighting the need for diverse difficulty ranges in training and evaluation data.

Details

Motivation: To understand how LLMs generalize across different task difficulties, addressing conflicting findings about whether training on easier or harder data leads to better results and where those gains occur.

Method: Systematic evaluation using six datasets with examples ranked by difficulty using outputs from thousands of LLMs and Item Response Theory (IRT), creating objective difficulty ratings based solely on LLM performance rather than human judgment.

Result: Cross-difficulty generalization is often limited; training on either easy or hard data cannot achieve consistent improvements across the full range of difficulties.

Conclusion: Both training and evaluation data for LLMs should include a range of difficulties, and taking shortcuts with respect to difficulty is risky for achieving robust performance.

Abstract: We investigate how well large language models (LLMs) generalize across different task difficulties, a key question for effective data curation and evaluation. Existing research is mixed regarding whether training on easier or harder data leads to better results, and whether those gains come on easier or harder test data. We address this question by conducting a systematic evaluation of LLMs’ generalization across models, datasets, and fine-grained groups of example difficulty. We rank examples in six datasets using the outputs of thousands of different LLMs and Item Response Theory (IRT), a well-established difficulty metric in educational testing. Unlike prior work, our difficulty ratings are therefore determined solely by the abilities of many different LLMs, excluding human opinions of difficulty. With a more objective, larger-scale, and finer-grained analysis, we show that cross-difficulty generalization is often limited; training on either easy or hard data cannot achieve consistent improvements across the full range of difficulties. These results show the importance of having a range of difficulties in both training and evaluation data for LLMs, and that taking shortcuts with respect to difficulty is risky.

[49] Evaluating Large Language Models for Radiology Natural Language Processing

Zhengliang Liu, Tianyang Zhong, Yiwei Li, Yutong Zhang, Yi Pan, Zihao Zhao, Peixin Dong, Chao Cao, Yuxiao Liu, Peng Shu, Yaonai Wei, Zihao Wu, Chong Ma, Jiaqi Wang, Sheng Wang, Mengyue Zhou, Zuowei Jiang, Chunlin Li, Jason Holmes, Shaochen Xu, Lu Zhang, Haixing Dai, Kai Zhang, Lin Zhao, Yuanhao Chen, Xu Liu, Peilong Wang, Junhao Chen, Pingkun Yan, Jun Liu, Bao Ge, Lichao Sun, Dajiang Zhu, Xiang Li, Wei Liu, Xiaoyan Cai, Xintao Hu, Xi Jiang, Shu Zhang, Xin Zhang, Tuo Zhang, Shijie Zhao, Quanzheng Li, Hongtu Zhu, Dinggang Shen, Tianming Liu

Main category: cs.CL

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: The paper analysis request could not be completed due to API rate limiting

Method: Attempted to fetch paper metadata from arXiv API but encountered HTTP 429 error

Result: No paper content retrieved for analysis

Conclusion: Need to retry the request after the rate limit resets

Abstract: Failed to fetch summary for 2307.13693: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2307.13693&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[50] Fine-grained and Explainable Factuality Evaluation for Multimodal Summarization

Yue Zhang, Jingxuan Zuo, Liqiang Jing

Main category: cs.CL

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting) from arXiv API

Details

Motivation: The motivation cannot be determined as the paper content is unavailable

Method: The methodology cannot be analyzed due to content unavailability

Result: No results can be reported as the paper content is inaccessible

Conclusion: Unable to provide analysis due to technical limitations in accessing the paper

Abstract: Failed to fetch summary for 2402.11414: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2402.11414&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[51] Scaling Efficient LLMs

B.N. Kausik

Main category: cs.CL

TL;DR: Failed to fetch summary for 2402.14746 due to HTTP 429 error from arXiv API

Details

Motivation: Unable to determine motivation due to API rate limiting preventing access to paper content

Method: Cannot analyze method as paper content is unavailable

Result: No results can be analyzed due to HTTP 429 error blocking access to the paper

Conclusion: Unable to provide analysis due to technical limitations in accessing the paper content

Abstract: Failed to fetch summary for 2402.14746: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2402.14746&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[52] Gram2Vec: An Interpretable Document Vectorizer

Peter Zeng, Hannah Stortz, Eric Sclafani, Alina Shabaeva, Maria Elizabeth Garza, Daniel Greeson, Owen Rambow

Main category: cs.CL

TL;DR: Unable to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Cannot determine motivation due to retrieval failure

Method: Cannot determine method due to retrieval failure

Result: Cannot determine results due to retrieval failure

Conclusion: Cannot determine conclusion due to retrieval failure

Abstract: Failed to fetch summary for 2406.12131: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2406.12131&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[53] A Psychology-based Unified Dynamic Framework for Curriculum Learning

Guangyu Meng, Qingkai Zeng, John P. Lalor, Hong Yu

Main category: cs.CL

TL;DR: PUDF is a psychology-based curriculum learning framework that uses Item Response Theory to quantify data difficulty and dynamically schedule training data, leading to faster convergence and higher accuracy in fine-tuning large language models.

Details

Motivation: Traditional curriculum learning faces challenges in defining data difficulty and determining appropriate data amounts at each training step. The paper aims to create a unified framework that addresses these issues using psychometric principles.

Method: Proposes PUDF framework that: 1) Uses Item Response Theory with Artificial Crowds to quantify global, interpretable difficulty values; 2) Implements Dynamic Data Selection via Model Ability Estimation (DDS-MAE) to schedule appropriate data amounts during training.

Result: Experimental results show that fine-tuning pre-trained LLMs with PUDF achieves higher accuracy and faster convergence on benchmark datasets compared to standard fine-tuning and state-of-the-art CL methods.

Conclusion: PUDF provides an effective curriculum learning framework where consistent IRT-based difficulty labeling and model ability estimation enable aligned training data selection, leading to improved performance and convergence speed.

Abstract: Directly learning from examples of varying difficulty levels is often challenging for both humans and machine learning models. A more effective strategy involves exposing learners to examples in a progressive order from easy to difficult. Curriculum Learning (CL) has been proposed to implement this strategy in machine learning model training. However, two key challenges persist in CL framework design: defining the difficulty of training data and determining the appropriate amount of data to input at each training step. Drawing inspiration from psychometrics, this paper presents a Psychology-based Unified Dynamic Framework for Curriculum Learning (PUDF). We quantify the difficulty of training data by applying Item Response Theory (IRT) to responses from Artificial Crowds (AC). This theory-driven IRT-AC approach leads to global (i.e., model-independent) and interpretable difficulty values. Leveraging IRT, we propose a training strategy, Dynamic Data Selection via Model Ability Estimation (DDS-MAE), to schedule the appropriate amount of data during model training. Since our difficulty labeling and model ability estimation are based on a consistent theory, namely IRT, their values are comparable within the same scope, potentially leading to aligned training data selection and faster convergence compared to the other CL methods. Experimental results demonstrate that fine-tuning pre-trained large language models with PUDF leads to higher accuracy and faster convergence on a suite of benchmark datasets compared to standard fine-tuning and state-of-the-art CL methods. Ablation studies and downstream analyses further validate the impact of PUDF for CL.

[54] Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models

Yinlam Chow, Guy Tennenholtz, Izzeddin Gur, Vincent Zhuang, Bo Dai, Sridhar Thiagarajan, Craig Boutilier, Rishabh Agarwal, Aviral Kumar, Aleksandra Faust

Main category: cs.CL

TL;DR: Proposes inference-aware fine-tuning that optimizes LLM performance for Best-of-N (BoN) inference strategy, using imitation learning and RL methods to overcome non-differentiable argmax challenges.

Details

Motivation: Effectively utilizing inference-time compute is crucial for better LLM performance, and current methods don't directly optimize for inference strategies like BoN.

Method: Develops imitation learning and reinforcement learning methods for BoN-aware fine-tuning, addressing the non-differentiable argmax operator in BoN selection.

Result: BoN-aware models learn meta-strategies that balance best responses with diverse exploration. Improves Gemma 2B performance: MATH from 26.8% to 30.8% (Bo32), HumanEval pass@16 from 61.6% to 67.1%.

Conclusion: BoN-aware fine-tuning effectively improves LLM performance and inference-time compute efficiency by directly optimizing for inference strategies.

Abstract: Recent studies have indicated that effectively utilizing inference-time compute is crucial for attaining better performance from large language models (LLMs). In this work, we propose a novel inference-aware fine-tuning paradigm, in which the model is fine-tuned in a manner that directly optimizes the performance of the inference-time strategy. We study this paradigm using the simple yet effective Best-of-N (BoN) inference strategy, in which a verifier selects the best out of a set of LLM-generated responses. We devise the first imitation learning and reinforcement learning~(RL) methods for BoN-aware fine-tuning, overcoming the challenging, non-differentiable argmax operator within BoN. We empirically demonstrate that our BoN-aware models implicitly learn a meta-strategy that interleaves best responses with more diverse responses that might be better suited to a test-time input – a process reminiscent of the exploration-exploitation trade-off in RL. Our experiments demonstrate the effectiveness of BoN-aware fine-tuning in terms of improved performance and inference-time compute. In particular, we show that our methods improve the Bo32 performance of Gemma 2B on Hendrycks MATH from 26.8% to 30.8%, and pass@32 from 60.0% to 67.0%, as well as the pass@16 on HumanEval from 61.6% to 67.1%.

[55] BoundingDocs: a Unified Dataset for Document Question Answering with Spatial Annotations

Simone Giovannini, Fabio Coppini, Andrea Gemelli, Simone Marinai

Main category: cs.CL

TL;DR: A unified document QA dataset combining multiple public datasets, reformulating Document AI tasks into QA format with OCR and bounding box annotations for LLM training and evaluation.

Details

Motivation: To create a comprehensive resource for document Question-Answering by unifying existing Document AI datasets and reformulating tasks like Information Extraction into QA format suitable for Large Language Models.

Method: Combined multiple public Document AI and VRDU datasets, reformulated IE tasks as QA, provided OCR text and bounding box positions for answers, and experimented with different prompting techniques including bounding box information.

Result: Created a unified document QA dataset with OCR and spatial annotations, enabling evaluation of prompting strategies for document comprehension in open-weight models.

Conclusion: The dataset facilitates training and evaluation of LLMs on document QA tasks, with bounding box information enhancing prompting effectiveness for document understanding.

Abstract: We present a unified dataset for document Question-Answering (QA), which is obtained combining several public datasets related to Document AI and visually rich document understanding (VRDU). Our main contribution is twofold: on the one hand we reformulate existing Document AI tasks, such as Information Extraction (IE), into a Question-Answering task, making it a suitable resource for training and evaluating Large Language Models; on the other hand, we release the OCR of all the documents and include the exact position of the answer to be found in the document image as a bounding box. Using this dataset, we explore the impact of different prompting techniques (that might include bounding box information) on the performance of open-weight models, identifying the most effective approaches for document comprehension.

[56] Position-Aware Depth Decay Decoding ($D^3$): Boosting Large Language Model Inference Efficiency

Siqi Fan, Xuezhi Fang, Xingrun Xing, Peng Han, Shuo Shang, Yequan Wang

Main category: cs.CL

TL;DR: Proposes Position-Aware Depth Decay Decoding (D³), a training-free layer skipping method that dynamically reduces computation during LLM inference by allocating fewer layers to later tokens based on their lower perplexity.

Details

Motivation: LLM inference is resource-intensive due to large parameter counts, and traditional compression requires retraining. Dynamic computation methods show not all components are needed for inference, enabling training-free optimization.

Method: Uses a power-law decay function ⌊L × (α^i)⌋ to determine layers retained for token Ti, based on observation that later tokens have lower perplexity and require less computation.

Result: Achieves 1.5x speedup on Llama models (7B-70B) with <1% performance drop on GSM8K and BBH benchmarks, maintaining comparable performance while reducing operations.

Conclusion: D³ demonstrates successful training-free dynamic depth optimization for LLMs, enabling significant inference speedup without performance degradation across various generation tasks.

Abstract: Due to the large number of parameters, the inference phase of Large Language Models (LLMs) is resource-intensive. Unlike traditional model compression, which needs retraining, recent dynamic computation methods show that not all components are required for inference, enabling a training-free pipeline. In this paper, we focus on the dynamic depth of LLM generation. A token-position aware layer skipping framework is proposed to save 1.5x times operations efficiently while maintaining performance. We first observed that tokens predicted later have lower perplexity and thus require less computation. Then, we propose a training-free algorithm called Position-Aware Depth Decay Decoding ($D^3$), which leverages a power-law decay function, $\left\lfloor L \times (α^i) \right\rfloor$, to determine the number of layers to retain when generating token $T_i$. Remarkably, without any retraining, the $D^3$ achieves success across a wide range of generation tasks for the first time. Experiments on large language models (\ie the Llama) with $7 \sim 70$ billion parameters show that $D^3$ can achieve an average 1.5x speedup compared with the full-inference pipeline while maintaining comparable performance with nearly no performance drop ($<1%$) on the GSM8K and BBH benchmarks.

[57] Reasoning Transfer for an Extremely Low-Resource and Endangered Language: Bridging Languages Through Sample-Efficient Language Understanding

Khanh-Tung Tran, Barry O’Sullivan, Hoang D. Nguyen

Main category: cs.CL

TL;DR: English-Pivoted CoT Training improves reasoning in low-resource languages by generating English chain-of-thought rationales while outputting final responses in the target language, achieving up to 28.33% improvement in mathematical reasoning.

Details

Motivation: Current CoT reasoning gains mainly benefit high-resource languages, leaving low-resource languages behind. The paper addresses this gap by exploring methods to enable effective reasoning in extremely low-resource scenarios.

Method: English-Pivoted CoT Training: supervised fine-tuning to generate CoT in English for low-resource language inputs, while outputting final responses in the target language. Also explores Mixed-Language CoT and Two-Stage Training.

Result: Outperforms other baselines with up to 28.33% improvement in low-resource mathematical reasoning benchmarks. Introduces LC2024, the first mathematical benchmark for Irish (extremely low-resource language).

Conclusion: Explicitly separating language understanding from reasoning enhances cross-lingual reasoning abilities. Provides a practical pathway for multilingual reasoning without extensive retraining in every low-resource language.

Abstract: Recent advances have enabled Large Language Models (LLMs) to tackle reasoning tasks by generating chain-of-thought (CoT) rationales, yet these gains have largely applied to high-resource languages, leaving low-resource languages behind. In this work, we first investigate CoT techniques in extremely low-resource scenarios through previous prompting, model-editing, and fine-tuning approaches. We introduce English-Pivoted CoT Training, leveraging the insight that LLMs internally operate in a latent space aligned toward the dominant language. Given input in a low-resource language, we perform supervised fine-tuning to generate CoT in English and output the final response in the target language. Across mathematical reasoning benchmarks, our approach outperforms other baselines with up to 28.33% improvement in low-resource scenarios. Our analysis and additional experiments, including Mixed-Language CoT and Two-Stage Training, show that explicitly separating language understanding from reasoning enhances cross-lingual reasoning abilities. To facilitate future work, we also release \emph{LC2024}, the first benchmark for mathematical tasks in Irish, an extremely low-resource and endangered language. Our results and resources highlight a practical pathway to multilingual reasoning without extensive retraining in every extremely low-resource language, despite data scarcity.

[58] The Distribution of Dependency Distance and Hierarchical Distance in Contemporary Written Japanese and Its Influencing Factors

Linxuan Wang, Shuiyuan Yu

Main category: cs.CL

TL;DR: Study examines dependency distance (DD) and hierarchical distance (HD) in Japanese, finding predicate valency drives their trade-off relationship and affects their probability distributions differently.

Details

Motivation: To understand the relationship between dependency distance and hierarchical distance in Japanese language structure and identify the underlying factors that influence their interaction.

Method: Analyzed probability distributions of DD and HD with/without fixed sentence length, examined changes in mean dependency distance (MDD) and mean hierarchical distance (MHD) with increasing sentence length, and calculated correlation coefficients using the Balanced Corpus of Contemporary Written Japanese.

Result: Predicate valency is the key factor behind the trade-off between MDD and MHD. Japanese speakers regulate linear and hierarchical complexity through predicate valency, with relative MDD/MHD sizes depending on valency thresholds. Valency affects HD distributions more than DD distributions, causing MDD to be lower than MHD.

Conclusion: Predicate valency governs the relationship between dependency and hierarchical distances in Japanese, with differential effects on their probability distributions that result in systematic differences between mean dependency and hierarchical distances.

Abstract: To explore the relationship between dependency distance (DD) and hierarchical distance (HD) in Japanese, we compared the probability distributions of DD and HD with and without sentence length fixed, and analyzed the changes in mean dependency distance (MDD) and mean hierarchical distance (MHD) as sentence length increases, along with their correlation coefficient based on the Balanced Corpus of Contemporary Written Japanese. It was found that the valency of the predicates is the underlying factor behind the trade-off relation between MDD and MHD in Japanese. Native speakers of Japanese regulate the linear complexity and hierarchical complexity through the valency of the predicates, and the relative sizes of MDD and MHD depend on whether the threshold of valency has been reached. Apart from the cognitive load, the valency of the predicates also affects the probability distributions of DD and HD. The effect of the valency of the predicates on the distribution of HD is greater than on that of DD, which leads to differences in their probability distributions and causes the mean of MDD to be lower than that of MHD.

[59] A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency

Sihyeong Park, Sungryeol Jeon, Chaelyn Lee, Seokhun Jeon, Byung-Soo Kim, Jemin Lee

Main category: cs.CL

TL;DR: This paper provides a comprehensive evaluation of 25 open-source and commercial LLM inference engines, analyzing their ease-of-use, deployment, scalability, and optimization techniques to guide researchers and developers.

Details

Motivation: The increasing use of LLMs in various applications has led to high inference costs due to repeated model invocations. While optimization methods exist, the diverse service requirements make it difficult to select appropriate methods, and there's a lack of systematic study on inference engines.

Method: The authors conducted a comprehensive evaluation of 25 open-source and commercial inference engines, examining them across multiple dimensions including ease-of-use, deployment, general-purpose support, scalability, and suitability for throughput/latency requirements. They also investigated optimization techniques and ecosystem maturity.

Result: The study provides detailed analysis of inference engine capabilities, optimization techniques supported, and performance characteristics. It identifies current limitations and offers practical guidance for selection and design of optimized LLM inference engines.

Conclusion: The paper outlines future research directions including support for complex LLM-based services, hardware diversity, and enhanced security. It provides a public repository to track ongoing developments in this rapidly evolving field.

Abstract: Large language models (LLMs) are widely applied in chatbots, code generators, and search engines. Workload such as chain-of-throught, complex reasoning, agent services significantly increase the inference cost by invoke the model repeatedly. Optimization methods such as parallelism, compression, and caching have been adopted to reduce costs, but the diverse service requirements make it hard to select the right method. Recently, specialized LLM inference engines have emerged as a key component for integrating the optimization methods into service-oriented infrastructures. However, a systematic study on inference engines is still lacking.This paper provides a comprehensive evaluation of 25 open-source and commercial inference engines. We examine each inference engine in terms of ease-of-use, ease-of-deployment, general-purpose support, scalability, and suitability for throughput- and latency-aware computation. Furthermore, we explore the design goals of each inference engine by investigating the optimization techniques it supports. In addition, we assess the ecosystem maturity of open source inference engines and handle the performance and cost policy of commercial solutions.We outline future research directions that include support for complex LLM-based services, support of various hardware, and enhanced security, offering practical guidance to researchers and developers in selecting and designing optimized LLM inference engines. We also provide a public repository to continually track developments in this fast-evolving field: \href{https://github.com/sihyeong/Awesome-LLM-Inference-Engine}{https://github.com/sihyeong/Awesome-LLM-Inference-Engine}.

[60] Enhancing Large Language Models for Detecting Mental Manipulation via Annotation-Free Data Augmentation and Anti-Curriculum Distillation

Yuansheng Gao, Han Bao, Tong Zhang, Bin Li, Jixiang Luo, Ronghao Chen, Zonghui Wang, Wenzhi Chen

Main category: cs.CL

TL;DR: MentalMAC is a framework that enhances LLMs’ ability to detect mental manipulation in dialogues using data augmentation, multi-task supervision, and progressive distillation, achieving significant performance improvements over baselines.

Details

Motivation: Mental manipulation is a serious psychological abuse form that's hard to detect due to insufficient training data, covert nature, and lack of real-world datasets.

Method: Three key components: EvoSA (annotation-free data augmentation using evolutionary operations and speech act theory), teacher-model-generated multi-task supervision, and progressive task-level anti-curriculum distillation.

Result: Achieved up to 25.9% improvement in F1mac and 8.1% in accuracy over best-performing baselines, outperforming commercial LLMs like GPT-4 and Claude-3.5-Sonnet. Created ReaMent dataset with 5,000 real-world dialogue samples.

Conclusion: MentalMAC effectively addresses the challenges in mental manipulation detection and demonstrates superior performance through its innovative framework components.

Abstract: Mental manipulation is a subtle yet pervasive form of psychological abuse that poses serious threats to mental health. Nevertheless, detecting mental manipulation remains a largely underexplored research problem. The field faces three major challenges: (i) insufficient and hard-to-obtain training data; (ii) the covert nature of mental manipulation, which hinders detection; and (iii) the lack of real-world datasets. To address these challenges, we propose MentalMAC, a novel framework that enhances large language models’ ability to detect elements of mental manipulation in multi-turn dialogue. Our approach consists of three key components: EvoSA, an annotation-free data augmentation method based on evolutionary operations and speech act theory; teacher-model-generated multi-task supervision; and progressive task-level anti-curriculum distillation. We then constructed the ReaMent dataset, comprising 5,000 real-world dialogue samples, utilizing MentalMAC-distilled models to aid in human annotation. Vast experiments show that MentalMAC achieves up to 25.9% improvement in F1mac and 8.1% in accuracy over the best-performing baseline, outperforming commercial LLMs such as GPT-4 and Claude-3.5-Sonnet. Warning: This paper contains content that may be offensive to the reader.

[61] Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

Hyungjoo Chae, Sunghwan Kim, Junhee Cho, Seungone Kim, Seungjun Moon, Gyeom Hwangbo, Dongha Lim, Minjin Kim, Yeonjun Hwang, Minju Gwak, Dongwook Choi, Minseok Kang, Gwanhoon Im, ByeongUng Cho, Hyojun Kim, Jun Hee Han, Taeyoon Kwon, Minju Kim, Beong-woo Kwak, Dongjin Kang, Jinyoung Yeo

Main category: cs.CL

TL;DR: Web-Shepherd is the first process reward model for web navigation that evaluates step-level trajectories, achieving better accuracy and cost-effectiveness than MLLMs like GPT-4o.

Details

Motivation: Web navigation requires long-horizon sequential decision making beyond typical MLLM capabilities, and existing methods using MLLMs as reward models are too slow and expensive for real-world deployment.

Method: Created WebPRM Collection (40K step-level preference pairs with annotated checklists) and WebRewardBench benchmark, then developed Web-Shepherd PRM to assess web navigation trajectories step-by-step.

Result: Web-Shepherd achieves ~30 points better accuracy than GPT-4o on WebRewardBench, and when used as verifier with GPT-4o-mini policy, achieves 10.9 points better performance at 10x lower cost compared to using GPT-4o-mini as verifier.

Conclusion: Web-Shepherd provides an effective, cost-efficient solution for web navigation reward modeling that outperforms MLLM-based approaches and enables practical deployment.

Abstract: Web navigation is a unique domain that can automate many repetitive real-life tasks and is challenging as it requires long-horizon sequential decision making beyond typical multimodal large language model (MLLM) tasks. Yet, specialized reward models for web navigation that can be utilized during both training and test-time have been absent until now. Despite the importance of speed and cost-effectiveness, prior works have utilized MLLMs as reward models, which poses significant constraints for real-world deployment. To address this, in this work, we propose the first process reward model (PRM) called Web-Shepherd which could assess web navigation trajectories in a step-level. To achieve this, we first construct the WebPRM Collection, a large-scale dataset with 40K step-level preference pairs and annotated checklists spanning diverse domains and difficulty levels. Next, we also introduce the WebRewardBench, the first meta-evaluation benchmark for evaluating PRMs. In our experiments, we observe that our Web-Shepherd achieves about 30 points better accuracy compared to using GPT-4o on WebRewardBench. Furthermore, when testing on WebArena-lite by using GPT-4o-mini as the policy and Web-Shepherd as the verifier, we achieve 10.9 points better performance, in 10 less cost compared to using GPT-4o-mini as the verifier. Our model, dataset, and code are publicly available at LINK.

[62] UITron-Speech: Towards Automated GUI Agents Based on Speech Instructions

Wenkang Han, Zhixiong Zeng, Jing Huang, Shu Jiang, Liming Zheng, Longrong Yang, Haibo Qiu, Chang Yao, Jingyuan Chen, Lin Ma

Main category: cs.CL

TL;DR: UITron-Speech is the first end-to-end GUI agent that processes speech instructions and screenshots to predict user actions, addressing text limitations in hands-free scenarios through speech synthesis, mixed-modality training, and grounding refinement.

Details

Motivation: Text-based GUI agents limit accessibility in hands-free scenarios; speech input offers more convenient and accessible human-computer interaction.

Method: Uses speech synthesis for dataset creation, mixed-modality training to address modality imbalance, and a two-step grounding refinement method for localization accuracy.

Result: Achieves robust performance and superior adaptability across multiple benchmarks, demonstrating the feasibility of speech-driven GUI agents.

Conclusion: Speech-driven GUI agents show great potential for more accessible and intelligent human-computer interaction, with UITron-Speech providing a viable solution.

Abstract: Autonomous agents for Graphical User Interfaces (GUIs) are revolutionizing human-computer interaction, yet their reliance on text-based instructions imposes limitations on accessibility and convenience, particularly in hands-free scenarios. To address this issue, we propose replacing text with speech as the instruction input modality for GUI agents, and introduce UITron-Speech, which is the first end-to-end GUI agent capable of directly processing speech instructions and on-device screenshots to predict user actions. To tackle the problem of data scarcity, we synthesize high-quality speech instruction datasets using a random-speaker text-to-speech model. Additionally, we design a mixed-modality training strategy to mitigate the inherent modality imbalance in pre-trained foundation models. Furthermore, we conduct a statistical analysis of the distribution of GUI grounding prediction errors and propose a training-free two-step grounding refinement method to alleviate minor localization deviations. Extensive experiments on multiple benchmarks demonstrate that UITron-Speech achieves robust performance and superior adaptability, underscoring the feasibility and potential of speech-driven GUI agents for more accessible and intelligent human-computer interaction. Our code and datasets are available at https://github.com/UITron-hub/UITron-Speech.

[63] The Structure-Content Trade-off in Knowledge Graph Retrieval

Valentin Six, Evan Dufraisse, Gaël de Chalendar

Main category: cs.CL

TL;DR: Subquestion-based retrieval improves content precision but yields disjoint subgraphs, while question-based retrieval maintains structure at the cost of relevance. Optimal performance arises between these extremes.

Details

Motivation: LLMs increasingly rely on knowledge graphs for factual reasoning, yet how retrieval design shapes their performance remains unclear.

Method: Using a hybrid retrieval function that controls the importance of initial question and subquestions to examine how question decomposition changes the retrieved subgraph’s content and structure.

Result: Subquestion-based retrieval improves content precision but yields disjoint subgraphs, while question-based retrieval maintains structure at the cost of relevance.

Conclusion: Balancing retrieval content and structure is key to effective LLM reasoning over structured knowledge.

Abstract: Large Language Models (LLMs) increasingly rely on knowledge graphs for factual reasoning, yet how retrieval design shapes their performance remains unclear. We examine how question decomposition changes the retrieved subgraph’s content and structure. Using a hybrid retrieval function that controls the importance of initial question and subquestions, we show that subquestion-based retrieval improves content precision, but yields disjoint subgraphs, while question-based retrieval maintains structure at the cost of relevance. Optimal performance arises between these extremes, revealing that balancing retrieval content and structure is key to effective LLM reasoning over structured knowledge.

[64] Co-NAML-LSTUR: A Combined Model with Attentive Multi-View Learning and Long- and Short-term User Representations for News Recommendation

Minh Hoang Nguyen, Thuat Thien Nguyen, Minh Nhat Ta, Tung Le, Huy Tien Nguyen

Main category: cs.CL

TL;DR: Co-NAML-LSTUR is a hybrid news recommendation framework that combines multi-view news encoding with hierarchical user modeling, achieving significant improvements over existing baselines on MIND benchmarks while being designed for limited data resources.

Details

Motivation: To address the challenge of jointly modeling multi-view news representations and capturing dynamic, dual-scale user interests (short- and long-term preferences) in news recommendation systems, particularly for resource-limited scenarios.

Method: Integrates NAML for attentive multi-view news encoding and LSTUR for hierarchical user modeling, leveraging BERT-based embeddings to enhance semantic representation. Designed as a hybrid framework for training on limited data resources.

Result: Significantly outperforms strong baselines on MIND-small and MIND-large benchmarks: improvements over NRMS by 1.55% in AUC and 1.15% in MRR, and over NAML by 2.45% in AUC and 1.71% in MRR.

Conclusion: The hybrid model effectively combines multi-view news modeling with dual-scale user representations, demonstrating practical effectiveness for resource-limited scenarios rather than claiming absolute state-of-the-art performance.

Abstract: News recommendation systems play a critical role in alleviating information overload by delivering personalized content. A key challenge lies in jointly modeling multi-view representations of news articles and capturing the dynamic, dual-scale nature of user interests-encompassing both short- and long-term preferences. Prior methods often rely on single-view features or insufficiently model user behavior across time. In this work, we introduce Co-NAML-LSTUR, a hybrid news recommendation framework that integrates NAML for attentive multi-view news encoding and LSTUR for hierarchical user modeling, designed for training on limited data resources. Our approach leverages BERT-based embeddings to enhance semantic representation. We evaluate Co-NAML-LSTUR on two widely used benchmarks, MIND-small and MIND-large. Results show that our model significantly outperforms strong baselines, achieving improvements over NRMS by 1.55% in AUC and 1.15% in MRR, and over NAML by 2.45% in AUC and 1.71% in MRR. These findings highlight the effectiveness of our efficiency-focused hybrid model, which combines multi-view news modeling with dual-scale user representations for practical, resource-limited resources rather than a claim to absolute state-of-the-art (SOTA). The implementation of our model is publicly available at https://github.com/MinhNguyenDS/Co-NAML-LSTUR

[65] On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey

Meishan Zhang, Xin Zhang, Xinping Zhao, Shouzheng Huang, Baotian Hu, Min Zhang

Main category: cs.CL

TL;DR: Failed to fetch paper summary due to HTTP 429 error (rate limiting)

Details

Motivation: Unable to determine from provided information

Method: Unable to determine from provided information

Result: Unable to determine from provided information

Conclusion: Unable to determine from provided information

Abstract: Failed to fetch summary for 2507.20783: Page request resulted in HTTP 429 (https://export.arxiv.org/api/query?search_query=&id_list=2507.20783&sortBy=relevance&sortOrder=descending&start=0&max_results=100)

[66] CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis

Yuzhuang Xu, Xu Han, Yuanchi Zhang, Yixuan Wang, Yijun Liu, Shiyu Ji, Qingfu Zhu, Wanxiang Che

Main category: cs.CL

TL;DR: CAMERA introduces micro-expert as a finer-grained compression unit for Mixture-of-Experts LLMs, enabling efficient pruning and quantization while maintaining performance.

Details

Motivation: MoE models suffer from computational and storage overheads without proportional performance gains from parameter growth, and existing compression methods face challenges in both performance and efficiency.

Method: Views MoE layers as mixtures of micro-experts, proposes CAMERA-P for structured micro-expert pruning and CAMERA-Q for mixed-precision quantization of micro-experts.

Result: CAMERA-P outperforms baselines under 20-60% pruning ratios, CAMERA-Q achieves superior results under 2-bit quantization, and enables complete micro-expert analysis of Qwen2-57B-A14B in <5 minutes on single A100 GPU.

Conclusion: Micro-expert level compression provides effective and efficient approach for MoE model optimization, achieving strong performance with significant computational savings.

Abstract: Large Language Models (LLMs) with Mixture-of-Experts (MoE) architectures are distinguished by their strong performance scaling with increasing parameters across a wide range of tasks, yet they also suffer from substantial computational and storage overheads. Notably, the performance gains of MoE models do not scale proportionally with the growth in expert parameters. While prior works attempt to reduce parameters via expert-level pruning, merging, or decomposition, they still suffer from challenges in both performance and computational efficiency. In this paper, we address these challenges by introducing micro-expert as a finer-grained compression unit that spans across matrices. We first establish a more fundamental perspective, viewing MoE layers as mixtures of micro-experts, and present CAMERA, a lightweight and training-free framework for identifying micro-expert redundancy. Our analysis uncovers significant variance in micro-expert contributions during decoding. Based on this insight, we further propose CAMERA-P, a structured micro-expert pruning framework, and CAMERA-Q, a mixed-precision quantization idea designed for micro-experts. Extensive experiments on nine downstream tasks show that CAMERA-P consistently outperforms strong baselines under pruning ratios ranging from 20% to 60%. Furthermore, CAMERA-Q achieves superior results under aggressive 2-bit quantization, surpassing existing matrix- and channel-level ideas. Notably, our method enables complete micro-expert analysis of Qwen2-57B-A14B in less than 5 minutes on a single NVIDIA A100-40GB GPU.

[67] Uncovering Implicit Bias in Large Language Models with Concept Learning Dataset

Leroy Z. Wang

Main category: cs.CL

TL;DR: LLMs show upward monotonicity bias in quantifiers during in-context concept learning, revealing hidden biases not apparent in direct prompting.

Details

Motivation: To uncover implicit biases in large language models using concept learning tasks.

Method: Used in-context concept learning experiments to test language models’ understanding of quantifiers.

Result: Language models exhibit bias toward upward monotonicity in quantifiers during concept learning, which is less visible in direct prompting.

Conclusion: In-context concept learning effectively reveals hidden biases in language models that standard testing methods may miss.

Abstract: We introduce a dataset of concept learning tasks that helps uncover implicit biases in large language models. Using in-context concept learning experiments, we found that language models may have a bias toward upward monotonicity in quantifiers; such bias is less apparent when the model is tested by direct prompting without concept learning components. This demonstrates that in-context concept learning can be an effective way to discover hidden biases in language models.

[68] Exploring Cross-Lingual Knowledge Transfer via Transliteration-Based MLM Fine-Tuning for Critically Low-resource Chakma Language

Adity Khisa, Nusrat Jahan Lia, Tasnim Mahfuz Nafis, Zarif Masud, Tanzir Pial, Shebuti Rayana, Ahmedul Kabir

Main category: cs.CL

TL;DR: Fine-tuning multilingual transformer models on Bangla-transliterated Chakma corpus improves performance for this low-resource Indo-Aryan language, achieving up to 73.54% token accuracy and 2.90 perplexity.

Details

Motivation: Chakma is an Indo-Aryan language with limited available data and remains underrepresented in language models, creating a need for effective approaches to handle low-resource languages.

Method: Created a novel corpus of Bangla-transliterated Chakma from literature, validated by native speakers, and fine-tuned six encoder-based transformer models (mBERT, XLM-RoBERTa, DistilBERT, BanglaBERT, IndicBERT, DeBERTaV3) on masked language modeling tasks.

Result: Fine-tuned multilingual models outperformed pre-trained counterparts, achieving up to 73.54% token accuracy and perplexity as low as 2.90. Analysis showed data quality impacts performance and OCR pipelines have limitations for Indic scripts.

Conclusion: Bangla-transliterated Chakma is effective for transfer learning, and the released dataset encourages further research on multilingual modeling for low-resource languages.

Abstract: As an Indo-Aryan language with limited available data, Chakma remains largely underrepresented in language models. In this work, we introduce a novel corpus of contextually coherent Bangla-transliterated Chakma, curated from Chakma literature, and validated by native speakers. Using this dataset, we fine-tune six encoder-based transformer models, including multilingual (mBERT, XLM-RoBERTa, DistilBERT), regional (BanglaBERT, IndicBERT), and monolingual English (DeBERTaV3) variants on masked language modeling (MLM) tasks. Our experiments show that fine-tuned multilingual models outperform their pre-trained counterparts when adapted to Bangla-transliterated Chakma, achieving up to 73.54% token accuracy and a perplexity as low as 2.90. Our analysis further highlights the impact of data quality on model performance and shows the limitations of OCR pipelines for morphologically rich Indic scripts. Our research demonstrates that Bangla-transliterated Chakma can be very effective for transfer learning for Chakma language, and we release our dataset to encourage further research on multilingual language modeling for low-resource languages.

[69] Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning

Wenjin Liu, Haoran Luo, Xueyuan Lin, Haoming Liu, Tiesunlong Shen, Jiapu Wang, Rui Mao, Erik Cambria

Main category: cs.CL

TL;DR: Prompt-R1 is an RL framework where a small LLM generates prompts for a large LLM to solve complex problems, improving performance over baselines.

Details

Motivation: Users struggle to provide effective prompts for complex problems, limiting LLM performance. Prompt-R1 addresses this by automating prompt generation.

Method: End-to-end reinforcement learning with small LLM generating prompts for large LLM reasoning. Uses dual-constrained reward for correctness, quality, and accuracy.

Result: Significantly outperforms baseline models across multiple public datasets and tasks.

Conclusion: Prompt-R1 provides an effective plug-and-play framework for improving LLM performance on complex problems through automated prompt generation.

Abstract: Recently, advanced large language models (LLMs) have emerged at an increasingly rapid pace. However, when faced with complex problems, most users are often unable to provide accurate and effective prompts to interact with LLMs, thus limiting the performance of LLMs. To address this challenge, we propose Prompt-R1, an end-to-end reinforcement learning framework that uses a small-scale LLM to collaborate with large-scale LLMs, replacing user interaction to solve problems better. This collaboration is cast as a multi-turn prompt interaction, where the small-scale LLM thinks and generates prompts, and the large-scale LLM performs complex reasoning. A dual-constrained reward is designed to optimize for correctness, generation quality, and reasoning accuracy. Prompt-R1 provides a plug-and-play framework that supports both inference and training with various large-scale LLMs. Experiments on multiple public datasets show that Prompt-R1 significantly outperforms baseline models across tasks. Our code is publicly available at https://github.com/QwenQKing/Prompt-R1.

[70] AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following

Yun He, Wenzhe Li, Hejia Zhang, Songlin Li, Karishma Mandyam, Sopan Khosla, Yuanhao Xiong, Nanshu Wang, Xiaoliang Peng, Beibin Li, Shengjie Bi, Shishir G. Patil, Qi Qi, Shengyu Feng, Julian Katz-Samuels, Richard Yuanzhe Pang, Sujan Gonugondla, Hunter Lang, Yue Yu, Yundi Qian, Maryam Fazel-Zarandi, Licheng Yu, Amine Benhalloum, Hany Awadalla, Manaal Faruqui

Main category: cs.CL

TL;DR: The paper introduces AdvancedIF, a benchmark for evaluating complex instruction following in LLMs, and proposes RIFL, a training pipeline that uses rubric-based reinforcement learning to improve instruction following capabilities.

Details

Motivation: Advanced instruction following for complex, multi-turn, and system-prompted instructions remains a significant challenge in LLMs, hindered by lack of high-quality benchmarks and reliable reward signals.

Method: Proposes RIFL (Rubric-based Instruction-Following Learning) - a post-training pipeline using rubric generation, finetuned rubric verifier, and reward shaping for reinforcement learning.

Result: RIFL achieves 6.7% absolute gain on AdvancedIF benchmark and strong results on public benchmarks, with ablation studies confirming effectiveness of each component.

Conclusion: Rubrics serve as a powerful tool for both training and evaluating advanced instruction following in LLMs, enabling more capable and reliable AI systems.

Abstract: Recent progress in large language models (LLMs) has led to impressive performance on a range of tasks, yet advanced instruction following (IF)-especially for complex, multi-turn, and system-prompted instructions-remains a significant challenge. Rigorous evaluation and effective training for such capabilities are hindered by the lack of high-quality, human-annotated benchmarks and reliable, interpretable reward signals. In this work, we introduce AdvancedIF (we will release this benchmark soon), a comprehensive benchmark featuring over 1,600 prompts and expert-curated rubrics that assess LLMs ability to follow complex, multi-turn, and system-level instructions. We further propose RIFL (Rubric-based Instruction-Following Learning), a novel post-training pipeline that leverages rubric generation, a finetuned rubric verifier, and reward shaping to enable effective reinforcement learning for instruction following. Extensive experiments demonstrate that RIFL substantially improves the instruction-following abilities of LLMs, achieving a 6.7% absolute gain on AdvancedIF and strong results on public benchmarks. Our ablation studies confirm the effectiveness of each component in RIFL. This work establishes rubrics as a powerful tool for both training and evaluating advanced IF in LLMs, paving the way for more capable and reliable AI systems.

[71] Mem-PAL: Towards Memory-based Personalized Dialogue Assistants for Long-term User-Agent Interaction

Zhaopei Huang, Qifeng Dai, Guozheng Wu, Xiaopeng Wu, Kehan Chen, Chuan Yu, Xubin Li, Tiezheng Ge, Wenxuan Wang, Qin Jin

Main category: cs.CL

TL;DR: PAL-Bench is a new benchmark for evaluating personalization capabilities in service-oriented dialogue assistants, featuring PAL-Set (a Chinese multi-session dataset) and H²Memory framework for improved personalized interactions.

Details

Motivation: Existing approaches overlook long-term interaction complexities and fail to capture users' subjective characteristics in service-oriented human-agent interactions, creating a need for better personalized dialogue assistants.

Method: Developed a multi-step LLM-based synthesis pipeline to create PAL-Set dataset, and proposed H²Memory - a hierarchical and heterogeneous memory framework using retrieval-augmented generation for personalized response generation.

Result: Comprehensive experiments on PAL-Bench and external datasets demonstrate the effectiveness of the proposed memory framework in improving personalized service-oriented interactions.

Conclusion: PAL-Bench provides a valuable evaluation framework for personalization capabilities, and H²Memory effectively addresses the limitations of existing approaches in capturing user-specific traits for long-term interactions.

Abstract: With the rise of smart personal devices, service-oriented human-agent interactions have become increasingly prevalent. This trend highlights the need for personalized dialogue assistants that can understand user-specific traits to accurately interpret requirements and tailor responses to individual preferences. However, existing approaches often overlook the complexities of long-term interactions and fail to capture users’ subjective characteristics. To address these gaps, we present PAL-Bench, a new benchmark designed to evaluate the personalization capabilities of service-oriented assistants in long-term user-agent interactions. In the absence of available real-world data, we develop a multi-step LLM-based synthesis pipeline, which is further verified and refined by human annotators. This process yields PAL-Set, the first Chinese dataset comprising multi-session user logs and dialogue histories, which serves as the foundation for PAL-Bench. Furthermore, to improve personalized service-oriented interactions, we propose H$^2$Memory, a hierarchical and heterogeneous memory framework that incorporates retrieval-augmented generation to improve personalized response generation. Comprehensive experiments on both our PAL-Bench and an external dataset demonstrate the effectiveness of the proposed memory framework.

[72] AICC: Parse HTML Finer, Make Models Better – A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser

Ren Ma, Jiantao Qiu, Chao Xu, Pei Chu, Kaiwen Liu, Pengli Ren, Yuan Qu, Jiahui Peng, Linfeng Hou, Mengjie Liu, Lindong Lu, Wenchang Ning, Jia Yu, Rui Min, Jin Shi, Haojiong Chen, Peng Zhang, Wenjian Zhang, Qian Jiang, Zengjie Hu, Guoqiang Yang, Zhenxiang Li, Fukai Shang, Runyuan Ma, Chenlin Su, Zhongying Tu, Wentao Zhang, Dahua Lin, Conghui He

Main category: cs.CL

TL;DR: MinerU-HTML is a novel HTML-to-text extraction pipeline using a 0.6B-parameter language model that significantly outperforms heuristic methods like Trafilatura, achieving 81.8% ROUGE-N F1 vs 63.6% and better preserving structured elements.

Details

Motivation: Current web data curation focuses on filtering/deduplication while treating HTML extraction as fixed pre-processing. Heuristic extractors struggle to preserve document structure and corrupt elements like formulas, codes, and tables.

Method: Reformulates content extraction as sequence labeling using a 0.6B-parameter language model. Uses semantic understanding and two-stage formatting pipeline that categorizes semantic elements before converting to Markdown.

Result: Achieves 81.8% ROUGE-N F1 vs Trafilatura’s 63.6%, with excellent structured element preservation (90.9% for code blocks, 94.0% for formulas). AICC corpus (7.3T tokens) outperforms TfCC by 1.08pp on 13 benchmarks.

Conclusion: HTML extraction quality significantly impacts model capabilities and is a critical, often underestimated component of web corpus construction. Model-based approaches are inherently scalable compared to heuristic methods.

Abstract: While web data quality is crucial for large language models, most curation efforts focus on filtering and deduplication,treating HTML-to-text extraction as a fixed pre-processing step. Existing web corpora rely on heuristic-based extractors like Trafilatura, which struggle to preserve document structure and frequently corrupt structured elements such as formulas, codes, and tables. We hypothesize that improving extraction quality can be as impactful as aggressive filtering strategies for downstream performance. We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem solved by a 0.6B-parameter language model. Unlike text-density heuristics, MinerU-HTML leverages semantic understanding and employs a two-stage formatting pipeline that explicitly categorizes semantic elements before converting to Markdown. Crucially, its model-based approach is inherently scalable, whereas heuristic methods offer limited improvement pathways. On MainWebBench, our benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8% ROUGE-N F1 compared to Trafilatura’s 63.6%, with exceptional structured element preservation (90.9% for code blocks, 94.0% for formulas). Using MinerU-HTML, we construct AICC (AI-ready Common Crawl), a 7.3-trillion token multilingual corpus from two Common Crawl snapshots. In controlled pretraining experiments where AICC and Trafilatura-extracted TfCC undergo identical filtering, models trained on AICC (62B tokens) achieve 50.8% average accuracy across 13 benchmarks, outperforming TfCC by 1.08pp-providing direct evidence that extraction quality significantly impacts model capabilities. AICC also surpasses RefinedWeb and FineWeb on key benchmarks. We publicly release MainWebBench, MinerU-HTML, and AICC, demonstrating that HTML extraction is a critical, often underestimated component of web corpus construction.

[73] CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning

Jie He, Richard He Bai, Sinead Williamson, Jeff Z. Pan, Navdeep Jaitly, Yizhe Zhang

Main category: cs.CL

TL;DR: CLaRa is a unified framework that performs embedding-based compression and joint optimization for retrieval-augmented generation in a shared continuous space, achieving state-of-the-art performance on QA benchmarks.

Details

Motivation: Retrieval-augmented generation (RAG) suffers from long contexts and disjoint retrieval-generation optimization, creating inefficiencies in current approaches.

Method: Proposes CLaRa framework with SCP data synthesis for semantically rich compressed vectors, and trains reranker and generator end-to-end via a single language modeling loss using differentiable top-k estimator.

Result: Experiments show CLaRa achieves state-of-the-art compression and reranking performance, often surpassing text-based fine-tuned baselines across multiple QA benchmarks.

Conclusion: The unified optimization in CLaRa successfully aligns retrieval relevance with answer quality, providing an effective solution to RAG’s limitations.

Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but still suffers from long contexts and disjoint retrieval-generation optimization. In this work, we propose CLaRa (Continuous Latent Reasoning), a unified framework that performs embedding-based compression and joint optimization in a shared continuous space. To obtain semantically rich and retrievable compressed vectors, we introduce SCP, a key-preserving data synthesis framework using QA and paraphrase supervision. CLaRa then trains the reranker and generator end-to-end via a single language modeling loss, with gradients flowing through both modules using a differentiable top-k estimator. Theoretically, this unified optimization aligns retrieval relevance with answer quality. Experiments across multiple QA benchmarks show that CLaRa achieves state-of-the-art compression and reranking performance, often surpassing text-based fine-tuned baselines.

[74] DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

Rulin Shao, Akari Asai, Shannon Zejiang Shen, Hamish Ivison, Varsha Kishore, Jingming Zhuo, Xinran Zhao, Molly Park, Samuel G. Finlayson, David Sontag, Tyler Murray, Sewon Min, Pradeep Dasigi, Luca Soldaini, Faeze Brahman, Wen-tau Yih, Tongshuang Wu, Luke Zettlemoyer, Yoon Kim, Hannaneh Hajishirzi, Pang Wei Koh

Main category: cs.CL

TL;DR: RLER enables training open models for long-form deep research by co-evolving rubrics with the policy model, resulting in DR Tulu-8B that matches proprietary systems while being smaller and cheaper.

Details

Motivation: Existing open deep research models are trained on short-form QA tasks with verifiable rewards, which doesn't extend to realistic long-form research tasks.

Method: Reinforcement Learning with Evolving Rubrics (RLER) - constructing and maintaining rubrics that co-evolve with the policy model during training to provide discriminative, on-policy feedback.

Result: DR Tulu-8B substantially outperforms existing open deep research models and matches/exceeds proprietary systems across four long-form benchmarks in science, healthcare and general domains.

Conclusion: RLER successfully enables direct training of open models for open-ended long-form deep research, with released data, models, and MCP-based agent infrastructure to facilitate future research.

Abstract: Deep research models perform multi-step research to produce long-form, well-attributed answers. However, most open deep research models are trained on easily verifiable short-form QA tasks via reinforcement learning with verifiable rewards (RLVR), which does not extend to realistic long-form tasks. We address this with Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training; this allows the rubrics to incorporate information that the model has newly explored and to provide discriminative, on-policy feedback. Using RLER, we develop Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research. Across four long-form deep research benchmarks in science, healthcare and general domains, DR Tulu substantially outperforms existing open deep research models, and matches or exceeds proprietary deep research systems, while being significantly smaller and cheaper per query. To facilitate future research, we release all data, models, and code, including our new MCP-based agent infrastructure for deep research systems.

[75] A Systematic Analysis of Large Language Models with RAG-enabled Dynamic Prompting for Medical Error Detection and Correction

Farzad Ahmed, Joniel Augustine Jerome, Meliha Yetisgen, Özlem Uzuner

Main category: cs.CL

TL;DR: Retrieval-augmented dynamic prompting (RDP) outperforms zero-shot and static prompting for medical error detection and correction across nine LLMs, reducing false positives by 15% and improving recall.

Details

Motivation: Clinical documentation contains errors that compromise patient safety, and LLMs may help detect/correct them, but their behavior under different prompting strategies is unclear.

Method: Evaluated 9 LLMs using MEDEC dataset with three prompting strategies: zero-shot, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP) for error flag detection, error sentence detection, and error correction.

Result: RDP reduced FPR by ~15%, improved recall by 5-10% in error sentence detection, and generated more contextually accurate corrections compared to other methods.

Conclusion: RDP outperforms other prompting methods across diverse LLMs, improving detection accuracy, reducing false positives, and enhancing reliability of medical error correction.

Abstract: Objective: Clinical documentation contains factual, diagnostic, and management errors that can compromise patient safety. Large language models (LLMs) may help detect and correct such errors, but their behavior under different prompting strategies remains unclear. We evaluate zero-shot prompting, static prompting with random exemplars (SPR), and retrieval-augmented dynamic prompting (RDP) for three subtasks of medical error processing: error flag detection, error sentence detection, and error correction. Methods: Using the MEDEC dataset, we evaluated nine instruction-tuned LLMs (GPT, Claude, Gemini, and OpenAI o-series models). We measured performance using accuracy, recall, false-positive rate (FPR), and an aggregate score of ROUGE-1, BLEURT, and BERTScore for error correction. We also analyzed example outputs to identify failure modes and differences between LLM and clinician reasoning. Results: Zero-shot prompting showed low recall in both detection tasks, often missing abbreviation-heavy or atypical errors. SPR improved recall but increased FPR. Across all nine LLMs, RDP reduced FPR by about 15 percent, improved recall by 5 to 10 percent in error sentence detection, and generated more contextually accurate corrections. Conclusion: Across diverse LLMs, RDP outperforms zero-shot and SPR prompting. Using retrieved exemplars improves detection accuracy, reduces false positives, and enhances the reliability of medical error correction.

[76] MTA: A Merge-then-Adapt Framework for Personalized Large Language Model

Xiaopeng Li, Yuanjin Zheng, Wanyu Wang, wenlin zhang, Pengyue Jia, Yiqi Wang, Maolin Wang, Xuetao Wei, Xiangyu Zhao

Main category: cs.CL

TL;DR: MTA is a Merge-then-Adapt framework for Personalized LLMs that addresses scalability and sparse data issues by dynamically merging meta-LoRA modules and adding lightweight LoRA stacking for few-shot personalization.

Details

Motivation: Current PLLM approaches face two major limitations: linear storage costs with user count (unscalable) and suboptimal performance for users with sparse data from fine-tuning static models from scratch.

Method: Three-stage framework: 1) Construct shared Meta-LoRA Bank with anchor users and meta-personalization traits; 2) Adaptive LoRA Fusion to dynamically merge relevant anchor meta-LoRAs; 3) LoRA Stacking for Few-Shot Personalization using ultra-low-rank LoRA module.

Result: Extensive experiments on LaMP benchmark show MTA outperforms existing state-of-the-art methods across multiple tasks.

Conclusion: MTA provides an effective solution for scalable and flexible personalization in LLMs, eliminating user-specific storage needs and enabling effective personalization even with sparse user data.

Abstract: Personalized Large Language Models (PLLMs) aim to align model outputs with individual user preferences, a crucial capability for user-centric applications. However, the prevalent approach of fine-tuning a separate module for each user faces two major limitations: (1) storage costs scale linearly with the number of users, rendering the method unscalable; and (2) fine-tuning a static model from scratch often yields suboptimal performance for users with sparse data. To address these challenges, we propose MTA, a Merge-then-Adapt framework for PLLMs. MTA comprises three key stages. First, we construct a shared Meta-LoRA Bank by selecting anchor users and pre-training meta-personalization traits within meta-LoRA modules. Second, to ensure scalability and enable dynamic personalization combination beyond static models, we introduce an Adaptive LoRA Fusion stage. This stage retrieves and dynamically merges the most relevant anchor meta-LoRAs to synthesize a user-specific one, thereby eliminating the need for user-specific storage and supporting more flexible personalization. Third, we propose a LoRA Stacking for Few-Shot Personalization stage, which applies an additional ultra-low-rank, lightweight LoRA module on top of the merged LoRA. Fine-tuning this module enables effective personalization under few-shot settings. Extensive experiments on the LaMP benchmark demonstrate that our approach outperforms existing SOTA methods across multiple tasks.

[77] BengaliFig: A Low-Resource Challenge for Figurative and Culturally Grounded Reasoning in Bengali

Abdullah Al Sefat

Main category: cs.CL

TL;DR: BengaliFig is a Bengali challenge set with 435 culturally-grounded riddles to evaluate LLMs’ figurative and cultural reasoning in low-resource contexts, revealing weaknesses in metaphorical and culturally specific reasoning.

Details

Motivation: To address the gap in evaluating large language models on figurative and culturally grounded reasoning, especially in low-resource languages like Bengali, which are underrepresented in current benchmarks.

Method: Created BengaliFig dataset with 435 unique Bengali riddles from oral/literary traditions, annotated along five dimensions, and converted to multiple-choice format using AI-assisted pipeline. Evaluated 8 frontier LLMs with zero-shot and few-shot chain-of-thought prompting.

Result: LLMs showed consistent weaknesses in metaphorical and culturally specific reasoning, highlighting limitations in handling culturally grounded figurative language despite broad multilingual capabilities.

Conclusion: BengaliFig provides both a diagnostic tool for evaluating LLM robustness in low-resource cultural contexts and advances inclusive, heritage-aware NLP evaluation.

Abstract: Large language models excel on broad multilingual benchmarks but remain to be evaluated extensively in figurative and culturally grounded reasoning, especially in low-resource contexts. We present BengaliFig, a compact yet richly annotated challenge set that targets this gap in Bengali, a widely spoken low-resourced language. The dataset contains 435 unique riddles drawn from Bengali oral and literary traditions. Each item is annotated along five orthogonal dimensions capturing reasoning type, trap type, cultural depth, answer category, and difficulty, and is automatically converted to multiple-choice format through a constraint-aware, AI-assisted pipeline. We evaluate eight frontier LLMs from major providers under zero-shot and few-shot chain-of-thought prompting, revealing consistent weaknesses in metaphorical and culturally specific reasoning. BengaliFig thus contributes both a diagnostic probe for evaluating LLM robustness in low-resource cultural contexts and a step toward inclusive and heritage-aware NLP evaluation.

cs.CV

David Amebley, Sayanton Dibbo

Main category: cs.CV

TL;DR: This paper introduces a neuroscience-inspired topological regularization framework to enhance privacy resilience in multi-modal vision-language models against membership inference attacks, showing 24% reduction in attack success while maintaining model utility.

Details

Motivation: Multi-modal models introduce new privacy attack vectors, and while neuro-inspired approaches have shown resilience against adversarial attacks in unimodal systems, their effectiveness against privacy attacks in multi-modal models remains unexplored.

Method: Proposed a topological regularization framework (tau) applied to three VLMs (BLIP, PaliGemma 2, ViT-GPT2) across three datasets (COCO, CC3M, NoCaps), comparing baseline models with neuro variants (tau > 0) against black-box membership inference attacks.

Result: NEURO VLMs showed 24% mean ROC-AUC drop in MIA attack success on BLIP with COCO dataset while maintaining similar model utility (MPNet and ROUGE-2 metrics). Results were consistent across PaliGemma 2 and ViT-GPT2 models on CC3M and NoCaps datasets.

Conclusion: Neuro-inspired VLMs with topological regularization are more resilient against privacy attacks without significantly compromising model utility, contributing to understanding privacy risks in multi-modal models.

Abstract: In the age of agentic AI, the growing deployment of multi-modal models (MMs) has introduced new attack vectors that can leak sensitive training data in MMs, causing privacy leakage. This paper investigates a black-box privacy attack, i.e., membership inference attack (MIA) on multi-modal vision-language models (VLMs). State-of-the-art research analyzes privacy attacks primarily to unimodal AI-ML systems, while recent studies indicate MMs can also be vulnerable to privacy attacks. While researchers have demonstrated that biologically inspired neural network representations can improve unimodal model resilience against adversarial attacks, it remains unexplored whether neuro-inspired MMs are resilient against privacy attacks. In this work, we introduce a systematic neuroscience-inspired topological regularization (tau) framework to analyze MM VLMs resilience against image-text-based inference privacy attacks. We examine this phenomenon using three VLMs: BLIP, PaliGemma 2, and ViT-GPT2, across three benchmark datasets: COCO, CC3M, and NoCaps. Our experiments compare the resilience of baseline and neuro VLMs (with topological regularization), where the tau > 0 configuration defines the NEURO variant of VLM. Our results on the BLIP model using the COCO dataset illustrate that MIA attack success in NEURO VLMs drops by 24% mean ROC-AUC, while achieving similar model utility (similarities between generated and reference captions) in terms of MPNet and ROUGE-2 metrics. This shows neuro VLMs are comparatively more resilient against privacy attacks, while not significantly compromising model utility. Our extensive evaluation with PaliGemma 2 and ViT-GPT2 models, on two additional datasets: CC3M and NoCaps, further validates the consistency of the findings. This work contributes to the growing understanding of privacy risks in MMs and provides evidence on neuro VLMs privacy threat resilience.

[79] Pistachio: Towards Synthetic, Balanced, and Long-Form Video Anomaly Benchmarks

Jie Li, Hongyi Cai, Mingkang Dong, Muxin Pu, Shan You, Fei Wang, Tao Huang

Main category: cs.CV

TL;DR: Pistachio is a new Video Anomaly Detection/Understanding benchmark created using a controlled generation-based pipeline that addresses limitations in existing datasets by providing diverse scenes, balanced anomaly coverage, and temporal complexity.

Details

Motivation: Existing VAD benchmarks lack scene diversity, balanced anomaly coverage, and temporal complexity needed for reliable real-world assessment, while VAU requires deeper semantic reasoning but is difficult to benchmark due to heavy manual annotation requirements.

Method: Leverages video generation models with a pipeline that includes scene-conditioned anomaly assignment, multi-step storyline generation, and temporally consistent long-form synthesis to produce coherent 41-second videos with minimal human intervention.

Result: Pistachio demonstrates scale, diversity, and complexity that reveal new challenges for existing methods and effectively eliminates biases of Internet-collected datasets.

Conclusion: The benchmark motivates future research on dynamic and multi-event anomaly understanding by providing a controlled, generation-based alternative to traditional datasets.

Abstract: Automatically detecting abnormal events in videos is crucial for modern autonomous systems, yet existing Video Anomaly Detection (VAD) benchmarks lack the scene diversity, balanced anomaly coverage, and temporal complexity needed to reliably assess real-world performance. Meanwhile, the community is increasingly moving toward Video Anomaly Understanding (VAU), which requires deeper semantic and causal reasoning but remains difficult to benchmark due to the heavy manual annotation effort it demands. In this paper, we introduce Pistachio, a new VAD/VAU benchmark constructed entirely through a controlled, generation-based pipeline. By leveraging recent advances in video generation models, Pistachio provides precise control over scenes, anomaly types, and temporal narratives, effectively eliminating the biases and limitations of Internet-collected datasets. Our pipeline integrates scene-conditioned anomaly assignment, multi-step storyline generation, and a temporally consistent long-form synthesis strategy that produces coherent 41-second videos with minimal human intervention. Extensive experiments demonstrate the scale, diversity, and complexity of Pistachio, revealing new challenges for existing methods and motivating future research on dynamic and multi-event anomaly understanding.

[80] Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation

Inferix Team, Tianyu Feng, Yizeng Han, Jiahao He, Yuanyu He, Xi Lin, Teng Liu, Hanfeng Lu, Jiasheng Tang, Wei Wang, Zhiyuan Wang, Jichao Wu, Mingyang Yang, Yinghao Yu, Zeyu Zhang, Bohan Zhuang

Main category: cs.CV

TL;DR: Inferix is a next-generation inference engine optimized for semi-autoregressive decoding to enable immersive world synthesis and realistic video generation, distinguishing it from high-concurrency systems and traditional video diffusion models.

Details

Motivation: To advance world models as core simulators for agentic AI, embodied AI, and gaming by enabling long, physically realistic, and interactive high-quality video generation, moving beyond current LLM-centric vision foundation models.

Method: Utilizes semi-autoregressive (block-diffusion) decoding that combines diffusion and autoregressive methods, generating video tokens in blocks with diffusion within each block while conditioning on previous ones. Features LLM-style KV Cache management for efficient variable-length generation, interactive video streaming, profiling, and LV-Bench integration for benchmarking.

Result: Enables coherent and stable video sequences with efficient, variable-length, high-quality generation capabilities, overcoming limitations of standard video diffusion models.

Conclusion: Inferix represents a specialized inference engine for world simulation that fosters exploration of world models through optimized semi-autoregressive decoding and comprehensive evaluation tools, encouraging community collaboration to advance this paradigm.

Abstract: World models serve as core simulators for fields such as agentic AI, embodied AI, and gaming, capable of generating long, physically realistic, and interactive high-quality videos. Moreover, scaling these models could unlock emergent capabilities in visual perception, understanding, and reasoning, paving the way for a new paradigm that moves beyond current LLM-centric vision foundation models. A key breakthrough empowering them is the semi-autoregressive (block-diffusion) decoding paradigm, which merges the strengths of diffusion and autoregressive methods by generating video tokens in block-applying diffusion within each block while conditioning on previous ones, resulting in more coherent and stable video sequences. Crucially, it overcomes limitations of standard video diffusion by reintroducing LLM-style KV Cache management, enabling efficient, variable-length, and high-quality generation. Therefore, Inferix is specifically designed as a next-generation inference engine to enable immersive world synthesis through optimized semi-autoregressive decoding processes. This dedicated focus on world simulation distinctly sets it apart from systems engineered for high-concurrency scenarios (like vLLM or SGLang) and from classic video diffusion models (such as xDiTs). Inferix further enhances its offering with interactive video streaming and profiling, enabling real-time interaction and realistic simulation to accurately model world dynamics. Additionally, it supports efficient benchmarking through seamless integration of LV-Bench, a new fine-grained evaluation benchmark tailored for minute-long video generation scenarios. We hope the community will work together to advance Inferix and foster world model exploration.

[81] Video Object Recognition in Mobile Edge Networks: Local Tracking or Edge Detection?

Kun Guo, Yun Shen, Xijun Wang, Chaoqun You, Yun Rui, Tony Q. S. Quek

Main category: cs.CV

TL;DR: LTED-Ada is a deep reinforcement learning-based algorithm that adaptively selects between local tracking and edge detection for video object recognition, optimized for both single-device and multi-device scenarios using federated learning.

Details

Motivation: Resource-constrained devices like traffic cameras struggle with fast and accurate video object recognition. While mobile edge computing enables offloading computation-intensive detection to edge servers, the challenge lies in deciding when to use edge detection versus local tracking.

Method: Formulated long-term optimization problems for single/multi-device scenarios, proposed LTED-Ada using deep reinforcement learning to adaptively select between local tracking and edge detection based on frame rate, accuracy, and delay requirements. Enhanced with federated learning for multi-device collaboration.

Result: Extensive hardware-in-the-loop experiments using Raspberry Pi 4B devices and PC edge server demonstrated LTED-Ada’s superiority in performance.

Conclusion: LTED-Ada provides an effective solution for adaptive video object recognition in mobile edge computing environments, balancing accuracy and efficiency through intelligent detection/tracking selection.

Abstract: Fast and accurate video object recognition, which relies on frame-by-frame video analytics, remains a challenge for resource-constrained devices such as traffic cameras. Recent advances in mobile edge computing have made it possible to offload computation-intensive object detection to edge servers equipped with high-accuracy neural networks, while lightweight and fast object tracking algorithms run locally on devices. This hybrid approach offers a promising solution but introduces a new challenge: deciding when to perform edge detection versus local tracking. To address this, we formulate two long-term optimization problems for both single-device and multi-device scenarios, taking into account the temporal correlation of consecutive frames and the dynamic conditions of mobile edge networks. Based on the formulation, we propose the LTED-Ada in single-device setting, a deep reinforcement learning-based algorithm that adaptively selects between local tracking and edge detection, according to the frame rate as well as recognition accuracy and delay requirement. In multi-device setting, we further enhance LTED-Ada using federated learning to enable collaborative policy training across devices, thereby improving its generalization to unseen frame rates and performance requirements. Finally, we conduct extensive hardware-in-the-loop experiments using multiple Raspberry Pi 4B devices and a personal computer as the edge server, demonstrating the superiority of LTED-Ada.

[82] DeeAD: Dynamic Early Exit of Vision-Language Action for Efficient Autonomous Driving

Haibo HU, Lianming Huang, Nan Guan, Chun Jason Xue

Main category: cs.CV

TL;DR: DeeAD is a training-free early-exit framework that accelerates Vision-Language Action models for autonomous driving by terminating inference when trajectories align with planning priors, achieving 29% latency reduction.

Details

Motivation: VLA models suffer from significant inference latency due to deep transformer stacks, which limits their practical deployment in real-time autonomous driving applications.

Method: Uses action-guided early-exit with physical feasibility evaluation, multi-hop controller for adaptive layer skipping, and integrates without retraining by checking trajectory alignment with lightweight planning priors.

Result: Achieves up to 28% transformer-layer sparsity and 29% latency reduction on Bench2Drive benchmark while preserving planning quality and safety.

Conclusion: DeeAD provides an effective training-free acceleration method for VLA planning models that maintains performance while significantly reducing computational overhead.

Abstract: Vision-Language Action (VLA) models unify perception, reasoning, and trajectory generation for autonomous driving, but suffer from significant inference latency due to deep transformer stacks. We present DeeAD, a training-free, action-guided early-exit framework that accelerates VLA planning by evaluating the physical feasibility of intermediate trajectories. Instead of relying on confidence scores, DeeAD terminates inference when predicted trajectories align with lightweight planning priors (e.g., Navigation or Low-precision Planning) within a tolerable deviation (<2m). To improve efficiency, we introduce a multi-hop controller that adaptively skips redundant layers based on the change rate of scores. DeeAD integrates into existing VLA models, such as ORION, without requiring retraining. Experiments on the Bench2Drive benchmark demonstrate up to 28% transformer-layer sparsity and 29% latency reduction, while preserving planning quality and safety.

[83] Foundry: Distilling 3D Foundation Models for the Edge

Guillaume Letellier, Siddharth Srivastava, Frédéric Jurie, Gaurav Sharma

Main category: cs.CV

TL;DR: FMD is a new paradigm for compressing large SSL foundation models into compact proxies that retain general-purpose representational power, with Foundry as the first implementation for 3D point clouds.

Details

Motivation: Large foundation models are too computationally expensive for edge devices, and existing compression methods sacrifice the crucial generality that makes foundation models valuable.

Method: Foundry trains a student to learn compressed SuperTokens that reconstruct the teacher’s token-level representations, capturing a compact basis of its latent space.

Result: A single distilled model maintains strong transferability across diverse downstream tasks (classification, part segmentation, few-shot scenarios), approaching full foundation-model performance while using significantly fewer tokens and FLOPs.

Conclusion: FMD enables practical deployment of foundation models on resource-constrained hardware by creating efficient yet general-purpose compressed models.

Abstract: Foundation models pre-trained with self-supervised learning (SSL) on large-scale datasets have become powerful general-purpose feature extractors. However, their immense size and computational cost make them prohibitive for deployment on edge devices such as robots and AR/VR headsets. Existing compression techniques like standard knowledge distillation create efficient ‘specialist’ models but sacrifice the crucial, downstream-agnostic generality that makes foundation models so valuable. In this paper, we introduce Foundation Model Distillation (FMD), a new paradigm for compressing large SSL models into compact, efficient, and faithful proxies that retain their general-purpose representational power. We present Foundry, the first implementation of FMD for 3D point clouds. Our approach, Foundry, trains a student to learn a compressed set of SuperTokens that reconstruct the teacher’s token-level representations, capturing a compact basis of its latent space. A single distilled model maintains strong transferability across diverse downstream tasks-classification, part segmentation, and few-shot scenarios-approaching full foundation-model performance while using significantly fewer tokens and FLOPs, making such models more practical for deployment on resourceconstrained hardware.

[84] DinoLizer: Learning from the Best for Generative Inpainting Localization

Minh Thong Doi, Jan Butora, Vincent Itier, Jérémie Boulanger, Patrick Bas

Main category: cs.CV

TL;DR: DinoLizer is a DINOv2-based model for detecting manipulated regions in generative inpainting, using patch-level classification and sliding-window inference to achieve state-of-the-art localization performance.

Details

Motivation: To develop an effective method for localizing manipulated regions in generative inpainting by leveraging the strong representational power of Vision Transformers, particularly addressing the challenge of detecting semantically altered regions while ignoring non-semantic edits.

Method: Uses DINOv2 pretrained on B-Free dataset for synthetic image detection, adds linear classification head on patch embeddings for manipulation prediction at 14×14 resolution, employs sliding-window strategy for larger images, and post-processes heatmaps to refine binary masks.

Result: Outperforms state-of-the-art local manipulation detectors across various inpainting datasets, achieving 12% higher IoU on average with even greater gains after post-processing. Remains robust to common operations like resizing, noise addition, and JPEG compression.

Conclusion: DINOv2-based Vision Transformers demonstrate strong representational power for manipulation localization, with DinoLizer showing superior performance over existing methods and confirming its effectiveness through extensive ablation studies comparing DINOv2 and DINOv3.

Abstract: We introduce DinoLizer, a DINOv2-based model for localizing manipulated regions in generative inpainting. Our method builds on a DINOv2 model pretrained to detect synthetic images on the B-Free dataset. We add a linear classification head on top of the Vision Transformer’s patch embeddings to predict manipulations at a $14\times 14$ patch resolution. The head is trained to focus on semantically altered regions, treating non-semantic edits as part of the original content. Because the ViT accepts only fixed-size inputs, we use a sliding-window strategy to aggregate predictions over larger images; the resulting heatmaps are post-processed to refine the estimated binary manipulation masks. Empirical results show that DinoLizer surpasses state-of-the-art local manipulation detectors on a range of inpainting datasets derived from different generative models. It remains robust to common post-processing operations such as resizing, noise addition, and JPEG (double) compression. On average, DinoLizer achieves a 12% higher Intersection-over-Union (IoU) than the next best model, with even greater gains after post-processing. Our experiments with off-the-shelf DINOv2 demonstrate the strong representational power of Vision Transformers for this task. Finally, extensive ablation studies comparing DINOv2 and its successor, DINOv3, in deepfake localization confirm DinoLizer’s superiority. The code will be publicly available upon acceptance of the paper.

[85] CANVAS: A Benchmark for Vision-Language Models on Tool-Based User Interface Design

Daeheon Jeong, Seoyeon Byun, Kihoon Son, Dae Hyun Kim, Juho Kim

Main category: cs.CV

TL;DR: CANVAS is a benchmark for evaluating vision language models’ ability to perform tool-based UI design tasks through design software like Figma or Sketch.

Details

Motivation: There is no existing benchmark to evaluate VLMs' capacity to operate design software and iteratively refine UI designs, which is important for understanding their potential to collaborate with designers.

Method: Created CANVAS benchmark with 598 tool-based design tasks sampled from 3.3K mobile UI designs across 30 categories, featuring two task types: design replication (reproducing whole UI screens) and design modification (modifying specific parts of existing screens).

Result: Leading models show more strategic tool invocations that improve design quality, and common error patterns were identified to guide future improvements.

Conclusion: CANVAS provides the first benchmark for evaluating VLMs’ tool-based UI design capabilities, revealing current performance levels and error patterns to inform future development of design collaboration tools.

Abstract: User interface (UI) design is an iterative process in which designers progressively refine their work with design software such as Figma or Sketch. Recent advances in vision language models (VLMs) with tool invocation suggest these models can operate design software to edit a UI design through iteration. Understanding and enhancing this capacity is important, as it highlights VLMs’ potential to collaborate with designers within conventional software. However, as no existing benchmark evaluates tool-based design performance, the capacity remains unknown. To address this, we introduce CANVAS, a benchmark for VLMs on tool-based user interface design. Our benchmark contains 598 tool-based design tasks paired with ground-truth references sampled from 3.3K mobile UI designs across 30 function-based categories (e.g., onboarding, messaging). In each task, a VLM updates the design step-by-step through context-based tool invocations (e.g., create a rectangle as a button background), linked to design software. Specifically, CANVAS incorporates two task types: (i) design replication evaluates the ability to reproduce a whole UI screen; (ii) design modification evaluates the ability to modify a specific part of an existing screen. Results suggest that leading models exhibit more strategic tool invocations, improving design quality. Furthermore, we identify common error patterns models exhibit, guiding future work in enhancing tool-based design capabilities.

[86] MODEST: Multi-Optics Depth-of-Field Stereo Dataset

Nisarg K. Trivedi, Vinayak A. Belludi, Li-Yun Wang, Pardis Taghavi, Dante Lok

Main category: cs.CV

TL;DR: A high-resolution stereo DSLR dataset with 18,000 images captured across 50 optical configurations (10 focal lengths, 5 apertures) to address the lack of real-world optical data for depth estimation and related tasks.

Details

Motivation: Current depth estimation research is limited by the lack of large-scale, high-fidelity real stereo DSLR datasets, which restricts real-world generalization and evaluation of models trained on synthetic data.

Method: Captured 18,000 high-resolution (5472×3648px) stereo images across 9 scenes with varying complexity, using two identical camera assemblies at 10 focal lengths (28-70mm) and 5 apertures (f/2.8-f/22), totaling 50 optical configurations per scene.

Result: Created the first comprehensive stereo DSLR dataset enabling controlled analysis of geometric and optical effects for depth estimation, depth-of-field rendering, deblurring, 3D reconstruction, and novel view synthesis.

Conclusion: The dataset bridges the realism gap between synthetic training data and real camera optics, revealing challenges with current state-of-the-art methods, and is released to support reproducible research on real-world optical generalization.

Abstract: Reliable depth estimation under real optical conditions remains a core challenge for camera vision in systems such as autonomous robotics and augmented reality. Despite recent progress in depth estimation and depth-of-field rendering, research remains constrained by the lack of large-scale, high-fidelity, real stereo DSLR datasets, limiting real-world generalization and evaluation of models trained on synthetic data as shown extensively in literature. We present the first high-resolution (5472$\times$3648px) stereo DSLR dataset with 18000 images, systematically varying focal length and aperture across complex real scenes and capturing the optical realism and complexity of professional camera systems. For 9 scenes with varying scene complexity, lighting and background, images are captured with two identical camera assemblies at 10 focal lengths (28-70mm) and 5 apertures (f/2.8-f/22), spanning 50 optical configurations in 2000 images per scene. This full-range optics coverage enables controlled analysis of geometric and optical effects for monocular and stereo depth estimation, shallow depth-of-field rendering, deblurring, 3D scene reconstruction and novel view synthesis. Each focal configuration has a dedicated calibration image set, supporting evaluation of classical and learning based methods for intrinsic and extrinsic calibration. The dataset features challenging visual elements such as multi-scale optical illusions, reflective surfaces, mirrors, transparent glass walls, fine-grained details, and natural / artificial ambient light variations. This work attempts to bridge the realism gap between synthetic training data and real camera optics, and demonstrates challenges with the current state-of-the-art monocular, stereo depth and depth-of-field methods. We release the dataset, calibration files, and evaluation code to support reproducible research on real-world optical generalization.

[87] Text-Guided Semantic Image Encoder

Raghuveer Thirukovalluru, Xiaochuang Han, Bhuwan Dhingra, Emily Dinan, Maha Elbayad

Main category: cs.CV

TL;DR: TIE is a text-guided semantic image encoder that generates image representations conditioned on text queries, improving vision-language model performance and efficiency.

Details

Motivation: Standard image encoders in VLMs process images agnostically without considering downstream tasks or text queries, limiting their effectiveness.

Method: Proposed Text-Guided Semantic Image Encoder (TIE) that generates image representations conditioned on input text queries through text-conditioned training.

Result: TIE-based VLMs outperform conventional counterparts by +1.5 and +1.3 points on average across nine benchmarks, with up to 6-point gains on DocVQA and InfoVQA, while using only half the image tokens for improved efficiency.

Conclusion: TIE effectively optimizes encoders to capture key visual features, generalizes well with generic queries, and enhances interpretability through query-specific attention.

Abstract: Image encoders, a fundamental component of vision-language models (VLMs), are typically pretrained independently before being aligned with a language model. This standard paradigm results in encoders that process images agnostically, without regard to the specific downstream task or text query. To address this limitation, we propose the Text-Guided Semantic Image Encoder (TIE), which generates image representations conditioned on the input text query. VLMs equipped with TIE outperform their conventional counterparts by +1.5 and +1.3 points on average across nine image-to-text benchmarks at the 1B and 3B scales, respectively, with gains reaching up to 6 points on tasks such as DocVQA and InfoVQA. Moreover, TIE-based VLMs attain superior performance while utilizing only half as many image tiles (tokens), resulting in notably improved inference efficiency. TIE also generalizes well with generic queries, indicating that text-conditioned training effectively optimizes the encoder to capture key visual features. Qualitative analysis confirms that TIE consistently attends to query-relevant regions, enhancing both interpretability and query-specific grounding.

[88] One Patch is All You Need: Joint Surface Material Reconstruction and Classification from Minimal Visual Cues

Sindhuja Penchala, Gavin Money, Gabriel Marques, Samuel Wood, Jessica Kirschman, Travis Atkison, Shahram Rahimi, Noorbakhsh Amiri Golilarz

Main category: cs.CV

TL;DR: SMARC is a unified model for surface material reconstruction and classification from minimal visual input (single 10% image patch), achieving state-of-the-art performance in both tasks.

Details

Motivation: Existing methods require dense or full-scene observations, limiting effectiveness in constrained environments. Need for surface understanding from sparse visual cues in robotics, simulation, and material perception applications.

Method: Combines Partial Convolutional U-Net with classification head for spatial inpainting and semantic understanding. Processes only 10% contiguous image patch to reconstruct full RGB surface and classify material category.

Result: Achieves PSNR of 17.55 dB for reconstruction and 85.10% accuracy for material classification on Touch and Go dataset, outperforming five baseline models including ViT, MAE, Swin Transformer, and DETR.

Conclusion: Demonstrates advantages of partial convolution for spatial reasoning under missing data, establishing foundation for minimal-vision surface understanding in constrained environments.

Abstract: Understanding material surfaces from sparse visual cues is critical for applications in robotics, simulation, and material perception. However, most existing methods rely on dense or full-scene observations, limiting their effectiveness in constrained or partial view environment. To address this challenge, we introduce SMARC, a unified model for Surface MAterial Reconstruction and Classification from minimal visual input. By giving only a single 10% contiguous patch of the image, SMARC recognizes and reconstructs the full RGB surface while simultaneously classifying the material category. Our architecture combines a Partial Convolutional U-Net with a classification head, enabling both spatial inpainting and semantic understanding under extreme observation sparsity. We compared SMARC against five models including convolutional autoencoders [17], Vision Transformer (ViT) [13], Masked Autoencoder (MAE) [5], Swin Transformer [9], and DETR [2] using Touch and Go dataset [16] of real-world surface textures. SMARC achieves state-of-the-art results with a PSNR of 17.55 dB and a material classification accuracy of 85.10%. Our findings highlight the advantages of partial convolution in spatial reasoning under missing data and establish a strong foundation for minimal-vision surface understanding.

[89] LongVT: Incentivizing “Thinking with Long Videos” via Native Tool Calling

Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Chengwei Qin, Shijian Lu, Xingxuan Li, Lidong Bing

Main category: cs.CV

TL;DR: LongVT is an agentic framework that enables multimodal reasoning with long videos through interleaved tool use and chain-of-thought, addressing hallucinations in large multimodal models by implementing global-to-local video analysis.

Details

Motivation: Large multimodal models struggle with hallucinations when processing long-form videos where evidence is sparse and temporally dispersed, similar to how humans need to skim globally and examine relevant clips for details.

Method: Uses LMMs’ temporal grounding ability as a native video cropping tool to zoom in on specific clips and resample finer-grained frames, creating a global-to-local reasoning loop. Employs a three-stage training strategy with tool-integrated supervised fine-tuning and agentic reinforcement learning.

Result: Consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks, with comprehensive training on 247.9K+ samples and evaluation on 1,280 QA pairs.

Conclusion: LongVT effectively addresses hallucination issues in long video reasoning through agentic multimodal chain-of-tool-thought, providing a robust framework for evidence-grounded video understanding with publicly available code, data, and models.

Abstract: Large multimodal models (LMMs) have shown great potential for video reasoning with textual Chain-of-Thought. However, they remain vulnerable to hallucinations, especially when processing long-form videos where evidence is sparse and temporally dispersed. Inspired by how humans comprehend long videos - by first skimming globally and then examining relevant clips for details - we introduce LongVT, an end-to-end agentic framework that enables “Thinking with Long Videos” via interleaved Multimodal Chain-of-Tool-Thought. Specifically, we exploit LMMs’ inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. This global-to-local reasoning loop continues until answers are grounded in retrieved visual evidence. Given the scarcity of fine-grained question-answering (QA) data for the long video reasoning task, we curate and will release a data suite named VideoSIAH to facilitate both training and evaluation. Specifically, our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning, respectively. Our evaluation benchmark consists of 1,280 QA pairs that are carefully curated through a semi-automatic data pipeline with human-in-the-loop validation. With a meticulously designed three-stage training strategy and extensive empirical validation, LongVT consistently outperforms existing strong baselines across four challenging long-video understanding and reasoning benchmarks. Our codes, data, and model checkpoints are publicly available at https://github.com/EvolvingLMMs-Lab/LongVT .

[90] Revisiting KRISP: A Lightweight Reproduction and Analysis of Knowledge-Enhanced Vision-Language Models

Souradeep Dutta, Keshav Bulia, Neena S Nair

Main category: cs.CV

TL;DR: Lightweight reproduction of KRISP model with significantly fewer parameters, achieving 75% of original performance while revealing design flaws and enabling edge device deployment.

Details

Motivation: To create a more efficient and accessible version of KRISP that addresses computational demands and industrial-scale requirements, making knowledge-enhanced VQA suitable for resource-constrained environments.

Method: Systematic replication with reduced parameters, ablation studies on synthetic VQA data, evaluation on DAQUAR dataset, and constrained knowledge graph domain to prevent AI hallucinations.

Result: Replicated model achieves about 75% of original KRISP performance while being significantly more lightweight, enabling deployment on edge devices like smartphones and AR-VR.

Conclusion: The study demonstrates that knowledge-enhanced VQA architectures can be effectively scaled down for resource-constrained environments while maintaining reasonable performance and preventing hallucinations through domain constraints.

Abstract: Facebook AI Research introduced KRISP [4], which integrates structured external knowledge into pipelines for vision-language reasoning. Despite its effectiveness, the original model has been developed for industrial-scale training, is computationally demanding, and is tightly connected to a large backbone. In this work, we reexamine KRISP from a different angle and offer a lightweight reproduction with significantly fewer parameters. Even though our replicated model performs about 75 % of the original, the replication process uncovers a number of design flaws, real-world pitfalls, and implicit problems that were not fully covered in the original paper. We offer insights into the scalability and efficacy of knowledge-enhanced VQA architectures under resource constraints through systematic ablation studies, which include a proof-of-concept on synthetic VQA data and evaluation on the DAQUAR dataset. Our model, configured with a low parameter setup and constrained by the external Knowledge graph domain, prevents AI hallucinations and generates outputs solely within that domain. Minimal parameters allow us to function on edge devices like smartphones and AR-VR, further improving offline visual reasoning.

[91] Intriguing Properties of Dynamic Sampling Networks

Dario Morle, Reid Zaffino

Main category: cs.CV

TL;DR: This paper develops a unified theoretical framework called ‘warping’ that generalizes dynamic sampling mechanisms in deep learning, analyzes their statistical properties, and reveals unique training asymmetries between forward and backward passes.

Details

Motivation: To unify the theoretical analysis of various dynamic sampling methods in computer vision models, as existing methods like deformable convolutions and spatial transformer networks lack a common analytical framework.

Method: Developed a novel ‘warping’ operator that generalizes existing dynamic sampling methods, providing statistical analysis of inputs as IID variables and homogeneous random fields, and introducing gradient-based loss landscape visualization.

Result: Discovered unique asymmetry between forward and backward passes in training, identified warping as an orthogonal class of operators to traditional convolutions, and established conditions for stable training of dynamic sampling networks.

Conclusion: The warping framework successfully unifies analysis of dynamic sampling mechanisms, reveals fundamental differences from traditional operators, and provides theoretical foundations for understanding and stabilizing these architectures.

Abstract: Dynamic sampling mechanisms in deep learning architectures have demonstrated utility across many computer vision models, though the theoretical analysis of these structures has not yet been unified. In this paper we connect the various dynamic sampling methods by developing and analyzing a novel operator which generalizes existing methods, which we term “warping”. Warping provides a minimal implementation of dynamic sampling which is amenable to analysis, and can be used to reconstruct existing architectures including deformable convolutions, active convolutional units, and spatial transformer networks. Using our formalism, we provide statistical analysis of the operator by modeling the inputs as both IID variables and homogeneous random fields. Extending this analysis, we discover a unique asymmetry between the forward and backward pass of the model training. We demonstrate that these mechanisms represent an entirely different class of orthogonal operators to the traditional translationally invariant operators defined by convolutions. With a combination of theoretical analysis and empirical investigation, we find the conditions necessary to ensure stable training of dynamic sampling networks. In addition, statistical analysis of discretization effects are studied. Finally, we introduce a novel loss landscape visualization which utilizes gradient update information directly, to better understand learning behavior.

[92] One-Step Diffusion-Based Image Compression with Semantic Distillation

Naifu Xue, Zhaoyang Jia, Jiahao Li, Bin Li, Yuan Zhang, Yan Lu

Main category: cs.CV

TL;DR: OneDC is a one-step diffusion-based generative image codec that eliminates iterative sampling latency while achieving state-of-the-art perceptual quality through semantic guidance from hyperpriors and hybrid optimization.

Details

Motivation: To address the unpleasing latency introduced by iterative sampling in diffusion-based generative image codecs, while maintaining high compression performance.

Method: Integrates latent compression with one-step diffusion generation, uses hyperprior as semantic guidance instead of text prompts, employs semantic distillation from pretrained generative tokenizer, and applies hybrid pixel- and latent-domain optimization.

Result: Achieves SOTA perceptual quality with one-step generation, offering over 39% bitrate reduction and 20x faster decoding compared to prior multi-step diffusion-based codecs.

Conclusion: Multi-step sampling is not necessary for generative compression, and one-step diffusion codecs can achieve superior performance with proper semantic guidance and optimization techniques.

Abstract: While recent diffusion-based generative image codecs have shown impressive performance, their iterative sampling process introduces unpleasing latency. In this work, we revisit the design of a diffusion-based codec and argue that multi-step sampling is not necessary for generative compression. Based on this insight, we propose OneDC, a One-step Diffusion-based generative image Codec – that integrates a latent compression module with a one-step diffusion generator. Recognizing the critical role of semantic guidance in one-step diffusion, we propose using the hyperprior as a semantic signal, overcoming the limitations of text prompts in representing complex visual content. To further enhance the semantic capability of the hyperprior, we introduce a semantic distillation mechanism that transfers knowledge from a pretrained generative tokenizer to the hyperprior codec. Additionally, we adopt a hybrid pixel- and latent-domain optimization to jointly enhance both reconstruction fidelity and perceptual realism. Extensive experiments demonstrate that OneDC achieves SOTA perceptual quality even with one-step generation, offering over 39% bitrate reduction and 20x faster decoding compared to prior multi-step diffusion-based codecs. Project: https://onedc-codec.github.io/

Kriti Ghosh, Devjyoti Chakraborty, Lakshmish Ramaswamy, Suchendra M. Bhandarkar, In Kee Kim, Nancy O’Hare, Deepak Mishra

Main category: cs.CV

TL;DR: Δ-NeRF is a modular residual framework for incremental refinement of Neural Radiance Fields that enables efficient updates without catastrophic forgetting, achieving comparable performance to joint training while reducing training time by 30-42%.

Details

Motivation: Existing NeRF frameworks require complete retraining when new views are added incrementally, which is problematic for sequential data scenarios like satellite terrain analysis where regions are repeatedly observed over time.

Method: Proposes a residual controller that injects per-layer corrections into a frozen base NeRF, uncertainty-aware gating to prevent overcorrection, view selection to reduce training data by 47%, and knowledge distillation to compress the model to 20% of original size.

Result: Achieves performance comparable to joint training while reducing training time by 30-42%, outperforms baselines with up to 43.5% PSNR improvement over naive fine-tuning, and surpasses joint training on some metrics.

Conclusion: Δ-NeRF provides an effective solution for incremental NeRF refinement that avoids catastrophic forgetting and enables efficient updates without access to past data, making it suitable for sequential data applications like satellite imagery analysis.

Abstract: Neural Radiance Fields (NeRFs) have demonstrated remarkable capabilities in 3D reconstruction and novel view synthesis. However, most existing NeRF frameworks require complete retraining when new views are introduced incrementally, limiting their applicability in domains where data arrives sequentially. This limitation is particularly problematic in satellite-based terrain analysis, where regions are repeatedly observed over time. Incremental refinement of NeRFs remains underexplored, and naive approaches suffer from catastrophic forgetting when past data is unavailable. We propose $Δ$-NeRF, a unique modular residual framework for incremental NeRF refinement. $Δ$-NeRF introduces several novel techniques including: (1) a residual controller that injects per-layer corrections into a frozen base NeRF, enabling refinement without access to past data; (2) an uncertainty-aware gating mechanism that prevents overcorrection by adaptively combining base and refined predictions; and (3) a view selection strategy that reduces training data by up to 47% while maintaining performance. Additionally, we employ knowledge distillation to compress the enhanced model into a compact student network (20% of original size). Experiments on satellite imagery demonstrate that $Δ$-NeRF achieves performance comparable to joint training while reducing training time by 30-42%. $Δ$-NeRF consistently outperforms existing baselines, achieving an improvement of up to 43.5% in PSNR over naive fine-tuning and surpassing joint training on some metrics.

[94] GreenHyperSpectra: A multi-source hyperspectral dataset for global vegetation trait prediction

Eya Cherif, Arthur Ouaknine, Luke A. Brown, Phuong D. Dao, Kyle R. Kovach, Bing Lu, Daniel Mederer, Hannes Feilhauer, Teja Kattenborn, David Rolnick

Main category: cs.CV

TL;DR: GreenHyperSpectra is a pretraining dataset for plant trait prediction using hyperspectral data, addressing label scarcity and domain shifts across sensors and ecosystems through semi- and self-supervised learning methods.

Details

Motivation: Plant traits are crucial for biodiversity and climate studies, but field sampling cannot cover meaningful spatial scales. Machine learning with hyperspectral data offers a solution, but faces challenges with label scarcity and domain shifts across sensors and ecosystems.

Method: Created GreenHyperSpectra dataset with cross-sensor and cross-ecosystem samples, used pretraining with semi- and self-supervised methods, and evaluated models in both in-distribution and out-of-distribution scenarios.

Result: Pretrained label-efficient multi-output regression models outperformed state-of-the-art supervised baselines, showing substantial improvements in learning spectral representations for trait prediction.

Conclusion: Established a comprehensive methodological framework that advances research at the intersection of representation learning and plant functional traits assessment, with all code and data made publicly available.

Abstract: Plant traits such as leaf carbon content and leaf mass are essential variables in the study of biodiversity and climate change. However, conventional field sampling cannot feasibly cover trait variation at ecologically meaningful spatial scales. Machine learning represents a valuable solution for plant trait prediction across ecosystems, leveraging hyperspectral data from remote sensing. Nevertheless, trait prediction from hyperspectral data is challenged by label scarcity and substantial domain shifts (\eg across sensors, ecological distributions), requiring robust cross-domain methods. Here, we present GreenHyperSpectra, a pretraining dataset encompassing real-world cross-sensor and cross-ecosystem samples designed to benchmark trait prediction with semi- and self-supervised methods. We adopt an evaluation framework encompassing in-distribution and out-of-distribution scenarios. We successfully leverage GreenHyperSpectra to pretrain label-efficient multi-output regression models that outperform the state-of-the-art supervised baseline. Our empirical analyses demonstrate substantial improvements in learning spectral representations for trait prediction, establishing a comprehensive methodological framework to catalyze research at the intersection of representation learning and plant functional traits assessment. All code and data are available at: https://github.com/echerif18/HyspectraSSL.

[95] Layer-Aware Video Composition via Split-then-Merge

Ozgur Kara, Yujia Chen, Ming-Hsuan Yang, James M. Rehg, Wen-Sheng Chu, Du Tran

Main category: cs.CV

TL;DR: Split-then-Merge (StM) is a novel framework that enhances control in generative video composition by splitting unlabeled videos into foreground/background layers and self-composing them to learn compositional dynamics, addressing data scarcity without annotated datasets.

Details

Motivation: To address data scarcity in generative video composition and enhance control over dynamic subject-scene interactions without relying on annotated datasets or handcrafted rules.

Method: Splits unlabeled videos into dynamic foreground and background layers, then self-composes them using transformation-aware training with multi-layer fusion, augmentation for affordance-aware composition, and identity-preservation loss.

Result: Outperforms state-of-the-art methods in both quantitative benchmarks and human/VLLM-based qualitative evaluations.

Conclusion: StM effectively learns complex compositional dynamics for realistic video generation through self-supervised learning on unlabeled video data.

Abstract: We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and address its data scarcity problem. Unlike conventional methods relying on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a novel transformation-aware training pipeline that utilizes a multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show StM outperforms SoTA methods in both quantitative benchmarks and in humans/VLLM-based qualitative evaluations. More details are available at our project page: https://split-then-merge.github.io

[96] SPHINX: A Synthetic Environment for Visual Perception and Reasoning

Md Tanvirul Alam, Saksham Aggarwal, Justin Yang Chae, Nidhi Rastogi

Main category: cs.CV

TL;DR: Sphinx is a synthetic environment for visual perception and reasoning that generates puzzles with verifiable solutions, covering 25 cognitive tasks. Current LVLMs perform poorly (51.1% accuracy), but RLVR training significantly improves performance.

Details

Motivation: To create a precise evaluation framework for visual reasoning that targets core cognitive primitives and enables large-scale dataset construction with verifiable ground-truth solutions.

Method: Procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable solutions. Evaluates models on 25 task types and applies reinforcement learning with verifiable rewards (RLVR).

Result: State-of-the-art GPT-5 achieves only 51.1% accuracy, well below human performance. RLVR training substantially improves model accuracy on Sphinx tasks and yields gains on external visual reasoning benchmarks.

Conclusion: Sphinx provides a rigorous benchmark for visual reasoning, revealing significant gaps in current LVLMs. RLVR shows promise for advancing multimodal reasoning capabilities.

Abstract: We present Sphinx, a synthetic environment for visual perception and reasoning that targets core cognitive primitives. Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision-language models (LVLMs) shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks, highlighting its promise for advancing multimodal reasoning.

[97] Training-Free Diffusion Priors for Text-to-Image Generation via Optimization-based Visual Inversion

Samuele Dell’Erba, Andrew D. Bagdanov

Main category: cs.CV

TL;DR: Replaces expensive trained diffusion priors with Optimization-based Visual Inversion (OVI) - a training-free, data-free method that optimizes latent representations to match text embeddings, achieving comparable results with better visual fidelity.

Details

Motivation: Current diffusion models rely on computationally expensive prior networks that require massive training datasets. This work challenges whether such trained priors are necessary at all.

Method: Uses OVI to initialize random pseudo-tokens and iteratively optimize them to maximize cosine similarity with text embeddings. Introduces two constraints: Mahalanobis-based and Nearest-Neighbor losses to regularize optimization toward realistic image distributions.

Result: OVI serves as viable alternative to traditional priors. Reveals critical flaw in current benchmarks where text embedding alone scores high despite poor quality. Constrained OVI improves visual fidelity, with Nearest-Neighbor approach achieving scores comparable to state-of-the-art data-efficient priors.

Conclusion: Training-free OVI methods can effectively replace expensive diffusion priors, suggesting the idea merits further investigation and highlighting issues with current evaluation benchmarks.

Abstract: Diffusion models have established the state-of-the-art in text-to-image generation, but their performance often relies on a diffusion prior network to translate text embeddings into the visual manifold for easier decoding. These priors are computationally expensive and require extensive training on massive datasets. In this work, we challenge the necessity of a trained prior at all by employing Optimization-based Visual Inversion (OVI), a training-free and data-free alternative, to replace the need for a prior. OVI initializes a latent visual representation from random pseudo-tokens and iteratively optimizes it to maximize the cosine similarity with input textual prompt embedding. We further propose two novel constraints, a Mahalanobis-based and a Nearest-Neighbor loss, to regularize the OVI optimization process toward the distribution of realistic images. Our experiments, conducted on Kandinsky 2.2, show that OVI can serve as an alternative to traditional priors. More importantly, our analysis reveals a critical flaw in current evaluation benchmarks like T2I-CompBench++, where simply using the text embedding as a prior achieves surprisingly high scores, despite lower perceptual quality. Our constrained OVI methods improve visual fidelity over this baseline, with the Nearest-Neighbor approach proving particularly effective, achieving quantitative scores comparable to or higher than the state-of-the-art data-efficient prior, indicating that the idea merits further investigation. The code will be publicly available upon acceptance.

Roman Naeem, David Hagerman, Jennifer Alvén, Fredrik Kahl

Main category: cs.CV

TL;DR: RefTr is a 3D image-to-graph model for vascular tree centerline detection using recurrent refinement of confluent trajectories with Producer-Refiner Transformer architecture, achieving high recall and efficiency.

Details

Motivation: Accurate centerline detection with correct tree topology is critical for clinical applications like diagnosis and surgical navigation, where missing small branches can lead to fatal errors due to incomplete assessments.

Method: Uses Producer-Refiner Transformer architecture where Producer proposes initial confluent trajectories and Refiner recurrently refines them. Introduces efficient non-maximum suppression for spatial tree graphs to merge duplicate branches.

Result: Achieves superior recall and comparable precision to previous SOTA across multiple datasets, with 2.4x reduction in decoder parameters and faster inference.

Conclusion: RefTr demonstrates potential as a new state-of-the-art framework for vascular tree analysis in 3D medical imaging, offering improved efficiency and performance.

Abstract: Tubular trees, such as blood vessels and lung airways, are essential for material transport within the human body. Accurately detecting their centerlines with correct tree topology is critical for clinical tasks such as diagnosis, treatment planning, and surgical navigation. In these applications, maintaining high recall is crucial, as missing small branches can result in fatal mistakes caused by incomplete assessments or undetected abnormalities. We present RefTr, a 3D image-to-graph model for centerline generation of vascular trees via recurrent refinement of confluent trajectories. RefTr uses a Producer-Refiner architecture based on a Transformer decoder, where the Producer proposes a set of initial confluent trajectories that are recurrently refined by the Refiner to produce final trajectories, which forms the centerline graph. The confluent trajectory representation enables refinement of complete trajectories while explicitly enforcing a valid tree topology. The recurrent refinement scheme improves precision and reuses the same Refiner block across multiple steps, yielding a 2.4x reduction in decoder parameters compared to previous SOTA. We also introduce an efficient non-maximum suppression algorithm for spatial tree graphs to merge duplicate branches and boost precision. Across multiple public centerline datasets, RefTr achieves superior recall and comparable precision to previous SOTA, while offering faster inference and substantially fewer parameters, demonstrating its potential as a new state-of-the-art framework for vascular tree analysis in 3D medical imaging.

[99] Unsupervised Memorability Modeling from Tip-of-the-Tongue Retrieval Queries

Sree Bhattacharyya, Yaman Kumar Singla, Sudhir Yarram, Somesh Kumar Singh, Harini S, James Z. Wang

Main category: cs.CV

TL;DR: This paper introduces the first large-scale unsupervised dataset for visual memorability using tip-of-the-tongue queries from Reddit, containing 82,000+ videos with recall descriptions, enabling better recall generation and ToT retrieval than state-of-the-art models.

Details

Motivation: Existing visual memorability datasets are limited by expensive human annotations, lack diversity/scalability, and only capture aggregate scores rather than nuanced recall signals from natural descriptions.

Method: Leverage tip-of-the-tongue retrieval queries from online platforms like Reddit to create an unsupervised dataset, then fine-tune large vision-language models and use contrastive training for multimodal ToT retrieval.

Result: Models fine-tuned on the dataset outperform GPT-4o in generating open-ended memorability descriptions and create the first model capable of multimodal ToT retrieval.

Conclusion: The unsupervised dataset and models provide a novel direction for visual content memorability research, offering rich signals for recall generation and ToT retrieval tasks.

Abstract: Visual content memorability has intrigued the scientific community for decades, with applications ranging widely, from understanding nuanced aspects of human memory to enhancing content design. A significant challenge in progressing the field lies in the expensive process of collecting memorability annotations from humans. This limits the diversity and scalability of datasets for modeling visual content memorability. Most existing datasets are limited to collecting aggregate memorability scores for visual content, not capturing the nuanced memorability signals present in natural, open-ended recall descriptions. In this work, we introduce the first large-scale unsupervised dataset designed explicitly for modeling visual memorability signals, containing over 82,000 videos, accompanied by descriptive recall data. We leverage tip-of-the-tongue (ToT) retrieval queries from online platforms such as Reddit. We demonstrate that our unsupervised dataset provides rich signals for two memorability-related tasks: recall generation and ToT retrieval. Large vision-language models fine-tuned on our dataset outperform state-of-the-art models such as GPT-4o in generating open-ended memorability descriptions for visual content. We also employ a contrastive training strategy to create the first model capable of performing multimodal ToT retrieval. Our dataset and models present a novel direction, facilitating progress in visual content memorability research.

[100] Estimating Fog Parameters from a Sequence of Stereo Images

Yining Ding, João F. C. Mota, Andrew M. Wallace, Sen Wang

Main category: cs.CV

TL;DR: Proposes simultaneous fog parameter estimation method for stereo images, handles locally homogeneous fog, integrates with SLAM systems, and introduces SDIRF dataset with real foggy road scenes.

Details

Motivation: Previous fog parameter estimation methods suffer from error propagation due to sequential estimation. Real-world fog is often globally inhomogeneous, requiring more robust approaches.

Method: Simultaneous estimation of all fog parameters through novel optimization, assuming locally homogeneous fog. Creates SDIRF dataset with calibrated photometric parameters and clear weather counterparts.

Result: Superior performance on both synthetic and real foggy data from SDIRF, producing most accurate estimates and better adaptation to real fog conditions.

Conclusion: The method effectively handles real-world fog, can integrate with existing SLAM systems, and the SDIRF dataset advances fog perception research.

Abstract: We propose a method which, given a sequence of stereo foggy images, estimates the parameters of a fog model and updates them dynamically. In contrast with previous approaches, which estimate the parameters sequentially and thus are prone to error propagation, our algorithm estimates all the parameters simultaneously by solving a novel optimisation problem. By assuming that fog is only locally homogeneous, our method effectively handles real-world fog, which is often globally inhomogeneous. The proposed algorithm can be easily used as an add-on module in existing visual Simultaneous Localisation and Mapping (SLAM) or odometry systems in the presence of fog. In order to assess our method, we also created a new dataset, the Stereo Driving In Real Fog (SDIRF), consisting of high-quality, consecutive stereo frames of real, foggy road scenes under a variety of visibility conditions, totalling over 40 minutes and 34k frames. As a first-of-its-kind, SDIRF contains the camera’s photometric parameters calibrated in a lab environment, which is a prerequisite for correctly applying the atmospheric scattering model to foggy images. The dataset also includes the counterpart clear data of the same routes recorded in overcast weather, which is useful for companion work in image defogging and depth reconstruction. We conducted extensive experiments using both synthetic foggy data and real foggy sequences from SDIRF to demonstrate the superiority of the proposed algorithm over prior methods. Our method not only produces the most accurate estimates on synthetic data, but also adapts better to real fog. We make our code and SDIRF publicly available\footnote{https://github.com/SenseRoboticsLab/estimating-fog-parameters} to the community with the aim of advancing the research on visual perception in fog.

[101] V$^{2}$-SAM: Marrying SAM2 with Multi-Prompt Experts for Cross-View Object Correspondence

Jiancheng Pan, Runze Wang, Tianwen Qian, Mohammad Mahdi, Yanwei Fu, Xiangyang Xue, Xiaomeng Huang, Luc Van Gool, Danda Pani Paudel, Yuqian Fu

Main category: cs.CV

TL;DR: V^2-SAM adapts SAM2 for cross-view object correspondence using two complementary prompt generators and a multi-expert selection mechanism, achieving state-of-the-art performance on multiple benchmarks.

Details

Motivation: Existing segmentation models like SAM2 struggle with cross-view object correspondence due to drastic viewpoint and appearance variations between ego-centric and exo-centric views.

Method: Proposes V^2-SAM with Cross-View Anchor Prompt Generator (geometry-aware) and Cross-View Visual Prompt Generator (appearance-guided), plus Post-hoc Cyclic Consistency Selector for adaptive expert selection.

Result: Achieves state-of-the-art performance on Ego-Exo4D, DAVIS-2017, and HANDAL-X benchmarks for cross-view object correspondence tasks.

Conclusion: V^2-SAM successfully bridges the gap between single-view segmentation and cross-view correspondence through complementary prompt generation and adaptive selection mechanisms.

Abstract: Cross-view object correspondence, exemplified by the representative task of ego-exo object correspondence, aims to establish consistent associations of the same object across different viewpoints (e.g., ego-centric and exo-centric). This task poses significant challenges due to drastic viewpoint and appearance variations, making existing segmentation models, such as SAM2, non-trivial to apply directly. To address this, we present V^2-SAM, a unified cross-view object correspondence framework that adapts SAM2 from single-view segmentation to cross-view correspondence through two complementary prompt generators. Specifically, the Cross-View Anchor Prompt Generator (V^2-Anchor), built upon DINOv3 features, establishes geometry-aware correspondences and, for the first time, unlocks coordinate-based prompting for SAM2 in cross-view scenarios, while the Cross-View Visual Prompt Generator (V^2-Visual) enhances appearance-guided cues via a novel visual prompt matcher that aligns ego-exo representations from both feature and structural perspectives. To effectively exploit the strengths of both prompts, we further adopt a multi-expert design and introduce a Post-hoc Cyclic Consistency Selector (PCCS) that adaptively selects the most reliable expert based on cyclic consistency. Extensive experiments validate the effectiveness of V^2-SAM, achieving new state-of-the-art performance on Ego-Exo4D (ego-exo object correspondence), DAVIS-2017 (video object tracking), and HANDAL-X (robotic-ready cross-view correspondence).

[102] Test-Time Alignment of Text-to-Image Diffusion Models via Null-Text Embedding Optimisation

Taehoon Kim, Henry Gouk, Timothy Hospedales

Main category: cs.CV

TL;DR: Null-TTA aligns diffusion models by optimizing the unconditional embedding in classifier-free guidance, preventing reward hacking while achieving state-of-the-art test-time alignment and cross-reward generalization.

Details

Motivation: Existing test-time alignment methods tend to either under-optimize or over-optimize (reward hack) the target reward function, exploiting non-semantic noise patterns rather than achieving meaningful semantic alignment.

Method: Optimize the unconditional embedding in classifier-free guidance rather than manipulating latent or noise variables, leveraging the structured semantic nature of the text embedding space to ensure alignment occurs on a semantically coherent manifold.

Result: Null-TTA achieves state-of-the-art target test-time alignment while maintaining strong cross-reward generalization, directly steering the model’s generative distribution towards the target reward without updating model parameters.

Conclusion: Semantic-space optimization through unconditional embedding manipulation establishes an effective and principled paradigm for test-time alignment that prevents reward hacking while maintaining model performance.

Abstract: Test-time alignment (TTA) aims to adapt models to specific rewards during inference. However, existing methods tend to either under-optimise or over-optimise (reward hack) the target reward function. We propose Null-Text Test-Time Alignment (Null-TTA), which aligns diffusion models by optimising the unconditional embedding in classifier-free guidance, rather than manipulating latent or noise variables. Due to the structured semantic nature of the text embedding space, this ensures alignment occurs on a semantically coherent manifold and prevents reward hacking (exploiting non-semantic noise patterns to improve the reward). Since the unconditional embedding in classifier-free guidance serves as the anchor for the model’s generative distribution, Null-TTA directly steers model’s generative distribution towards the target reward rather than just adjusting the samples, even without updating model parameters. Thanks to these desirable properties, we show that Null-TTA achieves state-of-the-art target test-time alignment while maintaining strong cross-reward generalisation. This establishes semantic-space optimisation as an effective and principled novel paradigm for TTA.

[103] GaINeR: Geometry-Aware Implicit Network Representation

Weronika Jakubowska, Mikołaj Zieliński, Rafał Tobiasz, Krzysztof Byrski, Maciej Zięba, Dominik Belter, Przemysław Spurek

Main category: cs.CV

TL;DR: GaINeR is a novel geometry-aware implicit neural representation for 2D images that combines trainable Gaussian distributions with neural networks to enable continuous representation, interpretable structure, and local editing capabilities.

Details

Motivation: Traditional implicit neural representations (INRs) lack explicit geometric structure and have limited local editing capabilities, restricting their use in dynamic or interactive settings.

Method: Combines trainable Gaussian distributions with neural network-based INR. For each image coordinate, retrieves K nearest Gaussians, aggregates distance-weighted embeddings, and predicts RGB values via neural network.

Result: Enables continuous image representation with interpretable geometric structure and flexible local editing capabilities.

Conclusion: GaINeR provides a foundation for physically aware and interactive image manipulation by addressing limitations of traditional INRs through geometry-aware design.

Abstract: Implicit Neural Representations (INRs) have become an essential tool for modeling continuous 2D images, enabling high-fidelity reconstruction, super-resolution, and compression. Popular architectures such as SIREN, WIRE, and FINER demonstrate the potential of INR for capturing fine-grained image details. However, traditional INRs often lack explicit geometric structure and have limited capabilities for local editing or integration with physical simulation, restricting their applicability in dynamic or interactive settings. To address these limitations, we propose GaINeR: Geometry-Aware Implicit Network Representation, a novel framework for 2D images that combines trainable Gaussian distributions with a neural network-based INR. For a given image coordinate, the model retrieves the K nearest Gaussians, aggregates distance-weighted embeddings, and predicts the RGB value via a neural network. This design enables continuous image representation, interpretable geometric structure, and flexible local editing, providing a foundation for physically aware and interactive image manipulation. The official implementation of our method is publicly available at https://github.com/WJakubowska/GaINeR.

[104] A deep learning model to reduce agent dose for contrast-enhanced MRI of the cerebellopontine angle cistern

Yunjie Chen, Rianne A. Weber, Olaf M. Neve, Stephan R. Romeijn, Erik F. Hensen, Jelmer M. Wolterink, Qian Tao, Marius Staring, Berit M. Verbist

Main category: cs.CV

TL;DR: Deep learning model successfully restores standard-dose MRI quality from low-dose (10-30%) contrast-enhanced T1-weighted images of cerebellopontine angle cistern, enabling accurate vestibular schwannoma segmentation and diagnosis with significantly reduced contrast agent.

Details

Motivation: To reduce contrast agent dose in MRI examinations while maintaining diagnostic image quality, particularly for cerebellopontine angle cistern imaging in vestibular schwannoma patients.

Method: Multi-center retrospective study using T1 and contrast-enhanced T1-weighted MRI to simulate low-dose images. Deep learning models were trained to restore standard-dose images from low-dose simulations, with evaluation of image quality metrics and radiologist assessment.

Result: DL restoration significantly improved image quality metrics (SSIM from 0.639 to 0.993, PSNR from 21.6 to 41.4 dB) and segmentation performance (Dice from 0.673 to 0.734). Radiologists rated DL-restored images from 10-30% input doses as excellent and diagnostically informative.

Conclusion: Deep learning enables high-quality cerebellopontine angle MRI with only 10-30% of standard contrast agent dose, maintaining lesion detection and diagnostic characterization capabilities.

Abstract: Objectives: To evaluate a deep learning (DL) model for reducing the agent dose of contrast-enhanced T1-weighted MRI (T1ce) of the cerebellopontine angle (CPA) cistern. Materials and methods: In this multi-center retrospective study, T1 and T1ce of vestibular schwannoma (VS) patients were used to simulate low-dose T1ce with varying reductions of contrast agent dose. DL models were trained to restore standard-dose T1ce from the low-dose simulation. The image quality and segmentation performance of the DL-restored T1ce were evaluated. A head and neck radiologist was asked to rate DL-restored images in multiple aspects, including image quality and diagnostic characterization. Results: 203 MRI studies from 72 VS patients (mean age, 58.51 \pm 14.73, 39 men) were evaluated. As the input dose increased, the structural similarity index measure of the restored T1ce increased from 0.639 \pm 0.113 to 0.993 \pm 0.009, and the peak signal-to-noise ratio increased from 21.6 \pm 3.73 dB to 41.4 \pm 4.84 dB. At 10% input dose, using DL-restored T1ce for segmentation improved the Dice from 0.673 to 0.734, the 95% Hausdorff distance from 2.38 mm to 2.07 mm, and the average surface distance from 1.00 mm to 0.59 mm. Both DL-restored T1ce from 10% and 30% input doses showed excellent images, with the latter being considered more informative. Conclusion: The DL model improved the image quality of low-dose MRI of the CPA cistern, which makes lesion detection and diagnostic characterization possible with 10% - 30% of the standard dose.

[105] Smooth regularization for efficient video recognition

Gil Goldman, Raja Giryes, Mahadev Satyanarayanan

Main category: cs.CV

TL;DR: A smooth regularization technique using Gaussian Random Walk to enforce temporal coherence in video recognition models, improving accuracy of lightweight architectures by 3.8-6.4% on Kinetics-600.

Details

Motivation: To instill strong temporal inductive bias in video recognition models, particularly benefiting lightweight architectures by better capturing natural temporal coherence in videos.

Method: Encourages smoothness in intermediate-layer embeddings of consecutive frames by modeling their changes as a Gaussian Random Walk (GRW), penalizing abrupt representational shifts and promoting low-acceleration solutions.

Result: Lightweight models achieve 3.8% to 6.4% accuracy improvement on Kinetics-600. MoViNets improve state-of-the-art by 3.8-6.1% within FLOP constraints, while MobileNetV3 and MoViNets-Stream gain 4.9-6.4% over prior models with comparable memory footprints.

Conclusion: The proposed smooth regularization effectively enhances temporal modeling in lightweight video recognition models, achieving significant accuracy improvements while maintaining computational efficiency.

Abstract: We propose a smooth regularization technique that instills a strong temporal inductive bias in video recognition models, particularly benefiting lightweight architectures. Our method encourages smoothness in the intermediate-layer embeddings of consecutive frames by modeling their changes as a Gaussian Random Walk (GRW). This penalizes abrupt representational shifts, thereby promoting low-acceleration solutions that better align with the natural temporal coherence inherent in videos. By leveraging this enforced smoothness, lightweight models can more effectively capture complex temporal dynamics. Applied to such models, our technique yields a 3.8% to 6.4% accuracy improvement on Kinetics-600. Notably, the MoViNets model family trained with our smooth regularization improves the current state of the art by 3.8% to 6.1% within their respective FLOP constraints, while MobileNetV3 and the MoViNets-Stream family achieve gains of 4.9% to 6.4% over prior state-of-the-art models with comparable memory footprints. Our code and models are available at https://github.com/gilgoldm/grw-smoothing.

[106] Open Vocabulary Compositional Explanations for Neuron Alignment

Biagio La Rosa, Leilani H. Gilpin

Main category: cs.CV

TL;DR: A framework for open vocabulary compositional explanations in vision that uses semantic segmentation masks instead of human-annotated data to probe neurons for arbitrary concepts.

Details

Motivation: Current compositional explanations rely on human-annotated datasets, limiting their applicability to specific domains and predefined concepts. This paper aims to overcome this limitation.

Method: Three-step framework: 1) specify arbitrary concepts, 2) generate semantic segmentation masks using open vocabulary models, 3) derive compositional explanations from these masks.

Result: The framework enables probing neurons for arbitrary concepts and datasets, provides more flexible explanations, and shows differences when shifting from human-annotated to model-annotated data.

Conclusion: The proposed open vocabulary framework overcomes limitations of human-annotated data, enabling more flexible and broader application of compositional explanations in vision tasks.

Abstract: Neurons are the fundamental building blocks of deep neural networks, and their interconnections allow AI to achieve unprecedented results. Motivated by the goal of understanding how neurons encode information, compositional explanations leverage logical relationships between concepts to express the spatial alignment between neuron activations and human knowledge. However, these explanations rely on human-annotated datasets, restricting their applicability to specific domains and predefined concepts. This paper addresses this limitation by introducing a framework for the vision domain that allows users to probe neurons for arbitrary concepts and datasets. Specifically, the framework leverages masks generated by open vocabulary semantic segmentation to compute open vocabulary compositional explanations. The proposed framework consists of three steps: specifying arbitrary concepts, generating semantic segmentation masks using open vocabulary models, and deriving compositional explanations from these masks. The paper compares the proposed framework with previous methods for computing compositional explanations both in terms of quantitative metrics and human interpretability, analyzes the differences in explanations when shifting from human-annotated data to model-annotated data, and showcases the additional capabilities provided by the framework in terms of flexibility of the explanations with respect to the tasks and properties of interest.

[107] UruDendro4: A Benchmark Dataset for Automatic Tree-Ring Detection in Cross-Section Images of Pinus taeda L

Henry Marichal, Joaquin Blanco, Diego Passarella, Gregory Randall

Main category: cs.CV

TL;DR: The paper introduces UruDendro4, a dataset of 102 Pinus taeda L. cross-section images with manual ring annotations, collected at multiple stem heights for volumetric growth modeling. It provides baseline performance using DeepCS-TRD method and shows improved generalization when included in training.

Details

Motivation: Manual tree-ring measurement is time-consuming and imprecise, and there's scarcity of wood cross-section data for automated ring detection algorithms.

Method: Created UruDendro4 dataset with 102 manually annotated Pinus taeda L. samples from multiple stem heights, and evaluated state-of-the-art ring detection methods including DeepCS-TRD with ablation experiments.

Result: DeepCS-TRD achieved best performance with 0.838 mAP, 0.782 mAR, and 0.084 ARE. Training with this dataset improved model generalization for tree-ring detection.

Conclusion: UruDendro4 enables volumetric wood growth modeling and provides a valuable resource for developing automated tree-ring detection algorithms with improved generalization capabilities.

Abstract: Tree-ring growth represents the annual wood increment for a tree, and quantifying it allows researchers to assess which silvicultural practices are best suited for each species. Manual measurement of this growth is time-consuming and often imprecise, as it is typically performed along 4 to 8 radial directions on a cross-sectional disc. In recent years, automated algorithms and datasets have emerged to enhance accuracy and automate the delineation of annual rings in cross-sectional images. To address the scarcity of wood cross-section data, we introduce the UruDendro4 dataset, a collection of 102 image samples of Pinus taeda L., each manually annotated with annual growth rings. Unlike existing public datasets, UruDendro4 includes samples extracted at multiple heights along the stem, allowing for the volumetric modeling of annual growth using manually delineated rings. This dataset (images and annotations) allows the development of volumetric models for annual wood estimation based on cross-sectional imagery. Additionally, we provide a performance baseline for automatic ring detection on this dataset using state-of-the-art methods. The highest performance was achieved by the DeepCS-TRD method, with a mean Average Precision of 0.838, a mean Average Recall of 0.782, and an Adapted Rand Error score of 0.084. A series of ablation experiments were conducted to empirically validate the final parameter configuration. Furthermore, we empirically demonstrate that training a learning model including this dataset improves the model’s generalization in the tree-ring detection task.

[108] BUSTR: Breast Ultrasound Text Reporting with a Descriptor-Aware Vision-Language Model

Rawa Mohammed, Mina Attin, Bryar Shareef

Main category: cs.CV

TL;DR: BUSTR is a multitask vision-language framework that generates breast ultrasound reports without paired image-report data by using structured descriptors and radiomics features, achieving improved clinical efficacy and report quality.

Details

Motivation: Automated radiology report generation for breast ultrasound is limited by the lack of paired image-report datasets and the risk of hallucinations from large language models.

Method: BUSTR constructs reports from structured descriptors (BI-RADS, pathology, histology) and radiomics features, uses a multi-head Swin encoder with multitask loss for descriptor-aware visual representations, and aligns visual/textual tokens via dual-level objective combining token-level cross-entropy with cosine-similarity alignment loss.

Result: BUSTR consistently improves standard natural language generation metrics and clinical efficacy metrics across two public BUS datasets (BrEaST and BUS-BRA), particularly for key targets like BI-RADS category and pathology.

Conclusion: The descriptor-aware vision model trained with combined token-level and alignment loss improves both automatic report metrics and clinical efficacy without requiring paired image-report data.

Abstract: Automated radiology report generation (RRG) for breast ultrasound (BUS) is limited by the lack of paired image-report datasets and the risk of hallucinations from large language models. We propose BUSTR, a multitask vision-language framework that generates BUS reports without requiring paired image-report supervision. BUSTR constructs reports from structured descriptors (e.g., BI-RADS, pathology, histology) and radiomics features, learns descriptor-aware visual representations with a multi-head Swin encoder trained using a multitask loss over dataset-specific descriptor sets, and aligns visual and textual tokens via a dual-level objective that combines token-level cross-entropy with a cosine-similarity alignment loss between input and output representations. We evaluate BUSTR on two public BUS datasets, BrEaST and BUS-BRA, which differ in size and available descriptors. Across both datasets, BUSTR consistently improves standard natural language generation metrics and clinical efficacy metrics, particularly for key targets such as BI-RADS category and pathology. Our results show that this descriptor-aware vision model, trained with a combined token-level and alignment loss, improves both automatic report metrics and clinical efficacy without requiring paired image-report data. The source code can be found at https://github.com/AAR-UNLV/BUSTR

[109] Beyond Realism: Learning the Art of Expressive Composition with StickerNet

Haoming Lu, David Kocharian, Humphrey Shi

Main category: cs.CV

TL;DR: StickerNet is a two-stage framework for expressive image composition that learns from real-world user editing patterns to predict placement parameters like opacity, mask, location, and scale, prioritizing artistic expression over realism.

Details

Motivation: Traditional image composition research focuses on visual realism, but modern content creation often aims for artistic, playful, or socially engaging compositions that don't preserve realism, reflecting how users edit images on creative platforms.

Method: Two-stage framework: first determines composition type, then predicts placement parameters (opacity, mask, location, scale). Dataset built from 1.8 million real user editing actions from an online platform, ensuring alignment with authentic editing behavior.

Result: Outperforms common baselines and closely matches human placement behavior in user studies and quantitative evaluations, demonstrating effectiveness despite task ambiguity.

Conclusion: Introduces a new direction in visual understanding that emphasizes expressiveness and user intent over realism, showing the value of learning from real-world editing patterns.

Abstract: As a widely used operation in image editing workflows, image composition has traditionally been studied with a focus on achieving visual realism and semantic plausibility. However, in practical editing scenarios of the modern content creation landscape, many compositions are not intended to preserve realism. Instead, users of online platforms motivated by gaining community recognition often aim to create content that is more artistic, playful, or socially engaging. Taking inspiration from this observation, we define the expressive composition task, a new formulation of image composition that embraces stylistic diversity and looser placement logic, reflecting how users edit images on real-world creative platforms. To address this underexplored problem, we present StickerNet, a two-stage framework that first determines the composition type, then predicts placement parameters such as opacity, mask, location, and scale accordingly. Unlike prior work that constructs datasets by simulating object placements on real images, we directly build our dataset from 1.8 million editing actions collected on an anonymous online visual creation and editing platform, each reflecting user-community validated placement decisions. This grounding in authentic editing behavior ensures strong alignment between task definition and training supervision. User studies and quantitative evaluations show that StickerNet outperforms common baselines and closely matches human placement behavior, demonstrating the effectiveness of learning from real-world editing patterns despite the inherent ambiguity of the task. This work introduces a new direction in visual understanding that emphasizes expressiveness and user intent over realism.

[110] TrafficLens: Multi-Camera Traffic Video Analysis Using LLMs

Md Adnan Arefeen, Biplob Debnath, Srimat Chakradhar

Main category: cs.CV

TL;DR: TrafficLens is a multi-camera traffic analysis system that uses sequential VLM processing with overlapping camera coverage and object-level similarity detection to reduce video-to-text conversion time by 4x while maintaining accuracy.

Details

Motivation: Current methods for analyzing multi-camera traffic videos using LLMs require converting video to text via VLMs, which is time-consuming and delays timely insights for traffic management and incident investigation.

Method: Uses sequential approach with overlapping camera coverage areas, iteratively applying VLMs with varying token limits using previous outputs as prompts, and includes object-level similarity detector to bypass redundant VLM invocations.

Result: Experimental results show TrafficLens reduces video-to-text conversion time by up to 4x while maintaining information accuracy on real-world datasets.

Conclusion: TrafficLens provides an efficient solution for multi-camera traffic intersection analysis by optimizing VLM usage and maintaining timely processing for traffic management applications.

Abstract: Traffic cameras are essential in urban areas, playing a crucial role in intelligent transportation systems. Multiple cameras at intersections enhance law enforcement capabilities, traffic management, and pedestrian safety. However, efficiently managing and analyzing multi-camera feeds poses challenges due to the vast amount of data. Analyzing such huge video data requires advanced analytical tools. While Large Language Models (LLMs) like ChatGPT, equipped with retrieval-augmented generation (RAG) systems, excel in text-based tasks, integrating them into traffic video analysis demands converting video data into text using a Vision-Language Model (VLM), which is time-consuming and delays the timely utilization of traffic videos for generating insights and investigating incidents. To address these challenges, we propose TrafficLens, a tailored algorithm for multi-camera traffic intersections. TrafficLens employs a sequential approach, utilizing overlapping coverage areas of cameras. It iteratively applies VLMs with varying token limits, using previous outputs as prompts for subsequent cameras, enabling rapid generation of detailed textual descriptions while reducing processing time. Additionally, TrafficLens intelligently bypasses redundant VLM invocations through an object-level similarity detector. Experimental results with real-world datasets demonstrate that TrafficLens reduces video-to-text conversion time by up to $4\times$ while maintaining information accuracy.

[111] Privacy-Preserving Federated Vision Transformer Learning Leveraging Lightweight Homomorphic Encryption in Medical AI

Al Amin, Kamrul Hasan, Liang Hong, Sharif Ullah

Main category: cs.CV

TL;DR: A privacy-preserving federated learning framework combining Vision Transformers with homomorphic encryption for secure multi-institutional histopathology classification, achieving significant communication reduction while maintaining strong privacy guarantees.

Details

Motivation: Healthcare institutions need to collaborate for improved diagnostic accuracy but cannot share patient data due to privacy regulations like HIPAA. Conventional federated learning remains vulnerable to gradient-based reconstruction attacks that can expose sensitive medical information.

Method: Combines Vision Transformers (ViT) with homomorphic encryption (HE), using ViT CLS tokens as compact feature representations that are encrypted using CKKS homomorphic encryption before transmission to the server for secure aggregation.

Result: Encrypting CLS tokens achieves 30-fold communication reduction compared to gradient encryption. Prevents model inversion attacks (vs vulnerable gradients with PSNR: 52.26 dB, SSIM: 0.999, NMI: 0.741). Achieves 96.12% global classification accuracy (unencrypted) and 90.02% (encrypted) with only 326 KB encrypted data per aggregation round.

Conclusion: The proposed framework provides strong privacy protection against reconstruction attacks while enabling efficient encrypted inference, making it suitable for secure collaborative machine learning in healthcare settings with strict privacy requirements.

Abstract: Collaborative machine learning across healthcare institutions promises improved diagnostic accuracy by leveraging diverse datasets, yet privacy regulations such as HIPAA prohibit direct patient data sharing. While federated learning (FL) enables decentralized training without raw data exchange, recent studies show that model gradients in conventional FL remain vulnerable to reconstruction attacks, potentially exposing sensitive medical information. This paper presents a privacy-preserving federated learning framework combining Vision Transformers (ViT) with homomorphic encryption (HE) for secure multi-institutional histopathology classification. The approach leverages the ViT CLS token as a compact 768-dimensional feature representation for secure aggregation, encrypting these tokens using CKKS homomorphic encryption before transmission to the server. We demonstrate that encrypting CLS tokens achieves a 30-fold communication reduction compared to gradient encryption while maintaining strong privacy guarantees. Through evaluation on a three-client federated setup for lung cancer histopathology classification, we show that gradients are highly susceptible to model inversion attacks (PSNR: 52.26 dB, SSIM: 0.999, NMI: 0.741), enabling near-perfect image reconstruction. In contrast, the proposed CLS-protected HE approach prevents such attacks while enabling encrypted inference directly on ciphertexts, requiring only 326 KB of encrypted data transmission per aggregation round. The framework achieves 96.12 percent global classification accuracy in the unencrypted domain and 90.02 percent in the encrypted domain.

[112] Inversion-Free Style Transfer with Dual Rectified Flows

Yingying Deng, Xiangyu He, Fan Tang, Weiming Dong, Xucheng Yin

Main category: cs.CV

TL;DR: An inversion-free style transfer framework using dual rectified flows that fuses content and style trajectories through dynamic midpoint interpolation, eliminating the need for computationally expensive inversion processes.

Details

Motivation: To overcome the limitations of existing diffusion-based style transfer methods that rely on computationally intensive inversion processes, which compromise efficiency and introduce visual distortions when inversion is inaccurate.

Method: Proposes an inversion-free framework based on dual rectified flows that predicts content and style trajectories in parallel, then fuses them through dynamic midpoint interpolation with velocity field design and attention injection for style integration.

Result: Extensive experiments demonstrate generalization across diverse styles and content, providing effective and efficient style transfer with improved visual fidelity and content preservation.

Conclusion: The proposed inversion-free framework achieves robust style fusion, avoids shortcomings of naive overlays, and provides an efficient pipeline for style transfer using only forward passes.

Abstract: Style transfer, a pivotal task in image processing, synthesizes visually compelling images by seamlessly blending realistic content with artistic styles, enabling applications in photo editing and creative design. While mainstream training-free diffusion-based methods have greatly advanced style transfer in recent years, their reliance on computationally inversion processes compromises efficiency and introduces visual distortions when inversion is inaccurate. To address these limitations, we propose a novel \textit{inversion-free} style transfer framework based on dual rectified flows, which tackles the challenge of finding an unknown stylized distribution from two distinct inputs (content and style images), \textit{only with forward pass}. Our approach predicts content and style trajectories in parallel, then fuses them through a dynamic midpoint interpolation that integrates velocities from both paths while adapting to the evolving stylized image. By jointly modeling the content, style, and stylized distributions, our velocity field design achieves robust fusion and avoids the shortcomings of naive overlays. Attention injection further guides style integration, enhancing visual fidelity, content preservation, and computational efficiency. Extensive experiments demonstrate generalization across diverse styles and content, providing an effective and efficient pipeline for style transfer.

[113] RefOnce: Distilling References into a Prototype Memory for Referring Camouflaged Object Detection

Yu-Huan Wu, Zi-Xuan Zhu, Yan Wang, Liangli Zhen, Deng-Ping Fan

Main category: cs.CV

TL;DR: A Ref-COD framework that distills references into class-prototype memory during training and synthesizes reference vectors at inference without needing reference images at test time.

Details

Motivation: Current Ref-COD systems require reference images at test time, which limits deployability, adds latency, and increases data-collection burden.

Method: Maintain EMA-updated prototypes per category, predict mixture weights from query to produce guidance vectors, and use bidirectional attention alignment to bridge representation gaps between reference statistics and camouflaged query features.

Result: Achieves competitive or superior performance on R2C7K benchmark compared to state-of-the-art methods while eliminating test-time reference requirements.

Conclusion: Proposed approach provides a simple, efficient path to Ref-COD without mandatory references at inference time.

Abstract: Referring Camouflaged Object Detection (Ref-COD) segments specified camouflaged objects in a scene by leveraging a small set of referring images. Though effective, current systems adopt a dual-branch design that requires reference images at test time, which limits deployability and adds latency and data-collection burden. We introduce a Ref-COD framework that distills references into a class-prototype memory during training and synthesizes a reference vector at inference via a query-conditioned mixture of prototypes. Concretely, we maintain an EMA-updated prototype per category and predict mixture weights from the query to produce a guidance vector without any test-time references. To bridge the representation gap between reference statistics and camouflaged query features, we propose a bidirectional attention alignment module that adapts both the query features and the class representation. Thus, our approach yields a simple, efficient path to Ref-COD without mandatory references. We evaluate the proposed method on the large-scale R2C7K benchmark. Extensive experiments demonstrate competitive or superior performance of the proposed method compared with recent state-of-the-arts. Code is available at https://github.com/yuhuan-wu/RefOnce.

[114] Wavefront-Constrained Passive Obscured Object Detection

Zhiwen Zheng, Yiwei Ouyang, Zhao Huang, Tao Zhang, Xiaoshuai Zhang, Huiyu Zhou, Wenwen Tang, Shaowei Jiang, Jin Liu, Xingru Huang

Main category: cs.CV

TL;DR: WavePCNet: A physics-driven network for localizing and segmenting obscured objects using wavefront propagation simulation with complex amplitude constraints and perturbation suppression.

Details

Motivation: Existing methods fail to capture coherent light propagation physics and produce non-physical solutions under low signal-to-noise conditions, compromising observation stability and reliability.

Method: Proposes WavePCNet with Tri-Phase Wavefront Complex-Propagation Reprojection for precise coherent propagation constraints, momentum memory mechanism for perturbation suppression, and High-frequency Cross-layer Compensation Enhancement for multi-scale frequency modeling.

Result: Outperforms state-of-the-art methods on four physically collected datasets in both accuracy and robustness.

Conclusion: WavePCNet effectively addresses the challenges of obscured object localization and segmentation through physics-driven wavefront propagation simulation and perturbation compensation.

Abstract: Accurately localizing and segmenting obscured objects from faint light patterns beyond the field of view is highly challenging due to multiple scattering and medium-induced perturbations. Most existing methods, based on real-valued modeling or local convolutional operations, are inadequate for capturing the underlying physics of coherent light propagation. Moreover, under low signal-to-noise conditions, these methods often converge to non-physical solutions, severely compromising the stability and reliability of the observation. To address these challenges, we propose a novel physics-driven Wavefront Propagating Compensation Network (WavePCNet) to simulate wavefront propagation and enhance the perception of obscured objects. This WavePCNet integrates the Tri-Phase Wavefront Complex-Propagation Reprojection (TriWCP) to incorporate complex amplitude transfer operators to precisely constrain coherent propagation behavior, along with a momentum memory mechanism to effectively suppress the accumulation of perturbations. Additionally, a High-frequency Cross-layer Compensation Enhancement is introduced to construct frequency-selective pathways with multi-scale receptive fields and dynamically model structural consistency across layers, further boosting the model’s robustness and interpretability under complex environmental conditions. Extensive experiments conducted on four physically collected datasets demonstrate that WavePCNet consistently outperforms state-of-the-art methods across both accuracy and robustness.

[115] GuardTrace-VL: Detecting Unsafe Multimodel Reasoning via Iterative Safety Supervision

Yuxiao Xiang, Junchi Chen, Zhenchao Jin, Changtao Miao, Haojie Yuan, Qi Chu, Tao Gong, Nenghai Yu

Main category: cs.CV

TL;DR: GuardTrace-VL is a vision-aware safety auditor that detects unsafe content in multimodal reasoning traces, addressing gaps in existing safety methods that only evaluate final answers.

Details

Motivation: Existing multimodal safety guards overlook unsafe content in intermediate reasoning traces, allowing harmful content like biased inferences to go undetected even when final answers appear safe.

Method: Introduces GuardTrace-VL with joint image-text analysis of the full Question-Thinking-Answer pipeline, uses a curated GuardTrace dataset generated through diverse prompting and human verification, and employs a three-stage progressive training scheme.

Result: Achieves 93.1% F1 score on unsafe reasoning detection, representing a 13.5% improvement over previous multimodal safety methods on both in-domain and out-of-domain test scenarios.

Conclusion: GuardTrace-VL effectively monitors multimodal reasoning processes to detect emerging unsafe content, significantly improving safety detection capabilities in vision-language models.

Abstract: Multimodal large reasoning models (MLRMs) are increasingly deployed for vision-language tasks that produce explicit intermediate rationales. However, reasoning traces can contain unsafe content even when the final answer is non-harmful, creating deployment risks. Existing multimodal safety guards primarily evaluate only the input question and the final answer, neglecting the intermediate reasoning process. This oversight allows undetected harm, such as biased inferences or policy-violating use of visual context, to emerge during reasoning. We introduce GuardTrace-VL, a vision-aware safety auditor that monitors the full Question-Thinking-Answer (QTA) pipeline via joint image-text analysis, enabling detection of unsafe content as it emerges in the reasoning stage. To support training and evaluation, we construct the GuardTrace dataset, which is generated through diverse prompting strategies and refined via a MLRM- and human-based voting and verification pipeline. Furthermore, we propose a three-stage progressive training scheme combined with the data refinement process, enabling the model to learn nuanced and context-dependent safety preferences according to different risk levels. On our proposed test set covering both in-domain and out-of-domain scenarios, GuardTrace-VL model achieves an F1 score of 93.1% on unsafe reasoning detection tasks, representing a 13.5% improvement in F1 score compared to the previous strongest multimodal safety defense methods. The codes will be made publicly available.

[116] From Inpainting to Layer Decomposition: Repurposing Generative Inpainting Models for Image Layer Decomposition

Jingxi Chen, Yixiao Zhang, Xiaoye Qian, Zongxia Li, Cornelia Fermuller, Caren Chen, Yiannis Aloimonos

Main category: cs.CV

TL;DR: A diffusion-based inpainting model is adapted for layer decomposition using lightweight finetuning and a multi-modal context fusion module, achieving superior object removal and occlusion recovery.

Details

Motivation: Images can be viewed as layered compositions enabling independent editing, but decomposing single images into layers remains challenging due to limited methods and data.

Method: Adapt diffusion-based inpainting model for layer decomposition with lightweight finetuning, introduce multi-modal context fusion module with linear attention complexity, train on synthetic dataset from open-source assets.

Result: Achieves superior performance in object removal and occlusion recovery, enabling new possibilities in downstream editing and creative applications.

Conclusion: The proposed approach successfully addresses layer decomposition challenges and unlocks new editing capabilities through the connection between layer decomposition and in/outpainting tasks.

Abstract: Images can be viewed as layered compositions, foreground objects over background, with potential occlusions. This layered representation enables independent editing of elements, offering greater flexibility for content creation. Despite the progress in large generative models, decomposing a single image into layers remains challenging due to limited methods and data. We observe a strong connection between layer decomposition and in/outpainting tasks, and propose adapting a diffusion-based inpainting model for layer decomposition using lightweight finetuning. To further preserve detail in the latent space, we introduce a novel multi-modal context fusion module with linear attention complexity. Our model is trained purely on a synthetic dataset constructed from open-source assets and achieves superior performance in object removal and occlusion recovery, unlocking new possibilities in downstream editing and creative applications.

[117] Knowledge Completes the Vision: A Multimodal Entity-aware Retrieval-Augmented Generation Framework for News Image Captioning

Xiaoxing You, Qiang Huang, Lingyu Li, Chi Zhang, Xiaopeng Liu, Min Zhang, Jun Yu

Main category: cs.CV

TL;DR: MERGE is a multimodal entity-aware retrieval-augmented generation framework that addresses key challenges in news image captioning through enriched knowledge retrieval, improved cross-modal alignment, and enhanced visual-entity grounding.

Details

Motivation: Existing news image captioning methods struggle with incomplete information coverage, weak cross-modal alignment, and suboptimal visual-entity grounding, limiting their journalistic informativeness.

Method: MERGE constructs an entity-centric multimodal knowledge base (EMKB) integrating textual, visual, and structured knowledge. It uses a multistage hypothesis-caption strategy for cross-modal alignment and dynamic retrieval guided by image content for visual-entity matching.

Result: Significant improvements on GoodNews (+6.84 CIDEr, +4.14 F1) and NYTimes800k (+1.16 CIDEr, +2.64 F1) datasets. Strong generalization on unseen Visual News dataset (+20.17 CIDEr, +6.22 F1), demonstrating robustness and domain adaptability.

Conclusion: MERGE effectively addresses key challenges in news image captioning through its multimodal entity-aware retrieval-augmented approach, achieving state-of-the-art performance and strong generalization capabilities across different news datasets.

Abstract: News image captioning aims to produce journalistically informative descriptions by combining visual content with contextual cues from associated articles. Despite recent advances, existing methods struggle with three key challenges: (1) incomplete information coverage, (2) weak cross-modal alignment, and (3) suboptimal visual-entity grounding. To address these issues, we introduce MERGE, the first Multimodal Entity-aware Retrieval-augmented GEneration framework for news image captioning. MERGE constructs an entity-centric multimodal knowledge base (EMKB) that integrates textual, visual, and structured knowledge, enabling enriched background retrieval. It improves cross-modal alignment through a multistage hypothesis-caption strategy and enhances visual-entity matching via dynamic retrieval guided by image content. Extensive experiments on GoodNews and NYTimes800k show that MERGE significantly outperforms state-of-the-art baselines, with CIDEr gains of +6.84 and +1.16 in caption quality, and F1-score improvements of +4.14 and +2.64 in named entity recognition. Notably, MERGE also generalizes well to the unseen Visual News dataset, achieving +20.17 in CIDEr and +6.22 in F1-score, demonstrating strong robustness and domain adaptability.

[118] MetaRank: Task-Aware Metric Selection for Model Transferability Estimation

Yuhang Liu, Wenjie Zhao, Yunhui Guo

Main category: cs.CV

TL;DR: MetaRank is a meta-learning framework that automatically selects the most appropriate Model Transferability Estimation (MTE) metric for transfer learning tasks using dataset and metric descriptions in a shared semantic space.

Details

Motivation: Current MTE metric selection is ad hoc or based on average historical performance, but no single metric works optimally across all target datasets. The effectiveness of MTE metrics is highly task-dependent.

Method: Formulates metric selection as learning-to-rank problem. Uses pretrained language model to encode textual descriptions of datasets and metrics into shared semantic space. Trains meta-predictor offline on diverse meta-tasks with listwise objective to prioritize top-performing metrics.

Result: Extensive experiments across 11 pretrained models and 11 target datasets demonstrate strong effectiveness of MetaRank in selecting appropriate MTE metrics for new datasets.

Conclusion: MetaRank provides an efficient, task-aware approach for automatic MTE metric selection that outperforms ad hoc selection methods by leveraging semantic understanding of datasets and metrics.

Abstract: Selecting an appropriate pre-trained source model is a critical, yet computationally expensive, task in transfer learning. Model Transferability Estimation (MTE) methods address this by providing efficient proxy metrics to rank models without full fine-tuning. In practice, the choice of which MTE metric to use is often ad hoc or guided simply by a metric’s average historical performance. However, we observe that the effectiveness of MTE metrics is highly task-dependent and no single metric is universally optimal across all target datasets. To address this gap, we introduce MetaRank, a meta-learning framework for automatic, task-aware MTE metric selection. We formulate metric selection as a learning-to-rank problem. Rather than relying on conventional meta-features, MetaRank encodes textual descriptions of both datasets and MTE metrics using a pretrained language model, embedding them into a shared semantic space. A meta-predictor is then trained offline on diverse meta-tasks to learn the intricate relationship between dataset characteristics and metric mechanisms, optimized with a listwise objective that prioritizes correctly ranking the top-performing metrics. During the subsequent online phase, MetaRank efficiently ranks the candidate MTE metrics for a new, unseen target dataset based on its textual description, enabling practitioners to select the most appropriate metric a priori. Extensive experiments across 11 pretrained models and 11 target datasets demonstrate the strong effectiveness of our approach.

[119] Structure-Aware Prototype Guided Trusted Multi-View Classification

Haojian Huang, Jiahao Shi, Zhe Liu, Harold Haodong Chen, Han Fang, Hao Sun, Zhongjiang He

Main category: cs.CV

TL;DR: A novel trustworthy multi-view classification framework that uses prototypes to represent neighbor structures, enabling efficient intra-view learning and dynamic alignment of intra- and inter-view structures for more consistent cross-view consensus discovery.

Details

Motivation: Existing TMVC methods have high computational costs due to dense neighbor relationships, cannot ensure consistency across inter-view relationships, and use manually assigned weights for evidence aggregation without guaranteeing consistency in learned neighbor structures within class space.

Method: Introduces prototypes to represent neighbor structures of each view, simplifies intra-view neighbor relation learning, and enables dynamic alignment of intra- and inter-view structures to facilitate efficient and consistent cross-view consensus discovery.

Result: Extensive experiments on multiple public multi-view datasets demonstrate competitive downstream performance and robustness compared to prevalent TMVC methods.

Conclusion: The proposed framework effectively addresses limitations of existing TMVC approaches by providing more efficient and consistent multi-view classification with improved trustworthiness.

Abstract: Trustworthy multi-view classification (TMVC) addresses the challenge of achieving reliable decision-making in complex scenarios where multi-source information is heterogeneous, inconsistent, or even conflicting. Existing TMVC approaches predominantly rely on globally dense neighbor relationships to model intra-view dependencies, leading to high computational costs and an inability to directly ensure consistency across inter-view relationships. Furthermore, these methods typically aggregate evidence from different views through manually assigned weights, lacking guarantees that the learned multi-view neighbor structures are consistent within the class space, thus undermining the trustworthiness of classification outcomes. To overcome these limitations, we propose a novel TMVC framework that introduces prototypes to represent the neighbor structures of each view. By simplifying the learning of intra-view neighbor relations and enabling dynamic alignment of intra- and inter-view structure, our approach facilitates more efficient and consistent discovery of cross-view consensus. Extensive experiments on multiple public multi-view datasets demonstrate that our method achieves competitive downstream performance and robustness compared to prevalent TMVC methods.

[120] CameraMaster: Unified Camera Semantic-Parameter Control for Photography Retouching

Qirui Yang, Yang Yang, Ying Zeng, Xiaobin Hu, Bo Li, Huanjing Yue, Jingyu Yang, Peng-Tao Jiang

Main category: cs.CV

TL;DR: CameraMaster is a unified camera-aware framework for image retouching that decouples camera directives and parameter embeddings to achieve precise camera control while maintaining semantic-parameter alignment.

Details

Motivation: Existing methods for text-guided image retouching either rely on ambiguous text prompts that hinder precise camera control, or train separate heads for parameter adjustment which compromises scalability and multi-parameter composition.

Method: CameraMaster explicitly decouples camera directive and parameter embeddings, modulates both camera directive and content semantics with parameter embeddings, injects modulated directive via cross-attention, and uses directive and camera embeddings as conditioning signals throughout the denoising process.

Result: CameraMaster produces monotonic and near-linear responses to parameter variations, supports seamless multi-parameter composition, and significantly outperforms existing methods on a 78K image-prompt dataset.

Conclusion: The proposed CameraMaster framework successfully addresses the limitations of existing methods by providing precise camera parameter control while maintaining semantic consistency and enabling multi-parameter composition in image retouching.

Abstract: Text-guided diffusion models have greatly advanced image editing and generation. However, achieving physically consistent image retouching with precise parameter control (e.g., exposure, white balance, zoom) remains challenging. Existing methods either rely solely on ambiguous and entangled text prompts, which hinders precise camera control, or train separate heads/weights for parameter adjustment, which compromises scalability, multi-parameter composition, and sensitivity to subtle variations. To address these limitations, we propose CameraMaster, a unified camera-aware framework for image retouching. The key idea is to explicitly decouple the camera directive and then coherently integrate two critical information streams: a directive representation that captures the photographer’s intent, and a parameter embedding that encodes precise camera settings. CameraMaster first uses the camera parameter embedding to modulate both the camera directive and the content semantics. The modulated directive is then injected into the content features via cross-attention, yielding a strongly camera-sensitive semantic context. In addition, the directive and camera embeddings are injected as conditioning and gating signals into the time embedding, enabling unified, layer-wise modulation throughout the denoising process and enforcing tight semantic-parameter alignment. To train and evaluate CameraMaster, we construct a large-scale dataset of 78K image-prompt pairs annotated with camera parameters. Extensive experiments show that CameraMaster produces monotonic and near-linear responses to parameter variations, supports seamless multi-parameter composition, and significantly outperforms existing methods.

[121] CaptionQA: Is Your Caption as Useful as the Image Itself?

Shijia Yang, Yunong Liu, Bohan Zhai, Ximeng Sun, Zicheng Liu, Emad Barsoum, Manling Li, Chenfeng Xu

Main category: cs.CV

TL;DR: CaptionQA is a utility-based benchmark that evaluates image captions by measuring how well they support downstream tasks across 4 domains, revealing significant gaps between image and caption utility.

Details

Motivation: Current evaluation practices miss whether captions can effectively substitute for images in real downstream tasks, highlighting the need for a utility-focused benchmark.

Method: Built 33,027 densely annotated multiple-choice questions across 4 domains with fine-grained taxonomies; uses LLM to answer questions using captions alone to measure caption utility.

Result: Evaluation reveals substantial gaps between image and caption utility, with models showing up to 32% drop in performance when using captions instead of images.

Conclusion: CaptionQA provides a comprehensive framework for evaluating caption utility and reveals current captioning models’ limitations in preserving image-level information for downstream tasks.

Abstract: Image captions serve as efficient surrogates for visual content in multimodal systems such as retrieval, recommendation, and multi-step agentic inference pipelines. Yet current evaluation practices miss a fundamental question: Can captions stand-in for images in real downstream tasks? We propose a utility-based benchmark, CaptionQA, to evaluate model-generated captions, where caption quality is measured by how well it supports downstream tasks. CaptionQA is an extensible domain-dependent benchmark covering 4 domains–Natural, Document, E-commerce, and Embodied AI–each with fine-grained taxonomies (25 top-level and 69 subcategories) that identify useful information for domain-specific tasks. CaptionQA builds 33,027 densely annotated multiple-choice questions (50.3 per image on average) that explicitly require visual information to answer, providing a comprehensive probe of caption utility. In our evaluation protocol, an LLM answers these questions using captions alone, directly measuring whether captions preserve image-level utility and are utilizable by a downstream LLM. Evaluating state-of-the-art MLLMs reveals substantial gaps between the image and its caption utility. Notably, models nearly identical on traditional image-QA benchmarks lower by up to 32% in caption utility. We release CaptionQA along with an open-source pipeline for extension to new domains. The code is available at https://github.com/bronyayang/CaptionQA.

[122] FlowerDance: MeanFlow for Efficient and Refined 3D Dance Generation

Kaixing Yang, Xulong Tang, Ziqiao Peng, Xiangyue Zhang, Puwei Wang, Jun He, Hongyan Liu

Main category: cs.CV

TL;DR: FlowerDance is an efficient music-to-dance generation method that combines MeanFlow with Physical Consistency Constraints and uses a BiMamba-based backbone with Channel-Level Cross-Modal Fusion to generate high-quality dance motions with fast inference speed and low memory usage.

Details

Motivation: Existing music-to-dance generation methods have limited generation efficiency, leaving insufficient computational resources for high-fidelity 3D rendering, which constrains the expressiveness of 3D characters in real-world applications.

Method: Combines MeanFlow with Physical Consistency Constraints for high-quality motion generation with few sampling steps. Uses a BiMamba-based backbone with Channel-Level Cross-Modal Fusion for efficient non-autoregressive generation. Supports motion editing for interactive refinement.

Result: Achieves state-of-the-art results on AIST++ and FineDance datasets in both motion quality and generation efficiency, with significant improvements in inference speed and memory utilization.

Conclusion: FlowerDance provides an efficient solution for music-to-dance generation that enables high-quality motion with physical plausibility and artistic expressiveness while maintaining computational efficiency suitable for real-world applications.

Abstract: Music-to-dance generation aims to translate auditory signals into expressive human motion, with broad applications in virtual reality, choreography, and digital entertainment. Despite promising progress, the limited generation efficiency of existing methods leaves insufficient computational headroom for high-fidelity 3D rendering, thereby constraining the expressiveness of 3D characters during real-world applications. Thus, we propose FlowerDance, which not only generates refined motion with physical plausibility and artistic expressiveness, but also achieves significant generation efficiency on inference speed and memory utilization . Specifically, FlowerDance combines MeanFlow with Physical Consistency Constraints, which enables high-quality motion generation with only a few sampling steps. Moreover, FlowerDance leverages a simple but efficient model architecture with BiMamba-based backbone and Channel-Level Cross-Modal Fusion, which generates dance with efficient non-autoregressive manner. Meanwhile, FlowerDance supports motion editing, enabling users to interactively refine dance sequences. Extensive experiments on AIST++ and FineDance show that FlowerDance achieves state-of-the-art results in both motion quality and generation efficiency. Code will be released upon acceptance.

[123] LungNoduleAgent: A Collaborative Multi-Agent System for Precision Diagnosis of Lung Nodules

Cheng Yang, Hui Jin, Xinlei Yu, Zhipeng Wang, Yaoqun Liu, Fenglei Fan, Dajiang Lei, Gangyong Jia, Changmiao Wang, Ruiquan Ge

Main category: cs.CV

TL;DR: LungNoduleAgent is a collaborative multi-agent system for lung CT scan analysis that improves nodule description precision and malignancy grading through three specialized modules working sequentially.

Details

Motivation: Current multimodal LLMs struggle with accurate nodule morphology description and medical expertise integration, limiting clinical reliability. Multi-agent systems offer potential for balancing generality and precision in medical applications.

Method: Three-module system: Nodule Spotter coordinates detection models to identify nodules; Radiologist uses localized image description for CT reports; Doctor Agent System performs malignancy reasoning using images, reports, pathology knowledge base, and multi-agent framework.

Result: Outperforms mainstream vision-language models, agent systems, and expert models on two private datasets and LIDC-IDRI dataset, demonstrating superior performance in lung nodule diagnosis.

Conclusion: LungNoduleAgent shows the importance of region-level semantic alignment and multi-agent collaboration, serving as a promising foundational tool for clinical lung nodule analysis.

Abstract: Diagnosing lung cancer typically involves physicians identifying lung nodules in Computed tomography (CT) scans and generating diagnostic reports based on their morphological features and medical expertise. Although advancements have been made in using multimodal large language models for analyzing lung CT scans, challenges remain in accurately describing nodule morphology and incorporating medical expertise. These limitations affect the reliability and effectiveness of these models in clinical settings. Collaborative multi-agent systems offer a promising strategy for achieving a balance between generality and precision in medical applications, yet their potential in pathology has not been thoroughly explored. To bridge these gaps, we introduce LungNoduleAgent, an innovative collaborative multi-agent system specifically designed for analyzing lung CT scans. LungNoduleAgent streamlines the diagnostic process into sequential components, improving precision in describing nodules and grading malignancy through three primary modules. The first module, the Nodule Spotter, coordinates clinical detection models to accurately identify nodules. The second module, the Radiologist, integrates localized image description techniques to produce comprehensive CT reports. Finally, the Doctor Agent System performs malignancy reasoning by using images and CT reports, supported by a pathology knowledge base and a multi-agent system framework. Extensive testing on two private datasets and the public LIDC-IDRI dataset indicates that LungNoduleAgent surpasses mainstream vision-language models, agent systems, and advanced expert models. These results highlight the importance of region-level semantic alignment and multi-agent collaboration in diagnosing nodules. LungNoduleAgent stands out as a promising foundational tool for supporting clinical analyses of lung nodules.

[124] PG-ControlNet: A Physics-Guided ControlNet for Generative Spatially Varying Image Deblurring

Hakki Motorcu, Mujdat Cetin

Main category: cs.CV

TL;DR: A novel framework that combines generative models with explicit physical constraints for spatially varying image deblurring, achieving both physical accuracy and perceptual realism.

Details

Motivation: Current methods either produce over-smoothed results with artifacts (model-based approaches) or hallucinate details due to weak physical constraints (generative models). There's a need to bridge this gap between physical accuracy and perceptual quality.

Method: Models degradation as a dense continuum of high-dimensional compressed kernels to capture minute variations, then uses this descriptor field to condition a ControlNet architecture that strongly guides the diffusion sampling process.

Result: Outperforms state-of-the-art model-based methods and generative baselines in challenging, severely blurred scenarios, effectively bridging the gap between physical accuracy and perceptual realism.

Conclusion: The proposed framework successfully reconciles model-based and generative approaches by taming a powerful generative prior with explicit, dense physical constraints for superior spatially varying image deblurring.

Abstract: Spatially varying image deblurring remains a fundamentally ill-posed problem, especially when degradations arise from complex mixtures of motion and other forms of blur under significant noise. State-of-the-art learning-based approaches generally fall into two paradigms: model-based deep unrolling methods that enforce physical constraints by modeling the degradations, but often produce over-smoothed, artifact-laden textures, and generative models that achieve superior perceptual quality yet hallucinate details due to weak physical constraints. In this paper, we propose a novel framework that uniquely reconciles these paradigms by taming a powerful generative prior with explicit, dense physical constraints. Rather than oversimplifying the degradation field, we model it as a dense continuum of high-dimensional compressed kernels, ensuring that minute variations in motion and other degradation patterns are captured. We leverage this rich descriptor field to condition a ControlNet architecture, strongly guiding the diffusion sampling process. Extensive experiments demonstrate that our method effectively bridges the gap between physical accuracy and perceptual realism, outperforming state-of-the-art model-based methods as well as generative baselines in challenging, severely blurred scenarios.

[125] MUSE: Manipulating Unified Framework for Synthesizing Emotions in Images via Test-Time Optimization

Yingjie Xia, Xi Wang, Jinglei Shi, Vicky Kalogeiton, Jian Yang

Main category: cs.CV

TL;DR: MUSE is a unified framework for emotional image generation and editing that uses gradient-based optimization with off-the-shelf emotion classifiers, optimal timing selection, and multi-emotion loss to achieve superior emotional accuracy and semantic diversity.

Details

Motivation: Current Image Emotional Synthesis approaches artificially separate generation and editing tasks, creating inefficiencies and limiting applications where these tasks naturally intertwine, such as therapeutic interventions or storytelling.

Method: MUSE adopts a Test-Time Scaling strategy with three key components: (1) gradient-based optimization of emotional tokens using off-the-shelf emotion classifiers, (2) optimal timing identification using semantic similarity, and (3) multi-emotion loss to reduce interference from inherent and similar emotions.

Result: Experimental results show MUSE performs favorably against all methods for both generation and editing, improving emotional accuracy and semantic diversity while maintaining optimal balance between content, text adherence, and realistic emotional expression.

Conclusion: MUSE establishes a new paradigm for emotion synthesis by unifying generation and editing tasks without requiring additional diffusion model updates or specialized emotional datasets.

Abstract: Images evoke emotions that profoundly influence perception, often prioritized over content. Current Image Emotional Synthesis (IES) approaches artificially separate generation and editing tasks, creating inefficiencies and limiting applications where these tasks naturally intertwine, such as therapeutic interventions or storytelling. In this work, we introduce MUSE, the first unified framework capable of both emotional generation and editing. By adopting a strategy conceptually aligned with Test-Time Scaling (TTS) that widely used in both LLM and diffusion model communities, it avoids the requirement for additional updating diffusion model and specialized emotional synthesis datasets. More specifically, MUSE addresses three key questions in emotional synthesis: (1) HOW to stably guide synthesis by leveraging an off-the-shelf emotion classifier with gradient-based optimization of emotional tokens; (2) WHEN to introduce emotional guidance by identifying the optimal timing using semantic similarity as a supervisory signal; and (3) WHICH emotion to guide synthesis through a multi-emotion loss that reduces interference from inherent and similar emotions. Experimental results show that MUSE performs favorably against all methods for both generation and editing, improving emotional accuracy and semantic diversity while maintaining an optimal balance between desired content, adherence to text prompts, and realistic emotional expression. It establishes a new paradigm for emotion synthesis.

[126] Long-Term Alzheimers Disease Prediction: A Novel Image Generation Method Using Temporal Parameter Estimation with Normal Inverse Gamma Distribution on Uneven Time Series

Xin Hong, Xinze Sun, Yinhao Li, Yen-Wei Chen

Main category: cs.CV

TL;DR: Proposes T-NIG model for long-term Alzheimer’s Disease prediction using image generation with temporal parameter estimation in Normal Inverse Gamma Distribution to handle irregular time intervals.

Details

Motivation: Address difficulties in maintaining disease-related characteristics during long-term AD predictions when dealing with irregular time intervals in sequential medical imaging data.

Method: Uses T-NIG model that estimates temporal parameter within Normal Inverse Gamma Distribution, employs coordinate neighborhoods for feature identification, and incorporates uncertainty estimation to reduce epistemic and aleatoric uncertainties.

Result: Demonstrates state-of-the-art performance in both short-term and long-term prediction tasks, proficient in forecasting disease progression while maintaining disease-related characteristics despite irregular temporal data distribution.

Conclusion: T-NIG model effectively handles irregular time intervals in medical image sequences for AD prediction by incorporating temporal parameters and uncertainty estimation, achieving superior performance in maintaining disease characteristics over time.

Abstract: Image generation can provide physicians with an imaging diagnosis basis in the prediction of Alzheimer’s Disease (AD). Recent research has shown that long-term AD predictions by image generation often face difficulties maintaining disease-related characteristics when dealing with irregular time intervals in sequential data. Considering that the time-related aspects of the distribution can reflect changes in disease-related characteristics when images are distributed unevenly, this research proposes a model to estimate the temporal parameter within the Normal Inverse Gamma Distribution (T-NIG) to assist in generating images over the long term. The T-NIG model employs brain images from two different time points to create intermediate brain images, forecast future images, and predict the disease. T-NIG is designed by identifying features using coordinate neighborhoods. It incorporates a time parameter into the normal inverse gamma distribution to understand how features change in brain imaging sequences that have varying time intervals. Additionally, T-NIG utilizes uncertainty estimation to reduce both epistemic and aleatoric uncertainties in the model, which arise from insufficient temporal data. In particular, the T-NIG model demonstrates state-of-the-art performance in both short-term and long-term prediction tasks within the dataset. Experimental results indicate that T-NIG is proficient in forecasting disease progression while maintaining disease-related characteristics, even when faced with an irregular temporal data distribution.

[127] MIRA: Multimodal Iterative Reasoning Agent for Image Editing

Ziyun Zeng, Hang Hua, Jiebo Luo

Main category: cs.CV

TL;DR: MIRA is a lightweight multimodal reasoning agent that improves instruction-guided image editing by using an iterative perception-reasoning-action loop to handle complex instructions more accurately than standard diffusion models.

Details

Motivation: Diffusion-based editing models often fail to accurately interpret complex user instructions involving compositional relationships, contextual cues, or referring expressions, leading to semantic drift and unintended edits.

Method: Proposes MIRA, a plug-and-play multimodal reasoning agent that performs editing through iterative perception-reasoning-action loops, predicting atomic edit instructions step by step using visual feedback. Trained on a 150K multimodal dataset (MIRA-Editing) using SFT + GRPO pipeline.

Result: When paired with open-source image editing models (Flux.1-Kontext, Step1X-Edit, Qwen-Image-Edit), MIRA significantly improves semantic consistency and perceptual quality, achieving performance comparable to or exceeding proprietary systems like GPT-Image and Nano-Banana.

Conclusion: MIRA effectively addresses the limitations of current diffusion-based editing models by simulating multi-turn human-model interaction processes, enabling more accurate interpretation and execution of complex editing instructions.

Abstract: Instruction-guided image editing offers an intuitive way for users to edit images with natural language. However, diffusion-based editing models often struggle to accurately interpret complex user instructions, especially those involving compositional relationships, contextual cues, or referring expressions, leading to edits that drift semantically or fail to reflect the intended changes. We tackle this problem by proposing MIRA (Multimodal Iterative Reasoning Agent), a lightweight, plug-and-play multimodal reasoning agent that performs editing through an iterative perception-reasoning-action loop, effectively simulating multi-turn human-model interaction processes. Instead of issuing a single prompt or static plan, MIRA predicts atomic edit instructions step by step, using visual feedback to make its decisions. Our 150K multimodal tool-use dataset, MIRA-Editing, combined with a two-stage SFT + GRPO training pipeline, enables MIRA to perform reasoning and editing over complex editing instructions. When paired with open-source image editing models such as Flux.1-Kontext, Step1X-Edit, and Qwen-Image-Edit, MIRA significantly improves both semantic consistency and perceptual quality, achieving performance comparable to or exceeding proprietary systems such as GPT-Image and Nano-Banana.

[128] CLRecogEye : Curriculum Learning towards exploiting convolution features for Dynamic Iris Recognition

Geetanjali Sharma, Gaurav Jaswal, Aditya Nigam, Raghavendra Ramachandra

Main category: cs.CV

TL;DR: A novel iris authentication pipeline using 3D-CNN to capture spatio-temporal patterns, trained with curriculum learning and triplet/ArcFace loss for robust performance against rotation, scale, reflections, and blur.

Details

Motivation: Existing iris authentication methods lack robustness to variations like rotation, scale, reflections, and blur, and fail to effectively leverage spatio-temporal structure of iris patterns through simple point-to-point comparisons.

Method: Splits iris images into sequences of sub-images, processes with 3D-CNN to capture spatial-temporal features, and trains end-to-end with triplet and ArcFace loss in curriculum manner to embed temporal dependencies.

Result: The approach learns rich spatio-temporal representations that improve discriminability in deep metric space, yielding robust performance against various challenges.

Conclusion: Proposed framework provides a robust and generalizable solution for iris authentication by effectively modeling spatio-temporal feature dynamics through curriculum learning and 3D-CNN processing.

Abstract: Iris authentication algorithms have achieved impressive recognition performance, making them highly promising for real-world applications such as border control, citizen identification, and both criminal investigations and commercial systems. However, their robustness is still challenged by variations in rotation, scale, specular reflections, and defocus blur. In addition, most existing approaches rely on straightforward point-to-point comparisons, typically using cosine or L2 distance, without effectively leveraging the spatio-spatial-temporal structure of iris patterns. To address these limitations, we propose a novel and generalized matching pipeline that learns rich spatio-spatial-temporal representations of iris features. Our approach first splits each iris image along one dimension, generating a sequence of sub-images that serve as input to a 3D-CNN, enabling the network to capture both spatial and spatio-spatial-temporal cues. To further enhance the modeling of spatio-spatial-temporal feature dynamics, we train the model in curriculum manner. This design allows the network to embed temporal dependencies directly into the feature space, improving discriminability in the deep metric domain. The framework is trained end-to-end with triplet and ArcFace loss in a curriculum manner, enforcing highly discriminative embeddings despite challenges like rotation, scale, reflections, and blur. This design yields a robust and generalizable solution for iris authentication.Github code: https://github.com/GeetanjaliGTZ/CLRecogEye

[129] AnchorOPT: Towards Optimizing Dynamic Anchors for Adaptive Prompt Learning

Zheng Li, Yibing Song, Xin Zhang, Lei Luo, Xiang Li, Jian Yang

Main category: cs.CV

TL;DR: AnchorOPT introduces dynamic anchor-based prompt learning with learnable anchor values and adaptive positional relationships between anchors and soft tokens, achieving competitive performance without additional modules.

Details

Motivation: Existing prompt learning methods use static anchors that lack cross-task and stage-adaptive flexibility, limiting their generalization capabilities.

Method: Two-stage training: first learn dynamic anchor tokens from task-specific data, then freeze anchors and optimize soft tokens with a learnable position matrix that adapts to training stage and task context.

Result: Achieves performance comparable to or exceeding methods with additional learnable modules or regularization techniques, with consistent gains across diverse datasets.

Conclusion: AnchorOPT provides a plug-and-play dynamic anchor framework that enhances CLIP generalization through adaptive anchor values and positional relationships.

Abstract: Existing prompt learning methods, which are built upon CLIP models, leverage textual tokens as anchors to guide the learnable soft tokens. This guidance improves CLIP generalizations. However, these anchors-static in both value and position-lack cross-task and stage-adaptive flexibility. To address this limitation, we propose AnchorOPT, a dynamic anchor-based prompt learning framework. Specifically, AnchorOPT introduces dynamism in two key dimensions: (i) anchor values eschew handcrafted explicit textual tokens (e.g., “shape”, “color”), instead learning dynamically from task-specific data; and (ii) the positional relationship between anchor and soft tokens is no longer fixed but adaptively optimized via a learnable position matrix conditioned on the training stage and task context. Training occurs in two stages: we first learn the anchor tokens, then freeze and transfer them to the second stage for optimization of soft tokens and the position matrix. Extensive experiments demonstrate that using only a simple learnable anchor and position matrix achieves performance comparable to or exceeding some methods incorporating additional learnable modules or regularization techniques. As a plug-and-play module, AnchorOPT integrates seamlessly into existing frameworks, yielding consistent performance gains across diverse datasets. Code is publicly available at https://github.com/zhengli97/ATPrompt.

[130] Pygmalion Effect in Vision: Image-to-Clay Translation for Reflective Geometry Reconstruction

Gayoung Lee, Junho Kim, Jin-Hwa Kim, Junmo Kim

Main category: cs.CV

TL;DR: Pygmalion Effect in Vision framework suppresses reflections through image-to-clay translation to improve 3D reconstruction of reflective objects.

Details

Motivation: Reflection remains challenging in 3D reconstruction due to entanglement of appearance and geometry under view-dependent reflections.

Method: Dual-branch network with BRDF-based reflective branch and clay-guided branch, trained jointly using synthesized clay-like images as reflection-free supervision.

Result: Substantial improvement in normal accuracy and mesh completeness over existing reflection-handling methods on synthetic and real datasets.

Conclusion: Seeing by unshining (translating radiance into neutrality) serves as powerful inductive bias for reflective object geometry learning.

Abstract: Understanding reflection remains a long-standing challenge in 3D reconstruction due to the entanglement of appearance and geometry under view-dependent reflections. In this work, we present the Pygmalion Effect in Vision, a novel framework that metaphorically “sculpts” reflective objects into clay-like forms through image-to-clay translation. Inspired by the myth of Pygmalion, our method learns to suppress specular cues while preserving intrinsic geometric consistency, enabling robust reconstruction from multi-view images containing complex reflections. Specifically, we introduce a dual-branch network in which a BRDF-based reflective branch is complemented by a clay-guided branch that stabilizes geometry and refines surface normals. The two branches are trained jointly using the synthesized clay-like images, which provide a neutral, reflection-free supervision signal that complements the reflective views. Experiments on both synthetic and real datasets demonstrate substantial improvement in normal accuracy and mesh completeness over existing reflection-handling methods. Beyond technical gains, our framework reveals that seeing by unshining, translating radiance into neutrality, can serve as a powerful inductive bias for reflective object geometry learning.

[131] Scaling Foundation Models for Radar Scene Understanding

Pushkal Mishra, Kshitiz Bansal, Dinesh Bharadia

Main category: cs.CV

TL;DR: RadarFM is a radar foundation model that learns unified scene representations using structured spatial language supervision and hash-aware contrastive learning, enabling transfer across radar perception tasks.

Details

Motivation: Radar sensors provide reliable perception in adverse conditions, but existing radar approaches are fragmented and task-specific, preventing transfer across tasks. Foundation models have transformed vision and language but remain underexplored for radar sensing.

Method: Uses structured caption framework encoding vehicle distributions in radar coordinates and hash-aware contrastive learning that quantifies continuous scene similarity. Leverages CARLA simulator for large-scale annotated radar datasets across diverse driving scenarios.

Result: Proposes localization-aware metrics for spatial accuracy assessment beyond traditional detection measures. Demonstrates unified scene-level representations that enable transfer learning across radar perception tasks.

Conclusion: RadarFM provides a foundation model approach for radar sensing that overcomes task-specific fragmentation through unified representations and enables fine-grained spatial reasoning across diverse driving scenarios.

Abstract: Radar sensors provide reliable perception across adverse weather, lighting, and long-range conditions. Recent advances in foundation models have transformed visual and language understanding, yet their integration with radar sensing remains largely underexplored. Existing radar approaches are fragmented and task-specific; each downstream task employs distinct architectures and training objectives, preventing transfer across tasks. In this work, we introduce RadarFM: a radar foundation model that learns unified scene-level representations through structured spatial language supervision. We make two key contributions: (1) a structured caption framework that encodes vehicle distributions in native radar coordinates, and (2) a hash-aware contrastive learning objective that quantifies continuous scene similarity rather than binary matching, enabling fine-grained spatial reasoning. Leveraging the CARLA simulator, we generate large-scale, well-annotated radar datasets across diverse driving scenarios. We also propose localization-aware metrics that assess spatial accuracy beyond traditional detection measures.

[132] EM-KD: Distilling Efficient Multimodal Large Language Model with Unbalanced Vision Tokens

Ze Feng, Sen Yang, Boqiang Duan, Wankou Yang, Jingdong Wang

Main category: cs.CV

TL;DR: EM-KD enhances Efficient MLLMs with knowledge distillation by addressing unbalanced vision tokens through Manhattan distance calculation and Hungarian matching, followed by Vision-Language Affinity Distillation and Vision Semantic Distillation strategies.

Details

Motivation: Existing efficient MLLMs compress vision tokens to reduce resource consumption but lose visual information, degrading comprehension. Prior knowledge distillation methods overlook fundamental differences in fine-grained vision comprehension caused by unbalanced vision tokens between student and teacher models.

Method: 1) Calculate Manhattan distance between teacher and student vision logits, then align them spatially using Hungarian matching algorithm. 2) Implement two distillation strategies: Vision-Language Affinity Distillation (VLAD) that minimizes smooth L1 distance between affinity matrices, and Vision Semantic Distillation (VSD) that uses reverse KL divergence to measure probability distributions of aligned vision logits.

Result: Comprehensive evaluation shows EM-KD trained models outperform prior Efficient MLLMs with large margins in both accuracy and efficiency. Compared to previous distillation methods with fair comparison using the proposed vision token matching strategy, EM-KD achieves better performance.

Conclusion: EM-KD effectively addresses the challenge of unbalanced vision tokens in knowledge distillation for efficient MLLMs, demonstrating superior performance through spatial alignment and dual distillation strategies.

Abstract: Efficient Multimodal Large Language Models (MLLMs) compress vision tokens to reduce resource consumption, but the loss of visual information can degrade comprehension capabilities. Although some priors introduce Knowledge Distillation to enhance student models, they overlook the fundamental differences in fine-grained vision comprehension caused by unbalanced vision tokens between the efficient student and vanilla teacher. In this paper, we propose EM-KD, a novel paradigm that enhances the Efficient MLLMs with Knowledge Distillation. To overcome the challenge of unbalanced vision tokens, we first calculate the Manhattan distance between the vision logits of teacher and student, and then align them in the spatial dimension with the Hungarian matching algorithm. After alignment, EM-KD introduces two distillation strategies: 1) Vision-Language Affinity Distillation (VLAD) and 2) Vision Semantic Distillation (VSD). Specifically, VLAD calculates the affinity matrix between text tokens and aligned vision tokens, and minimizes the smooth L1 distance of the student and the teacher affinity matrices. Considering the semantic richness of vision logits in the final layer, VSD employs the reverse KL divergence to measure the discrete probability distributions of the aligned vision logits over the vocabulary space. Comprehensive evaluation on diverse benchmarks demonstrates that EM-KD trained model outperforms prior Efficient MLLMs on both accuracy and efficiency with a large margin, validating its effectiveness. Compared with previous distillation methods, which are equipped with our proposed vision token matching strategy for fair comparison, EM-KD also achieves better performance.

[133] Do Reasoning Vision-Language Models Inversely Scale in Test-Time Compute? A Distractor-centric Empirical Analysis

Jiyun Bae, Hyunjong Ok, Sangwoo Mo, Jaeho Lee

Main category: cs.CV

TL;DR: Visual distractors in VLMs cause inverse scaling (reduced accuracy) but unlike textual distractors, don’t increase reasoning length. Tracking attribute counts in reasoning traces reveals how distractors affect performance.

Details

Motivation: To investigate whether visual distractors cause similar inverse scaling effects as observed in language models, and understand how distractors affect vision-language models' reasoning.

Method: Created Idis dataset with systematic visual distractors (semantic, numerical, spatial), analyzed reasoning traces by tracking attribute counts, and tested on Waterbirds benchmark with proposed prompting strategy.

Result: Visual distractors cause inverse scaling (accuracy decreases) but don’t increase reasoning length like textual distractors. Attribute count tracking reveals distractor-reasoning-accuracy relationships. Prompting strategy helps mitigate bias.

Conclusion: Visual distractors fundamentally differ from textual ones in their scaling effects. Understanding attribute tracking in reasoning traces provides insights for mitigating distractor effects and bias in VLMs.

Abstract: How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior studies on language models have reported an inverse scaling effect, where textual distractors lead to longer but less effective reasoning. To investigate whether similar phenomena occur in multimodal settings, we introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic, numerical, and spatial dimensions. Our analyses reveal that visual distractors differ fundamentally from textual ones: although inverse scaling persists, adding visual distractors reduces accuracy without increasing reasoning length. We further show that tracking attribute counts within reasoning traces provides key insights into how distractors, reasoning length, and accuracy interact. Finally, we demonstrate that these trends extend to established visual bias benchmarks such as Waterbirds, and we propose a simple prompting strategy to mitigate bias-driven predictions in reasoning models.

[134] FaithFusion: Harmonizing Reconstruction and Generation via Pixel-wise Information Gain

YuAn Wang, Xiaofan Li, Chi Huang, Wenhao Zhang, Hao Li, Bosheng Wang, Xun Sun, Jun Wang

Main category: cs.CV

TL;DR: FaithFusion is a 3DGS-diffusion fusion framework that uses Expected Information Gain (EIG) to maintain geometric fidelity while synthesizing visually plausible appearance under large viewpoint shifts in driving-scene reconstruction and 3D scene generation.

Details

Motivation: To address the challenges of fusing geometry-based 3DGS and appearance-driven diffusion models, which often lead to over-restoration and geometric drift due to the absence of pixel-wise, 3D-consistent editing criteria.

Method: Introduces Expected Information Gain (EIG) as a unified policy for coherent spatio-temporal synthesis. EIG guides diffusion as a spatial prior to refine high-uncertainty regions and uses pixel-level weighting to distill edits back into 3DGS, creating a plug-and-play system without extra prior conditions or structural modifications.

Result: Extensive experiments on the Waymo dataset demonstrate state-of-the-art performance across NTA-IoU, NTL-IoU, and FID metrics, maintaining an FID of 107.47 even at 6 meters lane shift.

Conclusion: FaithFusion effectively addresses the fusion challenges between 3DGS and diffusion models, achieving superior performance in controllable driving-scene reconstruction and 3D scene generation while maintaining geometric fidelity and visual plausibility.

Abstract: In controllable driving-scene reconstruction and 3D scene generation, maintaining geometric fidelity while synthesizing visually plausible appearance under large viewpoint shifts is crucial. However, effective fusion of geometry-based 3DGS and appearance-driven diffusion models faces inherent challenges, as the absence of pixel-wise, 3D-consistent editing criteria often leads to over-restoration and geometric drift. To address these issues, we introduce \textbf{FaithFusion}, a 3DGS-diffusion fusion framework driven by pixel-wise Expected Information Gain (EIG). EIG acts as a unified policy for coherent spatio-temporal synthesis: it guides diffusion as a spatial prior to refine high-uncertainty regions, while its pixel-level weighting distills the edits back into 3DGS. The resulting plug-and-play system is free from extra prior conditions and structural modifications.Extensive experiments on the Waymo dataset demonstrate that our approach attains SOTA performance across NTA-IoU, NTL-IoU, and FID, maintaining an FID of 107.47 even at 6 meters lane shift. Our code is available at https://github.com/wangyuanbiubiubiu/FaithFusion.

[135] Deformation-aware Temporal Generation for Early Prediction of Alzheimers Disease

Xin Honga, Jie Lin, Minghui Wang

Main category: cs.CV

TL;DR: DATGN is a novel deep learning method that generates future MRI images to predict Alzheimer’s disease progression by learning morphological changes and handling missing temporal data.

Details

Motivation: Early prediction of Alzheimer's disease can slow its progression, and current methods rely on manual feature extraction from brain images. There's a need for automated methods that can handle incomplete temporal MRI sequences.

Method: Deformation-Aware Temporal Generative Network (DATGN) first interpolates incomplete MRI sequences, then uses a bidirectional temporal deformation-aware module to generate future MRI images that follow disease progression patterns.

Result: DATGN achieved competitive PSNR and MMSE metrics on ADNI dataset. When synthetic data was used with classification methods, accuracy improved significantly: 6.21% to 16% for AD vs. NC, and 7.34% to 21.25% for AD vs. MCI vs. NC classification.

Conclusion: DATGN successfully generates realistic future MRI images that capture brain atrophy trends in Alzheimer’s disease, enabling early prediction and improving classification accuracy when used as synthetic training data.

Abstract: Alzheimer’s disease (AD), a degenerative brain condition, can benefit from early prediction to slow its progression. As the disease progresses, patients typically undergo brain atrophy. Current prediction methods for Alzheimers disease largely involve analyzing morphological changes in brain images through manual feature extraction. This paper proposes a novel method, the Deformation-Aware Temporal Generative Network (DATGN), to automate the learning of morphological changes in brain images about disease progression for early prediction. Given the common occurrence of missing data in the temporal sequences of MRI images, DATGN initially interpolates incomplete sequences. Subsequently, a bidirectional temporal deformation-aware module guides the network in generating future MRI images that adhere to the disease’s progression, facilitating early prediction of Alzheimer’s disease. DATGN was tested for the generation of temporal sequences of future MRI images using the ADNI dataset, and the experimental results are competitive in terms of PSNR and MMSE image quality metrics. Furthermore, when DATGN-generated synthetic data was integrated into the SVM vs. CNN vs. 3DCNN-based classification methods, significant improvements were achieved from 6. 21% to 16% in AD vs. NC classification accuracy and from 7. 34% to 21. 25% in AD vs. MCI vs. NC classification accuracy. The qualitative visualization results indicate that DATGN produces MRI images consistent with the brain atrophy trend in Alzheimer’s disease, enabling early disease prediction.

[136] Which Layer Causes Distribution Deviation? Entropy-Guided Adaptive Pruning for Diffusion and Flow Models

Changlin Li, Jiawei Zhang, Zeyi Shi, Zongxin Yang, Zhihui Li, Xiaojun Chang

Main category: cs.CV

TL;DR: EntPruner is an entropy-guided automatic progressive pruning framework for diffusion and flow models that reduces parameter redundancy while maintaining generation quality.

Details

Motivation: Large-scale vision generative models have significant parameter redundancy when transferred to downstream tasks, requiring efficient pruning methods that preserve output diversity and condition-fidelity.

Method: Uses entropy-guided pruning with Conditional Entropy Deviation (CED) metric to assess block importance, and a zero-shot adaptive pruning framework that automatically determines when and how much to prune during training.

Result: Achieves up to 2.22× inference speedup while maintaining competitive generation quality on ImageNet and three downstream datasets with DiT and SiT models.

Conclusion: EntPruner effectively reduces model redundancy in generative models while preserving performance, offering an efficient transfer learning solution for diffusion and flow models.

Abstract: Large-scale vision generative models, including diffusion and flow models, have demonstrated remarkable performance in visual generation tasks. However, transferring these pre-trained models to downstream tasks often results in significant parameter redundancy. In this paper, we propose EntPruner, an entropy-guided automatic progressive pruning framework for diffusion and flow models. First, we introduce entropy-guided pruning, a block-level importance assessment strategy specifically designed for generative models. Unlike discriminative models, generative models require preserving the diversity and condition-fidelity of the output distribution. As the importance of each module can vary significantly across downstream tasks, EntPruner prioritizes pruning of less important blocks using data-dependent Conditional Entropy Deviation (CED) as a guiding metric. CED quantifies how much the distribution diverges from the learned conditional data distribution after removing a block. Second, we propose a zero-shot adaptive pruning framework to automatically determine when and how much to prune during training. This dynamic strategy avoids the pitfalls of one-shot pruning, mitigating mode collapse, and preserving model performance. Extensive experiments on DiT and SiT models demonstrate the effectiveness of EntPruner, achieving up to 2.22$\times$ inference speedup while maintaining competitive generation quality on ImageNet and three downstream datasets.

[137] CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion

Dianbing Xi, Jiepeng Wang, Yuanzhi Liang, Xi Qiu, Jialun Liu, Hao Pan, Yuchi Huo, Rui Wang, Haibin Huang, Chi Zhang, Xuelong Li

Main category: cs.CV

TL;DR: CtrlVDiff is a unified diffusion model that handles both video understanding and controllable generation using multiple graphics-based modalities (depth, normals, segmentation, edges, intrinsics) with strong temporal coherence.

Details

Motivation: Geometry-only cues are insufficient for physically meaningful video edits as they under-constrain appearance, materials, and illumination, causing temporal drift. Additional graphics-based modalities provide complementary constraints for better understanding and precise control.

Method: Proposes CtrlVDiff with Hybrid Modality Control Strategy (HMCS) that routes and fuses features from multiple modalities. Uses MMVideo dataset with temporally aligned real-and-synthetic data across modalities and captions for training.

Result: Superior controllability and fidelity in video generation, enabling layer-wise edits (relighting, material adjustment, object insertion) while maintaining temporal coherence. Outperforms state-of-the-art baselines and remains robust to missing modalities.

Conclusion: Enriching video models with graphics-based modalities enables precise, predictable control for both understanding and generation tasks, overcoming limitations of geometry-only approaches while maintaining temporal consistency.

Abstract: We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework. Our key insights are two-fold: geometry-only cues (e.g., depth, edges) are insufficient: they specify layout but under-constrain appearance, materials, and illumination, limiting physically meaningful edits such as relighting or material swaps and often causing temporal drift. Enriching the model with additional graphics-based modalities (intrinsics and semantics) provides complementary constraints that both disambiguate understanding and enable precise, predictable control during generation. However, building a single model that uses many heterogeneous cues introduces two core difficulties. Architecturally, the model must accept any subset of modalities, remain robust to missing inputs, and inject control signals without sacrificing temporal consistency. Data-wise, training demands large-scale, temporally aligned supervision that ties real videos to per-pixel multimodal annotations. We then propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic), and re-renders videos from any chosen subset with strong temporal coherence. To enable this, we build MMVideo, a hybrid real-and-synthetic dataset aligned across modalities and captions. Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.

[138] G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning

Wenbo Hu, Jingli Lin, Yilin Long, Yunlong Ran, Lihan Jiang, Yifan Wang, Chenming Zhu, Runsen Xu, Tai Wang, Jiangmiao Pang

Main category: cs.CV

TL;DR: G²VLM is a geometry-grounded vision-language model that bridges 3D spatial reconstruction and spatial understanding by leveraging learned 3D visual geometry features to enhance spatial reasoning tasks.

Details

Motivation: Current Vision-Language Models lack robustness in spatial intelligence due to the absence of visual geometry learning capable of reconstructing 3D space from 2D images.

Method: G²VLM natively leverages learned 3D visual geometry features to predict 3D attributes and enhance spatial reasoning via in-context learning and interleaved reasoning. It trains on abundant multi-view image and video data while benefiting from 3D visual priors.

Result: G²VLM achieves comparable results to state-of-the-art feed-forward 3D reconstruction models and achieves better or competitive results across spatial understanding and reasoning tasks.

Conclusion: By unifying a semantically strong VLM with low-level 3D vision tasks, G²VLM serves as a strong baseline for the community and unlocks future applications like 3D scene editing.

Abstract: Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM, a geometry grounded vision-language model that bridges two fundamental aspects of spatial intelligence: spatial 3D reconstruction and spatial understanding. G$^2$VLM natively leverages learned 3D visual geometry features to directly predict 3D attributes and enhance spatial reasoning tasks via in-context learning and interleaved reasoning. Our unified design is highly scalable for spatial understanding: it trains on abundant multi-view image and video data, while simultaneously leveraging the benefits of 3D visual priors that are typically only derived from hard-to-collect annotations. Experimental results demonstrate G$^2$VLM is proficient in both tasks, achieving comparable results to state-of-the-art feed-forward 3D reconstruction models and achieving better or competitive results across spatial understanding and reasoning tasks. By unifying a semantically strong VLM with low-level 3D vision tasks, we hope G$^2$VLM can serve as a strong baseline for the community and unlock more future applications, such as 3D scene editing.

[139] DeepRFTv2: Kernel-level Learning for Image Deblurring

Xintian Mao, Haofei Song, Yin-Nian Liu, Qingli Li, Yan Wang

Main category: cs.CV

TL;DR: Proposes Fourier Kernel Estimator (FKE) that learns blur kernels in Fourier space for improved image deblurring, achieving state-of-the-art results with physically meaningful kernel learning.

Details

Motivation: Current deep networks only perform pixel-level restoration and fail to understand the essence of blur at the kernel level, which is crucial for effective deblurring.

Method: Uses Fourier Kernel Estimator to convert spatial convolution to Fourier multiplication, applies kernel convolution to features instead of images, and employs decoupled multi-scale architecture with reversible sub-unets.

Result: Achieves state-of-the-art motion deblurring performance, learns physically meaningful kernels, and shows potential for other kernel-related problems.

Conclusion: The proposed FKE enables effective kernel-level blur process learning with low complexity and no additional supervision, significantly improving deblurring performance.

Abstract: It is well-known that if a network aims to learn how to deblur, it should understand the blur process. Blurring is naturally caused by the convolution of the sharp image with the blur kernel. Thus, allowing the network to learn the blur process in the kernel-level can significantly improve the image deblurring performance. But, current deep networks are still at the pixel-level learning stage, either performing end-to-end pixel-level restoration or stage-wise pseudo kernel-level restoration, failing to enable the deblur model to understand the essence of the blur. To this end, we propose Fourier Kernel Estimator (FKE), which considers the activation operation in Fourier space and converts the convolution problem in the spatial domain to a multiplication problem in Fourier space. Our FKE, jointly optimized with the deblur model, enables the network to learn the kernel-level blur process with low complexity and without any additional supervision. Furthermore, we change the convolution object of the kernel from image" to network extracted feature", whose rich semantic and structural information is more suitable to blur process learning. With the convolution of the feature and the estimated kernel, our model can learn the essence of blur in kernel-level. To further improve the efficiency of feature extraction, we design a decoupled multi-scale architecture with multiple hierarchical sub-unets with a reversible strategy, which allows better multi-scale encoding and decoding in low training memory. Extensive experiments indicate that our method achieves state-of-the-art motion deblurring results and show potential for handling other kernel-related problems. Analysis also shows our kernel estimator is able to learn physically meaningful kernels. The code will be available at https://github.com/DeepMed-Lab-ECNU/Single-Image-Deblur.

[140] Efficient Training for Human Video Generation with Entropy-Guided Prioritized Progressive Learning

Changlin Li, Jiawei Zhang, Shuhao Liu, Sihao Lin, Zeyi Shi, Zhihui Li, Xiaojun Chang

Main category: cs.CV

TL;DR: Ent-Prog is an efficient training framework for human video generation using diffusion models that reduces training time and GPU memory consumption while maintaining performance through entropy-guided prioritized progressive learning.

Details

Motivation: Human video generation with diffusion models faces challenges of high computational cost and substantial memory consumption when training on high-resolution, multi-frame data.

Method: Uses Conditional Entropy Inflation (CEI) to assess component importance for prioritized training, and an adaptive progressive schedule that increases computational complexity based on convergence efficiency.

Result: Achieves up to 2.2× training speedup and 2.4× GPU memory reduction without compromising generative performance across three datasets.

Conclusion: Ent-Prog provides an effective framework for efficient training of diffusion models in human video generation, significantly reducing computational requirements while maintaining model quality.

Abstract: Human video generation has advanced rapidly with the development of diffusion models, but the high computational cost and substantial memory consumption associated with training these models on high-resolution, multi-frame data pose significant challenges. In this paper, we propose Entropy-Guided Prioritized Progressive Learning (Ent-Prog), an efficient training framework tailored for diffusion models on human video generation. First, we introduce Conditional Entropy Inflation (CEI) to assess the importance of different model components on the target conditional generation task, enabling prioritized training of the most critical components. Second, we introduce an adaptive progressive schedule that adaptively increases computational complexity during training by measuring the convergence efficiency. Ent-Prog reduces both training time and GPU memory consumption while maintaining model performance. Extensive experiments across three datasets, demonstrate the effectiveness of Ent-Prog, achieving up to 2.2$\times$ training speedup and 2.4$\times$ GPU memory reduction without compromising generative performance.

[141] Referring Video Object Segmentation with Cross-Modality Proxy Queries

Baoli Sun, Xinzhu Ma, Ning Wang, Zhihui Wang, Zhiyong Wang

Main category: cs.CV

TL;DR: ProxyFormer introduces proxy queries to integrate visual and text semantics for referring video object segmentation, addressing limitations in cross-modality alignment and inter-frame dependency modeling.

Details

Motivation: Existing RVOS methods have two key limitations: (1) conditional queries lack inter-frame dependency modeling, making target tracking difficult amid frame variations, and (2) textual constraints are integrated too late, causing video features to potentially focus on non-referred objects.

Method: ProxyFormer uses proxy queries to integrate visual and text semantics, progressively updating them across video feature encoder stages. It decouples cross-modality interactions into temporal and spatial dimensions for efficiency, and employs Joint Semantic Consistency training to align proxy queries with video-text pairs.

Result: Comprehensive experiments on four RVOS benchmarks demonstrate ProxyFormer’s superiority over state-of-the-art methods, showing improved accuracy and coherence in object tracking.

Conclusion: ProxyFormer effectively addresses cross-modality alignment challenges in RVOS through proxy queries that establish inter-frame dependencies and ensure video features focus on referred objects, achieving state-of-the-art performance.

Abstract: Referring video object segmentation (RVOS) is an emerging cross-modality task that aims to generate pixel-level maps of the target objects referred by given textual expressions. The main concept involves learning an accurate alignment of visual elements and language expressions within a semantic space. Recent approaches address cross-modality alignment through conditional queries, tracking the target object using a query-response based mechanism built upon transformer structure. However, they exhibit two limitations: (1) these conditional queries lack inter-frame dependency and variation modeling, making accurate target tracking challenging amid significant frame-to-frame variations; and (2) they integrate textual constraints belatedly, which may cause the video features potentially focus on the non-referred objects. Therefore, we propose a novel RVOS architecture called ProxyFormer, which introduces a set of proxy queries to integrate visual and text semantics and facilitate the flow of semantics between them. By progressively updating and propagating proxy queries across multiple stages of video feature encoder, ProxyFormer ensures that the video features are focused on the object of interest. This dynamic evolution also enables the establishment of inter-frame dependencies, enhancing the accuracy and coherence of object tracking. To mitigate high computational costs, we decouple cross-modality interactions into temporal and spatial dimensions. Additionally, we design a Joint Semantic Consistency (JSC) training strategy to align semantic consensus between the proxy queries and the combined video-text pairs. Comprehensive experiments on four widely used RVOS benchmarks demonstrate the superiority of our ProxyFormer to the state-of-the-art methods.

[142] TEAR: Temporal-aware Automated Red-teaming for Text-to-Video Models

Jiaming He, Guanyu Hou, Hongwei Li, Zhicong Huang, Kangjie Chen, Yi Yu, Wenbo Jiang, Guowen Xu, Tianwei Zhang

Main category: cs.CV

TL;DR: TEAR is an automated red-teaming framework that uncovers safety risks in Text-to-Video models by exploiting temporal dynamics through specially crafted prompts that appear innocuous but elicit policy-violating videos.

Details

Motivation: Existing safety evaluation methods for static images and text are insufficient for capturing the complex temporal dynamics in video generation, creating critical safety challenges in T2V models.

Method: TEAR uses a temporal-aware test generator with two-stage optimization (initial training and temporal-aware online preference learning) to create textually innocuous prompts that exploit temporal sequencing, with cyclical refinement to improve stealthiness and effectiveness.

Result: TEAR achieves over 80% attack success rate across open-source and commercial T2V systems, significantly outperforming prior best results of 57%.

Conclusion: The framework effectively uncovers safety vulnerabilities in T2V models by focusing on temporal dynamics, demonstrating the need for specialized safety evaluation methods for video generation systems.

Abstract: Text-to-Video (T2V) models are capable of synthesizing high-quality, temporally coherent dynamic video content, but the diverse generation also inherently introduces critical safety challenges. Existing safety evaluation methods,which focus on static image and text generation, are insufficient to capture the complex temporal dynamics in video generation. To address this, we propose a TEmporal-aware Automated Red-teaming framework, named TEAR, an automated framework designed to uncover safety risks specifically linked to the dynamic temporal sequencing of T2V models. TEAR employs a temporal-aware test generator optimized via a two-stage approach: initial generator training and temporal-aware online preference learning, to craft textually innocuous prompts that exploit temporal dynamics to elicit policy-violating video output. And a refine model is adopted to improve the prompt stealthiness and adversarial effectiveness cyclically. Extensive experimental evaluation demonstrates the effectiveness of TEAR across open-source and commercial T2V systems with over 80% attack success rate, a significant boost from prior best result of 57%.

[143] LLaVA-UHD v3: Progressive Visual Compression for Efficient Native-Resolution Encoding in MLLMs

Shichu Sun, Yichen Zhang, Haolin Song, Zonghao Guo, Chi Chen, Yidan Zhang, Yuan Yao, Zhiyuan Liu, Maosong Sun

Main category: cs.CV

TL;DR: LLaVA-UHD v3 introduces Progressive Visual Compression (PVC) to enable efficient native-resolution visual encoding in MLLMs, reducing computational overhead while maintaining competitive performance.

Details

Motivation: Global native-resolution visual encoding in MLLMs enhances capability but incurs significant computational overhead, creating a need for more efficient methods.

Method: Progressive Visual Compression (PVC) with two modules: refined patch embedding for flexible patch-size scaling, and windowed token compression hierarchically deployed across ViT layers to progressively aggregate local token representations.

Result: ViT-UHD achieves competitive performance with MoonViT while reducing TTFT by 2.4x. LLaVA-UHD v3 achieves competitive performance to Qwen2-VL while reducing TTFT by 1.9x.

Conclusion: PVC enables efficient native-resolution encoding in ViTs while preserving generality, offering a promising approach for developing efficient MLLMs with reduced computational costs.

Abstract: Visual encoding followed by token condensing has become the standard architectural paradigm in multi-modal large language models (MLLMs). Many recent MLLMs increasingly favor global native- resolution visual encoding over slice-based methods. To investigate this trend, we systematically compare their behavior on vision-language understanding and attention patterns, revealing that global encoding enhances overall capability but at the expense of greater computational overhead. To address this issue, we present LLaVA-UHD v3, an MLLM centered upon our proposed Progressive Visual Compression (PVC) method, which can be seamlessly integrated into standard Vision Transformer (ViT) to enable efficient native-resolution encoding. The PVC approach consists of two key modules: (i) refined patch embedding, which supports flexible patch-size scaling for fine-grained visual model- ing, (ii) windowed token compression, hierarchically deployed across ViT layers to progressively aggregate local token representations. Jointly modulated by these two modules, a widely pretrained ViT can be reconfigured into an efficient architecture while largely preserving generality. Evaluated across extensive benchmarks, the transformed ViT, termed ViT-UHD, demonstrates competitive performance with MoonViT while reducing TTFT (time-to-first-token) by 2.4x, when developed within an identical MLLM architecture. Building upon ViT-UHD, LLaVA-UHD v3 also achieves competitive performance to Qwen2-VL, while further reducing TTFT by 1.9x. We will release all code and checkpoints to support future research on efficient MLLMs.

[144] Progress by Pieces: Test-Time Scaling for Autoregressive Image Generation

Joonhyung Park, Hyeongwon Jang, Joowon Kim, Eunho Yang

Main category: cs.CV

TL;DR: GridAR is a test-time scaling framework for visual autoregressive models that uses grid-partitioned progressive generation with early pruning and layout-specified prompt reformulation to improve image generation quality while reducing computational costs.

Details

Motivation: Test-time computation scaling has been successful in natural language tasks but remains unexplored for visual AR models. Naive approaches like Best-of-N are suboptimal due to wasted computation on erroneous trajectories and lack of canvas blueprint in raster-scan decoding.

Method: GridAR employs grid-partitioned progressive generation where multiple partial candidates are generated per position, infeasible ones are pruned early, and viable ones become anchors for subsequent decoding. It also uses layout-specified prompt reformulation to infer feasible layouts from partial views.

Result: With N=4, GridAR outperforms Best-of-N (N=8) by 14.4% on T2I-CompBench++ while reducing cost by 25.6%. It also shows comparable edit quality and 13.9% gain in semantic preservation on PIE-Bench over larger-N baselines.

Conclusion: GridAR effectively addresses the challenges of test-time scaling for visual AR models, achieving higher quality results with lower computational costs and generalizing well to image editing tasks.

Abstract: Recent visual autoregressive (AR) models have shown promising capabilities in text-to-image generation, operating in a manner similar to large language models. While test-time computation scaling has brought remarkable success in enabling reasoning-enhanced outputs for challenging natural language tasks, its adaptation to visual AR models remains unexplored and poses unique challenges. Naively applying test-time scaling strategies such as Best-of-N can be suboptimal: they consume full-length computation on erroneous generation trajectories, while the raster-scan decoding scheme lacks a blueprint of the entire canvas, limiting scaling benefits as only a few prompt-aligned candidates are generated. To address these, we introduce GridAR, a test-time scaling framework designed to elicit the best possible results from visual AR models. GridAR employs a grid-partitioned progressive generation scheme in which multiple partial candidates for the same position are generated within a canvas, infeasible ones are pruned early, and viable ones are fixed as anchors to guide subsequent decoding. Coupled with this, we present a layout-specified prompt reformulation strategy that inspects partial views to infer a feasible layout for satisfying the prompt. The reformulated prompt then guides subsequent image generation to mitigate the blueprint deficiency. Together, GridAR achieves higher-quality results under limited test-time scaling: with N=4, it even outperforms Best-of-N (N=8) by 14.4% on T2I-CompBench++ while reducing cost by 25.6%. It also generalizes to autoregressive image editing, showing comparable edit quality and a 13.9% gain in semantic preservation on PIE-Bench over larger-N baselines.

[145] Scenes as Tokens: Multi-Scale Normal Distributions Transform Tokenizer for General 3D Vision-Language Understanding

Yutao Tang, Cheng Zhao, Gaurav Mittal, Rohith Kukkala, Rama Chellappa, Cheng Peng, Mei Chen

Main category: cs.CV

TL;DR: NDTokenizer3D is a generalist 3D vision-language model that uses multi-scale NDT representation and a novel tokenization pipeline to unify various 3D scene understanding tasks including referring segmentation, VQA, and dense captioning.

Details

Motivation: Current 3D VLMs struggle with effectively tokenizing 3D scenes into holistic tokens and leveraging them across diverse understanding tasks, while also lacking natural support for human interactions.

Method: Three-stage scene tokenization pipeline using Multi-Scale Normal Distributions Transform (NDT) representation and Multi-Scale NDT Decoder (MSDec) that fuses cross-scale features to produce holistic scene tokens. MSDec also serves as interface for human-interactive prompting and segmentation decoding.

Result: Achieves remarkable improvements in 3D Referring Segmentation, 3D Visual Question Answering, and 3D Dense Captioning tasks.

Conclusion: NDTokenizer3D provides a compact, unified architecture that bridges language reasoning with 3D spatial understanding, offering a fine-grained, general-purpose 3D VLM solution.

Abstract: Recent advances in 3D vision-language models (VLMs) highlight a strong potential for 3D scene understanding and reasoning. However, effectively tokenizing 3D scenes into holistic scene tokens, and leveraging these tokens across diverse 3D understanding tasks, remain highly challenging. We present NDTokenizer3D, a generalist 3D VLM that performs a wide range of 3D scene understanding tasks while naturally supporting human interactions, thereby bridging language-level reasoning with 3D spatial understanding. The core of our approach is a novel three-stage scene tokenization pipeline built upon a Multi-Scale Normal Distributions Transform (NDT) representation, paired with a Multi-Scale NDT Decoder (MSDec). Specifically, NDTokenizer3D first constructs a multi-scale NDT representation from raw high-resolution point clouds, preserving both global context and fine-grained geometric details. Next, the MSDec progressively fuses cross-scale NDT features, producing holistic scene tokens consumable by LLM endpoints. Beyond tokenization, MSDec is repurposed as a general interface for human-interactive prompting (points, boxes, masks) and segmentation-mask decoding, unifying diverse 3D scene understanding tasks within a single architecture. With this compact and unified design, NDTokenizer3D offers a fine-grained, general-purpose 3D VLM, achieving remarkable improvements in 3D Referring Segmentation, 3D Visual Question Answering, and 3D Dense Captioning.

[146] When Robots Obey the Patch: Universal Transferable Patch Attacks on Vision-Language-Action Models

Hui Lu, Yi Yu, Yiming Yang, Chenyu Yi, Qixin Zhang, Bingquan Shen, Alex C. Kot, Xudong Jiang

Main category: cs.CV

TL;DR: UPA-RFAS is a universal adversarial patch framework that attacks Vision-Language-Action models across different architectures, finetuned variants, and sim-to-real scenarios through robust feature, attention, and semantic manipulation.

Details

Motivation: Current adversarial patches for VLA models overfit to specific models and fail in black-box settings, highlighting the need for universal and transferable attacks that work across unknown architectures and real-world conditions.

Method: UPA-RFAS combines feature-space objectives with L1 deviation prior and repulsive InfoNCE loss, uses a two-phase min-max procedure with invisible perturbations, and employs VLA-specific losses for attention hijacking and semantic misalignment.

Result: The method consistently transfers across diverse VLA models, manipulation tasks, and physical executions, demonstrating effectiveness in cross-model, cross-task, and cross-viewpoint scenarios.

Conclusion: UPA-RFAS exposes a practical patch-based attack surface for VLA-driven robots and establishes a strong baseline for future defense research against universal adversarial patches.

Abstract: Vision-Language-Action (VLA) models are vulnerable to adversarial attacks, yet universal and transferable attacks remain underexplored, as most existing patches overfit to a single model and fail in black-box settings. To address this gap, we present a systematic study of universal, transferable adversarial patches against VLA-driven robots under unknown architectures, finetuned variants, and sim-to-real shifts. We introduce UPA-RFAS (Universal Patch Attack via Robust Feature, Attention, and Semantics), a unified framework that learns a single physical patch in a shared feature space while promoting cross-model transfer. UPA-RFAS combines (i) a feature-space objective with an $\ell_1$ deviation prior and repulsive InfoNCE loss to induce transferable representation shifts, (ii) a robustness-augmented two-phase min-max procedure where an inner loop learns invisible sample-wise perturbations and an outer loop optimizes the universal patch against this hardened neighborhood, and (iii) two VLA-specific losses: Patch Attention Dominance to hijack text$\to$vision attention and Patch Semantic Misalignment to induce image-text mismatch without labels. Experiments across diverse VLA models, manipulation suites, and physical executions show that UPA-RFAS consistently transfers across models, tasks, and viewpoints, exposing a practical patch-based attack surface and establishing a strong baseline for future defenses.

[147] You Can Trust Your Clustering Model: A Parameter-free Self-Boosting Plug-in for Deep Clustering

Hanyang Li, Yuheng Jia, Hui Liu, Junhui Hou

Main category: cs.CV

TL;DR: DCBoost is a parameter-free plug-in that enhances deep clustering models by using reliable local structural cues to improve global feature structures, significantly boosting clustering performance.

Details

Motivation: Existing deep clustering methods suffer from disparity between local and global feature structures - local features show strong consistency but global features have intertwined boundaries and poor cluster separation.

Method: Uses adaptive k-nearest neighbors-based consistency filtering to identify high-confidence samples as trustworthy anchors, then computes a discriminative loss that promotes intra-class compactness and inter-class separability to guide network optimization.

Result: Significantly improves clustering performance across various benchmarks, boosting state-of-the-art baselines by over 3% and amplifying silhouette coefficient by over 7x.

Conclusion: DCBoost effectively enhances global feature structures in deep clustering models using reliable local cues, demonstrating substantial performance improvements without requiring additional parameters.

Abstract: Recent deep clustering models have produced impressive clustering performance. However, a common issue with existing methods is the disparity between global and local feature structures. While local structures typically show strong consistency and compactness within class samples, global features often present intertwined boundaries and poorly separated clusters. Motivated by this observation, we propose DCBoost, a parameter-free plug-in designed to enhance the global feature structures of current deep clustering models. By harnessing reliable local structural cues, our method aims to elevate clustering performance effectively. Specifically, we first identify high-confidence samples through adaptive $k$-nearest neighbors-based consistency filtering, aiming to select a sufficient number of samples with high label reliability to serve as trustworthy anchors for self-supervision. Subsequently, these samples are utilized to compute a discriminative loss, which promotes both intra-class compactness and inter-class separability, to guide network optimization. Extensive experiments across various benchmark datasets showcase that our DCBoost significantly improves the clustering performance of diverse existing deep clustering models. Notably, our method improves the performance of current state-of-the-art baselines (e.g., ProPos) by more than 3% and amplifies the silhouette coefficient by over $7\times$. Code is available at https://github.com/l-h-y168/DCBoost.

[148] BotaCLIP: Contrastive Learning for Botany-Aware Representation of Earth Observation Data

Selene Cerna, Sara Si-Moussi, Wilfried Thuiller, Hadrien Hendrikx, Vincent Miele

Main category: cs.CV

TL;DR: BotaCLIP adapts pre-trained foundation models for ecological applications by aligning aerial imagery with botanical data through contrastive learning, improving performance on biodiversity tasks without expensive retraining.

Details

Motivation: To adapt foundation models for domain-specific ecological knowledge without computational costs of full retraining, enabling effective biodiversity modeling in data-scarce settings.

Method: Lightweight multimodal contrastive framework that aligns high-resolution aerial imagery with botanical relevés, using regularization to prevent catastrophic forgetting of pre-trained knowledge.

Result: Consistent improvements over DOFA foundation model and supervised baselines in plant presence prediction, butterfly occurrence modeling, and soil trophic group abundance estimation.

Conclusion: Domain-aware adaptation of foundation models can effectively inject expert knowledge into data-scarce ecological applications, enabling frugal representation learning for biodiversity modeling.

Abstract: Foundation models have demonstrated a remarkable ability to learn rich, transferable representations across diverse modalities such as images, text, and audio. In modern machine learning pipelines, these representations often replace raw data as the primary input for downstream tasks. In this paper, we address the challenge of adapting a pre-trained foundation model to inject domain-specific knowledge, without retraining from scratch or incurring significant computational costs. To this end, we introduce BotaCLIP, a lightweight multimodal contrastive framework that adapts a pre-trained Earth Observation foundation model (DOFA) by aligning high-resolution aerial imagery with botanical relevés. Unlike generic embeddings, BotaCLIP internalizes ecological structure through contrastive learning with a regularization strategy that mitigates catastrophic forgetting. Once trained, the resulting embeddings serve as transferable representations for downstream predictors. Motivated by real-world applications in biodiversity modeling, we evaluated BotaCLIP representations in three ecological tasks: plant presence prediction, butterfly occurrence modeling, and soil trophic group abundance estimation. The results showed consistent improvements over those derived from DOFA and supervised baselines. More broadly, this work illustrates how domain-aware adaptation of foundation models can inject expert knowledge into data-scarce settings, enabling frugal representation learning.

[149] Towards an Effective Action-Region Tracking Framework for Fine-grained Video Action Recognition

Baoli Sun, Yihan Wang, Xinzhu Ma, Zhihui Wang, Kun Lu, Zhiyong Wang

Main category: cs.CV

TL;DR: The paper proposes ART framework for fine-grained action recognition using query-response mechanism to track distinctive local details across video frames, achieving state-of-the-art performance.

Details

Motivation: Current action recognition methods capture coarse-grained motion patterns but struggle to identify subtle details in local regions that evolve over time, which is crucial for distinguishing similar fine-grained actions.

Method: Uses Action-Region Tracking (ART) framework with region-specific semantic activation module, text-constrained queries from VLMs, action tracklets linking responses across frames, multi-level tracklet contrastive constraints, and task-specific fine-tuning.

Result: Comprehensive experiments on widely used action recognition benchmarks demonstrate superiority to previous state-of-the-art baselines.

Conclusion: The ART framework effectively captures and tracks distinctive local details for fine-grained action recognition through query-response mechanism and optimized tracklets.

Abstract: Fine-grained action recognition (FGAR) aims to identify subtle and distinctive differences among fine-grained action categories. However, current recognition methods often capture coarse-grained motion patterns but struggle to identify subtle details in local regions evolving over time. In this work, we introduce the Action-Region Tracking (ART) framework, a novel solution leveraging a query-response mechanism to discover and track the dynamics of distinctive local details, enabling effective distinction of similar actions. Specifically, we propose a region-specific semantic activation module that employs discriminative and text-constrained semantics as queries to capture the most action-related region responses in each video frame, facilitating interaction among spatial and temporal dimensions with corresponding video features. The captured region responses are organized into action tracklets, which characterize region-based action dynamics by linking related responses across video frames in a coherent sequence. The text-constrained queries encode nuanced semantic representations derived from textual descriptions of action labels extracted by language branches within Visual Language Models (VLMs). To optimize the action tracklets, we design a multi-level tracklet contrastive constraint among region responses at spatial and temporal levels, enabling effective discrimination within each frame and correlation between adjacent frames. Additionally, a task-specific fine-tuning mechanism refines textual semantics such that semantic representations encoded by VLMs are preserved while optimized for task preferences. Comprehensive experiments on widely used action recognition benchmarks demonstrate the superiority to previous state-of-the-art baselines.

[150] From Diffusion to One-Step Generation: A Comparative Study of Flow-Based Models with Application to Image Inpainting

Umang Agarwal, Rudraksh Sangore, Sumit Laddha

Main category: cs.CV

TL;DR: Comparative study of DDPM, CFM, and MeanFlow generative models showing CFM outperforms DDPM (FID 24.15 vs 402.98), while MeanFlow enables 50X faster single-step generation with FID 29.15. CFM extended to inpainting achieves significant quality improvements.

Details

Motivation: To compare different generative modeling paradigms and their sampling efficiency, particularly focusing on reducing inference time while maintaining generation quality.

Method: Implemented DDPM, CFM, and MeanFlow using unified TinyUNet architecture on CIFAR-10. Extended CFM to image inpainting with four mask types and mask-guided sampling.

Result: CFM achieved FID 24.15 (50 steps) vs DDPM’s 402.98. MeanFlow achieved FID 29.15 with single-step sampling. Inpainting improved PSNR from 4.95 to 8.57 dB (+73%) and SSIM from 0.289 to 0.418 (+45%).

Conclusion: CFM significantly outperforms DDPM in generation quality, while MeanFlow enables efficient single-step generation. Inpainting-aware training substantially improves reconstruction quality across various mask types.

Abstract: We present a comprehensive comparative study of three generative modeling paradigms: Denoising Diffusion Probabilistic Models (DDPM), Conditional Flow Matching (CFM), and MeanFlow. While DDPM and CFM require iterative sampling, MeanFlow enables direct one-step generation by modeling the average velocity over time intervals. We implement all three methods using a unified TinyUNet architecture (<1.5M parameters) on CIFAR-10, demonstrating that CFM achieves an FID of 24.15 with 50 steps, significantly outperforming DDPM (FID 402.98). MeanFlow achieves FID 29.15 with single-step sampling – a 50X reduction in inference time. We further extend CFM to image inpainting, implementing mask-guided sampling with four mask types (center, random bbox, irregular, half). Our fine-tuned inpainting model achieves substantial improvements: PSNR increases from 4.95 to 8.57 dB on center masks (+73%), and SSIM improves from 0.289 to 0.418 (+45%), demonstrating the effectiveness of inpainting-aware training.

[151] The More, the Merrier: Contrastive Fusion for Higher-Order Multimodal Alignment

Stefanos Koutoupis, Michaela Areti Zervou, Konstantinos Kontras, Maarten De Vos, Panagiotis Tsakalides, Grigorios Tsagatakis

Main category: cs.CV

TL;DR: ConFu is a multimodal learning framework that jointly embeds individual modalities and their fused combinations into a unified space, capturing both pairwise relationships and higher-order dependencies through contrastive learning.

Details

Motivation: Existing multimodal methods either focus on pairwise alignment or insufficiently preserve pairwise relationships when capturing higher-order interactions, limiting their effectiveness on single-modality tasks.

Method: Extends pairwise contrastive objectives with an additional fused-modality contrastive term that aligns modality pairs with a third modality, enabling capture of higher-order dependencies while maintaining pairwise correspondence.

Result: Competitive performance on retrieval and classification tasks across synthetic and real-world benchmarks, supporting unified one-to-one and two-to-one retrieval within a single framework.

Conclusion: ConFu successfully captures higher-order dependencies that cannot be recovered through pairwise alignment alone while maintaining strong pairwise correspondence, demonstrating scalability with increasing multimodal complexity.

Abstract: Learning joint representations across multiple modalities remains a central challenge in multimodal machine learning. Prevailing approaches predominantly operate in pairwise settings, aligning two modalities at a time. While some recent methods aim to capture higher-order interactions among multiple modalities, they often overlook or insufficiently preserve pairwise relationships, limiting their effectiveness on single-modality tasks. In this work, we introduce Contrastive Fusion (ConFu), a framework that jointly embeds both individual modalities and their fused combinations into a unified representation space, where modalities and their fused counterparts are aligned. ConFu extends traditional pairwise contrastive objectives with an additional fused-modality contrastive term, encouraging the joint embedding of modality pairs with a third modality. This formulation enables ConFu to capture higher-order dependencies, such as XOR-like relationships, that cannot be recovered through pairwise alignment alone, while still maintaining strong pairwise correspondence. We evaluate ConFu on synthetic and real-world multimodal benchmarks, assessing its ability to exploit cross-modal complementarity, capture higher-order dependencies, and scale with increasing multimodal complexity. Across these settings, ConFu demonstrates competitive performance on retrieval and classification tasks, while supporting unified one-to-one and two-to-one retrieval within a single contrastive framework.

[152] 3-Tracer: A Tri-level Temporal-Aware Framework for Audio Forgery Detection and Localization

Shuhan Xia, Xuannan Liu, Xing Cui, Peipei Li

Main category: cs.CV

TL;DR: T3-Tracer is a novel framework for detecting partial audio forgeries by jointly analyzing audio at frame, segment, and audio levels using two complementary modules to capture transient and sustained anomalies.

Details

Motivation: Partial audio forgeries selectively modify critical frames while maintaining overall authenticity, making them difficult to detect with existing methods that only analyze single frames independently without hierarchical temporal structure.

Method: T3-Tracer uses two core modules: FA-FAM for frame-level authenticity detection combining frame and audio temporal information, and SMDAM for segment-level forgery boundary detection using dual-branch multi-scale temporal analysis of frame features and inter-frame differences.

Result: Extensive experiments on three challenging datasets demonstrate state-of-the-art performance in detecting partial audio forgeries.

Conclusion: The hierarchical multi-level approach effectively captures both transient and sustained forgery anomalies across different temporal scales, providing comprehensive detection of partial audio manipulation.

Abstract: Recently, partial audio forgery has emerged as a new form of audio manipulation. Attackers selectively modify partial but semantically critical frames while preserving the overall perceptual authenticity, making such forgeries particularly difficult to detect. Existing methods focus on independently detecting whether a single frame is forged, lacking the hierarchical structure to capture both transient and sustained anomalies across different temporal levels. To address these limitations, We identify three key levels relevant to partial audio forgery detection and present T3-Tracer, the first framework that jointly analyzes audio at the frame, segment, and audio levels to comprehensively detect forgery traces. T3-Tracer consists of two complementary core modules: the Frame-Audio Feature Aggregation Module (FA-FAM) and the Segment-level Multi-Scale Discrepancy-Aware Module (SMDAM). FA-FAM is designed to detect the authenticity of each audio frame. It combines both frame-level and audio-level temporal information to detect intra-frame forgery cues and global semantic inconsistencies. To further refine and correct frame detection, we introduce SMDAM to detect forgery boundaries at the segment level. It adopts a dual-branch architecture that jointly models frame features and inter-frame differences across multi-scale temporal windows, effectively identifying abrupt anomalies that appeared on the forged boundaries. Extensive experiments conducted on three challenging datasets demonstrate that our approach achieves state-of-the-art performance.

[153] Hybrid SIFT-SNN for Efficient Anomaly Detection of Traffic Flow-Control Infrastructure

Munish Rathee, Boris Bačić, Maryam Doborjeh

Main category: cs.CV

TL;DR: SIFT-SNN framework combines SIFT features with SNN for real-time structural anomaly detection, achieving 92.3% accuracy with 9.5ms latency and low power consumption.

Details

Motivation: Need for low-latency, low-power real-time detection of structural anomalies in transport infrastructure that preserves spatial feature grounding and operates efficiently on embedded hardware.

Method: Hybrid pipeline integrating SIFT for spatial feature encoding, latency-driven spike conversion layer, and LIF Spiking Neural Network for classification, tested on Auckland Harbour Bridge dataset with 6,000 labeled frames.

Result: 92.3% classification accuracy with 9.5ms per-frame inference time and 8.1% sparse spike activity, enabling real-time edge deployment.

Conclusion: SIFT-SNN framework provides efficient, interpretable structural safety monitoring with validated prototype, though generalization to unseen conditions needs further validation.

Abstract: This paper presents the SIFT-SNN framework, a low-latency neuromorphic signal-processing pipeline for real-time detection of structural anomalies in transport infrastructure. The proposed approach integrates Scale-Invariant Feature Transform (SIFT) for spatial feature encoding with a latency-driven spike conversion layer and a Leaky Integrate-and-Fire (LIF) Spiking Neural Network (SNN) for classification. The Auckland Harbour Bridge dataset is recorded under various weather and lighting conditions, comprising 6,000 labelled frames that include both real and synthetically augmented unsafe cases. The presented system achieves a classification accuracy of 92.3% (+- 0.8%) with a per-frame inference time of 9.5 ms. Achieved sub-10 millisecond latency, combined with sparse spike activity (8.1%), enables real-time, low-power edge deployment. Unlike conventional CNN-based approaches, the hybrid SIFT-SNN pipeline explicitly preserves spatial feature grounding, enhances interpretability, supports transparent decision-making, and operates efficiently on embedded hardware. Although synthetic augmentation improved robustness, generalisation to unseen field conditions remains to be validated. The SIFT-SNN framework is validated through a working prototype deployed on a consumer-grade system and framed as a generalisable case study in structural safety monitoring for movable concrete barriers, which, as a traffic flow-control infrastructure, is deployed in over 20 cities worldwide.

[154] FIELDS: Face reconstruction with accurate Inference of Expression using Learning with Direct Supervision

Chen Ling, Henglin Shi, Hedvig Kjellström

Main category: cs.CV

TL;DR: FIELDS improves 3D face reconstruction by using direct 3D expression supervision from 4D scans and emotion recognition to capture subtle emotional details that existing methods miss.

Details

Motivation: Existing 3D face reconstruction methods fail to capture subtle emotional details due to reliance on 2D supervision and lack of 3D ground truth, limiting their ability to convey authentic emotional information.

Method: Extends self-supervised 2D image consistency with direct 3D expression parameter supervision from spontaneous 4D facial scans and an auxiliary emotion recognition branch with intensity-aware emotion loss.

Result: Produces emotion-rich 3D face models with highly realistic expressions from single images, significantly improving facial expression recognition performance while maintaining naturalness.

Conclusion: The dual-supervision strategy bridges the 2D/3D domain gap and mitigates expression-intensity bias, enabling high-fidelity 3D reconstructions that preserve subtle emotional cues.

Abstract: Facial expressions convey the bulk of emotional information in human communication, yet existing 3D face reconstruction methods often miss subtle affective details due to reliance on 2D supervision and lack of 3D ground truth. We propose FIELDS (Face reconstruction with accurate Inference of Expression using Learning with Direct Supervision) to address these limitations by extending self-supervised 2D image consistency cues with direct 3D expression parameter supervision and an auxiliary emotion recognition branch. Our encoder is guided by authentic expression parameters from spontaneous 4D facial scans, while an intensity-aware emotion loss encourages the 3D expression parameters to capture genuine emotion content without exaggeration. This dual-supervision strategy bridges the 2D/3D domain gap and mitigates expression-intensity bias, yielding high-fidelity 3D reconstructions that preserve subtle emotional cues. From a single image, FIELDS produces emotion-rich face models with highly realistic expressions, significantly improving in-the-wild facial expression recognition performance without sacrificing naturalness.

[155] SurgMLLMBench: A Multimodal Large Language Model Benchmark Dataset for Surgical Scene Understanding

Tae-Min Choi, Tae Kyeong Jeong, Garam Kim, Jaemin Lee, Yeongyoon Koh, In Cheul Choi, Jae-Ho Chung, Jong Woong Park, Juyoun Park

Main category: cs.CV

TL;DR: SurgMLLMBench is a unified multimodal benchmark for surgical scene understanding that integrates pixel-level segmentation and VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains under a unified taxonomy.

Details

Motivation: Existing surgical datasets use heterogeneous taxonomies and lack pixel-level segmentation support, limiting consistent evaluation and applicability of multimodal LLMs in surgical applications.

Method: Created SurgMLLMBench with the MAVIS dataset, integrating pixel-level instrument segmentation masks and structured VQA annotations across multiple surgical domains under a unified taxonomy.

Result: A single model trained on SurgMLLMBench achieves consistent performance across domains and generalizes effectively to unseen datasets, enabling comprehensive evaluation beyond traditional VQA tasks.

Conclusion: SurgMLLMBench serves as a robust resource to advance multimodal surgical AI research, supporting reproducible evaluation and development of interactive surgical reasoning models.

Abstract: Recent advances in multimodal large language models (LLMs) have highlighted their potential for medical and surgical applications. However, existing surgical datasets predominantly adopt a Visual Question Answering (VQA) format with heterogeneous taxonomies and lack support for pixel-level segmentation, limiting consistent evaluation and applicability. We present SurgMLLMBench, a unified multimodal benchmark explicitly designed for developing and evaluating interactive multimodal LLMs for surgical scene understanding, including the newly collected Micro-surgical Artificial Vascular anastomosIS (MAVIS) dataset. It integrates pixel-level instrument segmentation masks and structured VQA annotations across laparoscopic, robot-assisted, and micro-surgical domains under a unified taxonomy, enabling comprehensive evaluation beyond traditional VQA tasks and richer visual-conversational interactions. Extensive baseline experiments show that a single model trained on SurgMLLMBench achieves consistent performance across domains and generalizes effectively to unseen datasets. SurgMLLMBench will be publicly released as a robust resource to advance multimodal surgical AI research, supporting reproducible evaluation and development of interactive surgical reasoning models.

[156] Shift-Equivariant Complex-Valued Convolutional Neural Networks

Quentin Gabot, Teck-Yian Lim, Jérémy Fix, Joana Frontera-Pons, Chengfang Ren, Jean-Philippe Ovarlez

Main category: cs.CV

TL;DR: Extends Learnable Polyphase Sampling (LPS) to complex-valued neural networks to achieve shift equivariance and invariance, with a new projection layer from complex to real space before Gumbel Softmax.

Details

Motivation: Traditional CNNs lack shift equivariance and invariance due to downsampling/upsampling operations. While data augmentation helps empirically, a systematic theoretical approach is needed to guarantee these properties.

Method: Extends LPS to complex-valued networks with a novel projection layer from C to R before Gumbel Softmax, maintaining theoretical guarantees for shift equivariance and invariance.

Result: Evaluated on computer vision tasks using polarimetric SAR images, achieving shift invariance in classification and shift equivariance in reconstruction and semantic segmentation.

Conclusion: Successfully extends LPS framework to complex-valued neural networks, providing theoretical guarantees for shift equivariance and invariance while maintaining performance on real-world vision tasks.

Abstract: Convolutional neural networks have shown remarkable performance in recent years on various computer vision problems. However, the traditional convolutional neural network architecture lacks a critical property: shift equivariance and invariance, broken by downsampling and upsampling operations. Although data augmentation techniques can help the model learn the latter property empirically, a consistent and systematic way to achieve this goal is by designing downsampling and upsampling layers that theoretically guarantee these properties by construction. Adaptive Polyphase Sampling (APS) introduced the cornerstone for shift invariance, later extended to shift equivariance with Learnable Polyphase up/downsampling (LPS) applied to real-valued neural networks. In this paper, we extend the work on LPS to complex-valued neural networks both from a theoretical perspective and with a novel building block of a projection layer from $\mathbb{C}$ to $\mathbb{R}$ before the Gumbel Softmax. We finally evaluate this extension on several computer vision problems, specifically for either the invariance property in classification tasks or the equivariance property in both reconstruction and semantic segmentation problems, using polarimetric Synthetic Aperture Radar images.

[157] AVFakeBench: A Comprehensive Audio-Video Forgery Detection Benchmark for AV-LMMs

Shuhan Xia, Peipei Li, Xuannan Liu, Dongsen Zhang, Xinyu Guo, Zekun Li

Main category: cs.CV

TL;DR: AVFakeBench is the first comprehensive audio-video forgery detection benchmark covering diverse forgery types beyond human-centric deepfakes, with 12K questions across 7 forgery types and 4 annotation levels.

Details

Motivation: Existing benchmarks are limited to DeepFake-based forgeries and single-granularity annotations, failing to capture the diversity and complexity of real-world audio-video forgery scenarios.

Method: Proposed a multi-stage hybrid forgery framework integrating proprietary models for task planning with expert generative models for precise manipulation. Created AVFakeBench with 12K audio-video questions covering 7 forgery types and 4 annotation levels.

Result: Evaluated 11 Audio-Video Large Language Models and 2 detection methods, showing AV-LMMs’ potential as emerging forgery detectors but revealing weaknesses in fine-grained perception and reasoning.

Conclusion: AVFakeBench addresses the gap in comprehensive audio-video forgery detection benchmarks and demonstrates both the promise and limitations of current AV-LMMs for forgery detection tasks.

Abstract: The threat of Audio-Video (AV) forgery is rapidly evolving beyond human-centric deepfakes to include more diverse manipulations across complex natural scenes. However, existing benchmarks are still confined to DeepFake-based forgeries and single-granularity annotations, thus failing to capture the diversity and complexity of real-world forgery scenarios. To address this, we introduce AVFakeBench, the first comprehensive audio-video forgery detection benchmark that spans rich forgery semantics across both human subject and general subject. AVFakeBench comprises 12K carefully curated audio-video questions, covering seven forgery types and four levels of annotations. To ensure high-quality and diverse forgeries, we propose a multi-stage hybrid forgery framework that integrates proprietary models for task planning with expert generative models for precise manipulation. The benchmark establishes a multi-task evaluation framework covering binary judgment, forgery types classification, forgery detail selection, and explanatory reasoning. We evaluate 11 Audio-Video Large Language Models (AV-LMMs) and 2 prevalent detection methods on AVFakeBench, demonstrating the potential of AV-LMMs as emerging forgery detectors while revealing their notable weaknesses in fine-grained perception and reasoning.

[158] LaGen: Towards Autoregressive LiDAR Scene Generation

Sizhuo Zhou, Xiaosong Jia, Fanrui Zhang, Junjie Li, Juyong Zhang, Yukang Feng, Jianwen Sun, Songbur Wong, Junqi You, Junchi Yan

Main category: cs.CV

TL;DR: LaGen is the first framework for frame-by-frame autoregressive generation of long-horizon LiDAR scenes, enabling interactive generation from single-frame input using bounding box conditions.

Details

Motivation: Existing LiDAR generation methods only support single frame generation, while prediction approaches require multiple historical frames and lack interactivity, failing to support long-horizon interactive generation.

Method: LaGen uses a single-frame LiDAR input with bounding box conditions, includes scene decoupling estimation for object-level interactive generation, and noise modulation to reduce error accumulation in long-horizon generation.

Result: LaGen outperforms state-of-the-art LiDAR generation and prediction models, especially on later frames, as demonstrated on nuScenes-based evaluation protocol.

Conclusion: LaGen successfully enables long-horizon interactive LiDAR scene generation with high fidelity, addressing limitations of existing approaches.

Abstract: Generative world models for autonomous driving (AD) have become a trending topic. Unlike the widely studied image modality, in this work we explore generative world models for LiDAR data. Existing generation methods for LiDAR data only support single frame generation, while existing prediction approaches require multiple frames of historical input and can only deterministically predict multiple frames at once, lacking interactivity. Both paradigms fail to support long-horizon interactive generation. To this end, we introduce LaGen, which to the best of our knowledge is the first framework capable of frame-by-frame autoregressive generation of long-horizon LiDAR scenes. LaGen is able to take a single-frame LiDAR input as a starting point and effectively utilize bounding box information as conditions to generate high-fidelity 4D scene point clouds. In addition, we introduce a scene decoupling estimation module to enhance the model’s interactive generation capability for object-level content, as well as a noise modulation module to mitigate error accumulation during long-horizon generation. We construct a protocol based on nuScenes for evaluating long-horizon LiDAR scene generation. Experimental results comprehensively demonstrate LaGen outperforms state-of-the-art LiDAR generation and prediction models, especially on the later frames.

[159] Unlocking Zero-shot Potential of Semi-dense Image Matching via Gaussian Splatting

Juncheng Chen, Chao Xu, Yanjun Cao

Main category: cs.CV

TL;DR: MatchGS corrects 3DGS geometric inaccuracies to generate precise correspondence labels for training robust zero-shot image matchers, achieving up to 17.7% performance gains.

Details

Motivation: Learning-based image matching needs large, diverse, and geometrically accurate training data. 3DGS enables photorealistic novel-view synthesis but has geometric inaccuracies that prevent robust correspondence labeling.

Method: Twofold approach: (1) geometrically-faithful data generation pipeline that refines 3DGS geometry for precise correspondence labels, (2) 2D-3D representation alignment strategy that infuses 3DGS’ explicit 3D knowledge into 2D matchers.

Result: Generated ground-truth correspondences reduce epipolar error by up to 40x, enable supervision under extreme viewpoint changes, and provide self-supervisory signals. Matchers trained on this data achieve up to 17.7% zero-shot performance gains on public benchmarks.

Conclusion: With proper geometric refinement, 3DGS can serve as a scalable, high-fidelity, and structurally-rich data source for training robust zero-shot image matchers.

Abstract: Learning-based image matching critically depends on large-scale, diverse, and geometrically accurate training data. 3D Gaussian Splatting (3DGS) enables photorealistic novel-view synthesis and thus is attractive for data generation. However, its geometric inaccuracies and biased depth rendering currently prevent robust correspondence labeling. To address this, we introduce MatchGS, the first framework designed to systematically correct and leverage 3DGS for robust, zero-shot image matching. Our approach is twofold: (1) a geometrically-faithful data generation pipeline that refines 3DGS geometry to produce highly precise correspondence labels, enabling the synthesis of a vast and diverse range of viewpoints without compromising rendering fidelity; and (2) a 2D-3D representation alignment strategy that infuses 3DGS’ explicit 3D knowledge into the 2D matcher, guiding 2D semi-dense matchers to learn viewpoint-invariant 3D representations. Our generated ground-truth correspondences reduce the epipolar error by up to 40 times compared to existing datasets, enable supervision under extreme viewpoint changes, and provide self-supervisory signals through Gaussian attributes. Consequently, state-of-the-art matchers trained solely on our data achieve significant zero-shot performance gains on public benchmarks, with improvements of up to 17.7%. Our work demonstrates that with proper geometric refinement, 3DGS can serve as a scalable, high-fidelity, and structurally-rich data source, paving the way for a new generation of robust zero-shot image matchers.

[160] Monet: Reasoning in Latent Visual Space Beyond Images and Language

Qixun Wang, Yang Shi, Yifei Wang, Yuanxing Zhang, Pengfei Wan, Kun Gai, Xianghua Ying, Yisen Wang

Main category: cs.CV

TL;DR: Monet is a training framework that enables MLLMs to reason directly in latent visual space using continuous embeddings as visual thoughts, addressing computational cost and supervision challenges through a three-stage SFT pipeline and VLPO reinforcement learning.

Details

Motivation: Existing visual reasoning methods lack human-like abstract visual thinking due to limitations of external tools, creating a need for direct reasoning in latent visual space.

Method: Three-stage distillation-based SFT pipeline with VLPO (Visual-latent Policy Optimization) reinforcement learning, using Monet-SFT-125K dataset of text-image interleaved CoTs.

Result: Monet-7B shows consistent gains across perception and reasoning benchmarks with strong out-of-distribution generalization on abstract visual reasoning tasks.

Conclusion: The framework successfully enables latent visual reasoning in MLLMs and provides insights for future developments in this area.

Abstract: “Thinking with images” has emerged as an effective paradigm for advancing visual reasoning, extending beyond text-only chains of thought by injecting visual evidence into intermediate reasoning steps. However, existing methods fall short of human-like abstract visual thinking, as their flexibility is fundamentally limited by external tools. In this work, we introduce Monet, a training framework that enables multimodal large language models (MLLMs) to reason directly within the latent visual space by generating continuous embeddings that function as intermediate visual thoughts. We identify two core challenges in training MLLMs for latent visual reasoning: high computational cost in latent-vision alignment and insufficient supervision over latent embeddings, and address them with a three-stage distillation-based supervised fine-tuning (SFT) pipeline. We further reveal a limitation of applying GRPO to latent reasoning: it primarily enhances text-based reasoning rather than latent reasoning. To overcome this, we propose VLPO (Visual-latent Policy Optimization), a reinforcement learning method that explicitly incorporates latent embeddings into policy gradient updates. To support SFT, we construct Monet-SFT-125K, a high-quality text-image interleaved CoT dataset containing 125K real-world, chart, OCR, and geometry CoTs. Our model, Monet-7B, shows consistent gains across real-world perception and reasoning benchmarks and exhibits strong out-of-distribution generalization on challenging abstract visual reasoning tasks. We also empirically analyze the role of each training component and discuss our early unsuccessful attempts, providing insights for future developments in visual latent reasoning. Our model, data, and code are available at https://github.com/NOVAglow646/Monet.

[161] Co-Training Vision Language Models for Remote Sensing Multi-task Learning

Qingyun Li, Shuran Ma, Junwei Luo, Yi Yu, Yue Zhou, Fengxiang Wang, Xudong Lu, Xiaoxing Wang, Xin He, Yushi Chen, Xue Yang, Junchi Yan

Main category: cs.CV

TL;DR: RSCoVLM is a vision-language model baseline for remote sensing multi-task learning that addresses diverse image scales, computational burdens, and achieves state-of-the-art performance across various tasks.

Details

Motivation: To create a unified model for multiple remote sensing tasks through multi-task learning, leveraging vision-language models' text-based interface and addressing challenges in RS data environment and diverse image scales.

Method: Proposes data curation engine for vision-language conversations, unified dynamic-resolution strategy for diverse image scales, Zoom-in Chain mechanism for ultra-high-resolution images, and enhanced object detection capability with novel evaluation protocol.

Result: Achieves state-of-the-art performance across diverse tasks, outperforming existing RS VLMs and rivaling specialized expert models.

Conclusion: RSCoVLM serves as a strong baseline that promotes progress toward general-purpose RS models, with all tools, models, and datasets open-sourced for reproducibility.

Abstract: With Transformers achieving outstanding performance on individual remote sensing (RS) tasks, we are now approaching the realization of a unified model that excels across multiple tasks through multi-task learning (MTL). Compared to single-task approaches, MTL methods offer improved generalization, enhanced scalability, and greater practical applicability. Recently, vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning, respectively. Moreover, the unified text-based interface demonstrates significant potential for MTL. Hence, in this work, we present RSCoVLM, a simple yet flexible VLM baseline for RS MTL. Firstly, we create the data curation engine, including data acquisition, offline processing and integrating, as well as online loading and weighting. This data engine effectively addresses complex RS data enviroment and generates flexible vision-language conversations. Furthermore, we propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery. For UHR images, we introduce the Zoom-in Chain mechanism together with its corresponding dataset, LRS-VQA-Zoom. The strategies are flexible and effectively mitigate the computational burdens. Additionally, we significantly enhance the model’s object detection capability and propose a novel evaluation protocol that ensures fair comparison between VLMs and conventional detection models. Extensive experiments demonstrate that RSCoVLM achieves state-of-the-art performance across diverse tasks, outperforming existing RS VLMs and even rivaling specialized expert models. All the training and evaluating tools, model weights, and datasets have been fully open-sourced to support reproducibility. We expect that this baseline will promote further progress toward general-purpose RS models.

[162] PathMamba: A Hybrid Mamba-Transformer for Topologically Coherent Road Segmentation in Satellite Imagery

Jules Decaestecker, Nicolas Vigne

Main category: cs.CV

TL;DR: PathMamba is a hybrid architecture combining Mamba’s linear-time efficiency for modeling continuous road structures with Transformer’s global reasoning, achieving state-of-the-art road segmentation with superior topological continuity.

Details

Motivation: Current Vision Transformers for road segmentation have quadratic complexity that limits deployment on resource-constrained platforms, while road networks inherently require modeling of long, continuous structures.

Method: Hybrid architecture integrating Mamba blocks for sequential modeling of continuous road networks and Transformer blocks for global context refinement, combining their complementary strengths.

Result: Achieves new state-of-the-art on DeepGlobe Road Extraction and Massachusetts Roads datasets, significantly improving topological continuity (APLS metric) while remaining computationally competitive.

Conclusion: PathMamba demonstrates that combining Mamba’s efficient sequential modeling with Transformer’s global reasoning yields topologically superior road segmentation without prohibitive scaling costs.

Abstract: Achieving both high accuracy and topological continuity in road segmentation from satellite imagery is a critical goal for applications ranging from urban planning to disaster response. State-of-the-art methods often rely on Vision Transformers, which excel at capturing global context, yet their quadratic complexity is a significant barrier to efficient deployment, particularly for on-board processing in resource-constrained platforms. In contrast, emerging State Space Models like Mamba offer linear-time efficiency and are inherently suited to modeling long, continuous structures. We posit that these architectures have complementary strengths. To this end, we introduce PathMamba, a novel hybrid architecture that integrates Mamba’s sequential modeling with the Transformer’s global reasoning. Our design strategically uses Mamba blocks to trace the continuous nature of road networks, preserving topological structure, while integrating Transformer blocks to refine features with global context. This approach yields topologically superior segmentation maps without the prohibitive scaling costs of pure attention-based models. Our experiments on the DeepGlobe Road Extraction and Massachusetts Roads datasets demonstrate that PathMamba sets a new state-of-the-art. Notably, it significantly improves topological continuity, as measured by the APLS metric, setting a new benchmark while remaining computationally competitive.

[163] SAM Guided Semantic and Motion Changed Region Mining for Remote Sensing Change Captioning

Futian Wang, Mengqi Wang, Xiao Wang, Haowen Wang, Jin Tang

Main category: cs.CV

TL;DR: This paper proposes a novel remote sensing change captioning method that uses SAM foundation model for region-level representation and knowledge graphs to enhance change detection and description accuracy.

Details

Motivation: Existing methods have weak region awareness and limited temporal alignment in remote sensing change captioning, which aims to describe changes between two images in natural language.

Method: Uses CNN/Transformer for global features, SAM foundation model to delineate semantic/motion change regions, knowledge graphs for object information, and cross-attention fusion with Transformer decoder for caption generation.

Result: Achieves state-of-the-art performance across multiple benchmark datasets.

Conclusion: The proposed method effectively addresses region awareness and temporal alignment issues in remote sensing change captioning by leveraging SAM foundation model and knowledge graphs.

Abstract: Remote sensing change captioning is an emerging and popular research task that aims to describe, in natural language, the content of interest that has changed between two remote sensing images captured at different times. Existing methods typically employ CNNs/Transformers to extract visual representations from the given images or incorporate auxiliary tasks to enhance the final results, with weak region awareness and limited temporal alignment. To address these issues, this paper explores the use of the SAM (Segment Anything Model) foundation model to extract region-level representations and inject region-of-interest knowledge into the captioning framework. Specifically, we employ a CNN/Transformer model to extract global-level vision features, leverage the SAM foundation model to delineate semantic- and motion-level change regions, and utilize a specially constructed knowledge graph to provide information about objects of interest. These heterogeneous sources of information are then fused via cross-attention, and a Transformer decoder is used to generate the final natural language description of the observed changes. Extensive experimental results demonstrate that our method achieves state-of-the-art performance across multiple widely used benchmark datasets. The source code of this paper will be released on https://github.com/Event-AHU/SAM_ChangeCaptioning

[164] CaliTex: Geometry-Calibrated Attention for View-Coherent 3D Texture Generation

Chenyu Liu, Hongze Chen, Jingzhi Bao, Lingting Zhu, Runze Zhang, Weikai Chen, Zeyu Hu, Yingda Yin, Keyang Luo, Xin Wang

Main category: cs.CV

TL;DR: CaliTex introduces geometry-calibrated attention to solve cross-view inconsistency in 3D texture generation by enforcing spatial alignment and geometry-conditioned routing.

Details

Motivation: Current 3D texture generation suffers from cross-view inconsistency due to attention ambiguity, where unstructured full attention causes geometric confusion and unstable appearance-structure coupling.

Method: Two modules: Part-Aligned Attention for spatial alignment across semantically matched parts, and Condition-Routed Attention for routing appearance information through geometry-conditioned pathways. Uses two-stage diffusion transformer.

Result: Produces seamless and view-consistent textures, outperforming both open-source and commercial baselines.

Conclusion: CaliTex makes geometric coherence an inherent behavior of the network rather than a byproduct of optimization.

Abstract: Despite major advances brought by diffusion-based models, current 3D texture generation systems remain hindered by cross-view inconsistency – textures that appear convincing from one viewpoint often fail to align across others. We find that this issue arises from attention ambiguity, where unstructured full attention is applied indiscriminately across tokens and modalities, causing geometric confusion and unstable appearance-structure coupling. To address this, we introduce CaliTex, a framework of geometry-calibrated attention that explicitly aligns attention with 3D structure. It introduces two modules: Part-Aligned Attention that enforces spatial alignment across semantically matched parts, and Condition-Routed Attention which routes appearance information through geometry-conditioned pathways to maintain spatial fidelity. Coupled with a two-stage diffusion transformer, CaliTex makes geometric coherence an inherent behavior of the network rather than a byproduct of optimization. Empirically, CaliTex produces seamless and view-consistent textures and outperforms both open-source and commercial baselines.

[165] From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings

Jiajie Zhang, Sören Schwertfeger, Alexander Kleiner

Main category: cs.CV

TL;DR: Unsupervised framework for extracting structured VLA pre-training data from industrial videos using motion tokenization and latent action energy-based segmentation.

Details

Motivation: To unlock vast unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action model pre-training in manufacturing settings.

Method: Trains lightweight motion tokenizer to encode motion dynamics, then uses unsupervised action segmenter with novel “Latent Action Energy” metric to discover semantically coherent action primitives.

Result: Effective segmentation of key tasks on public benchmarks and proprietary electric motor assembly dataset, with semantic coherence confirmed through clustering and VLM assessment.

Conclusion: First fully automated end-to-end system for extracting VLA pre-training data from unstructured industrial videos, providing scalable solution for embodied AI in manufacturing.

Abstract: We present a novel unsupervised framework to unlock vast unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action (VLA) model pre-training. Our method first trains a lightweight motion tokenizer to encode motion dynamics, then employs an unsupervised action segmenter leveraging a novel “Latent Action Energy” metric to discover and segment semantically coherent action primitives. The pipeline outputs both segmented video clips and their corresponding latent action sequences, providing structured data directly suitable for VLA pre-training. Evaluations on public benchmarks and a proprietary electric motor assembly dataset demonstrate effective segmentation of key tasks performed by humans at workstations. Further clustering and quantitative assessment via a Vision-Language Model confirm the semantic coherence of the discovered action primitives. To our knowledge, this is the first fully automated end-to-end system for extracting and organizing VLA pre-training data from unstructured industrial videos, offering a scalable solution for embodied AI integration in manufacturing.

[166] HTTM: Head-wise Temporal Token Merging for Faster VGGT

Weitian Wang, Lukas Meiner, Rai Shubham, Cecilia De La Parra, Akash Kumar

Main category: cs.CV

TL;DR: HTTM is a training-free 3D token merging method that accelerates VGGT by merging tokens at multi-head granularity, preserving feature uniqueness while achieving up to 7x speedup with minimal performance loss.

Details

Motivation: VGGT's joint inference of 3D attributes requires global attention layers with all-to-all computation, causing significant latency bottlenecks for large scenes with long-sequence inputs.

Method: Head-wise temporal merging (HTTM) merges tokens in multi-head granularity instead of uniform merging across heads, leveraging spatial locality and temporal correspondence at head level for higher merging ratios with lower costs.

Result: HTTM achieves up to 7x acceleration with negligible performance drops in GPU-based inference compared to existing merging techniques.

Conclusion: HTTM effectively addresses the latency bottleneck in VGGT while maintaining model representational ability through head-wise token merging.

Abstract: The Visual Geometry Grounded Transformer (VGGT) marks a significant leap forward in 3D scene reconstruction, as it is the first model that directly infers all key 3D attributes (camera poses, depths, and dense geometry) jointly in one pass. However, this joint inference mechanism requires global attention layers that perform all-to-all attention computation on tokens from all views. For reconstruction of large scenes with long-sequence inputs, this causes a significant latency bottleneck. In this paper, we propose head-wise temporal merging (HTTM), a training-free 3D token merging method for accelerating VGGT. Existing merging techniques merge tokens uniformly across different attention heads, resulting in identical tokens in the layers’ output, which hinders the model’s representational ability. HTTM tackles this problem by merging tokens in multi-head granularity, which preserves the uniqueness of feature tokens after head concatenation. Additionally, this enables HTTM to leverage the spatial locality and temporal correspondence observed at the head level to achieve higher merging ratios with lower merging costs compared to existing methods. Thus, HTTM achieves up to 7x acceleration with negligible performance drops in a GPU-based inference.

[167] EvRainDrop: HyperGraph-guided Completion for Effective Frame and Event Stream Aggregation

Futian Wang, Fan Zhang, Xiao Wang, Mengqi Wang, Dexing Huang, Jin Tang

Main category: cs.CV

TL;DR: Proposes a hypergraph-guided spatio-temporal event stream completion method to address spatial sparsity in event cameras by connecting event tokens across time and space via hypergraphs, enabling multi-modal feature fusion.

Details

Motivation: Event cameras produce spatially sparse but temporally dense data, and existing methods using event frames/voxels struggle with undersampling from spatial sparsity.

Method: Uses hypergraphs to connect event tokens across different times and spatial locations, enabling contextual message passing for event completion. Can incorporate RGB tokens for multi-modal hypergraph-based information completion, followed by self-attention for temporal aggregation.

Result: Extensive experiments on single- and multi-label event classification tasks fully validated the effectiveness of the proposed framework.

Conclusion: The hypergraph-guided completion mechanism successfully addresses spatial sparsity in event streams and enables effective multi-modal feature learning and fusion.

Abstract: Event cameras produce asynchronous event streams that are spatially sparse yet temporally dense. Mainstream event representation learning algorithms typically use event frames, voxels, or tensors as input. Although these approaches have achieved notable progress, they struggle to address the undersampling problem caused by spatial sparsity. In this paper, we propose a novel hypergraph-guided spatio-temporal event stream completion mechanism, which connects event tokens across different times and spatial locations via hypergraphs and leverages contextual information message passing to complete these sparse events. The proposed method can flexibly incorporate RGB tokens as nodes in the hypergraph within this completion framework, enabling multi-modal hypergraph-based information completion. Subsequently, we aggregate hypergraph node information across different time steps through self-attention, enabling effective learning and fusion of multi-modal features. Extensive experiments on both single- and multi-label event classification tasks fully validated the effectiveness of our proposed framework. The source code of this paper will be released on https://github.com/Event-AHU/EvRainDrop.

[168] PFF-Net: Patch Feature Fitting for Point Cloud Normal Estimation

Qing Li, Huifang Feng, Kanle Shi, Yue Gao, Yi Fang, Yu-Shen Liu, Zhizhong Han

Main category: cs.CV

TL;DR: A novel feature extraction method for robust normal estimation in point clouds using multi-scale feature fusion to address patch size selection issues.

Details

Motivation: Existing methods struggle with selecting appropriate neighborhood sizes for different point cloud data and geometries, leading to inaccurate normal estimation.

Method: Proposes Patch Feature Fitting (PFF) with multi-scale feature aggregation and cross-scale feature compensation modules to approximate optimal geometric descriptions.

Result: Achieves state-of-the-art performance on synthetic and real-world datasets with fewer parameters and reduced running time.

Conclusion: The multi-scale feature fusion approach enables scale adaptation and delivers optimal feature descriptions for robust normal estimation across various point clouds.

Abstract: Estimating the normal of a point requires constructing a local patch to provide center-surrounding context, but determining the appropriate neighborhood size is difficult when dealing with different data or geometries. Existing methods commonly employ various parameter-heavy strategies to extract a full feature description from the input patch. However, they still have difficulties in accurately and efficiently predicting normals for various point clouds. In this work, we present a new idea of feature extraction for robust normal estimation of point clouds. We use the fusion of multi-scale features from different neighborhood sizes to address the issue of selecting reasonable patch sizes for various data or geometries. We seek to model a patch feature fitting (PFF) based on multi-scale features to approximate the optimal geometric description for normal estimation and implement the approximation process via multi-scale feature aggregation and cross-scale feature compensation. The feature aggregation module progressively aggregates the patch features of different scales to the center of the patch and shrinks the patch size by removing points far from the center. It not only enables the network to precisely capture the structure characteristic in a wide range, but also describes highly detailed geometries. The feature compensation module ensures the reusability of features from earlier layers of large scales and reveals associated information in different patch sizes. Our approximation strategy based on aggregating the features of multiple scales enables the model to achieve scale adaptation of varying local patches and deliver the optimal feature description. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both synthetic and real-world datasets with fewer network parameters and running time.

[169] Endo-G$^{2}$T: Geometry-Guided & Temporally Aware Time-Embedded 4DGS For Endoscopic Scenes

Yangle Liu, Fengze Li, Kan Liu, Jieming Ma

Main category: cs.CV

TL;DR: Endo-G²T is a geometry-guided training scheme for 4D Gaussian splatting that addresses view-dependent effects in endoscopic videos by using monocular depth priors, temporal consistency modeling, and keyframe-constrained streaming.

Details

Motivation: Endoscopic videos suffer from strong view-dependent effects like specularities and occlusions, causing photometric supervision to misalign with geometry and trigger early geometric drift that becomes hard to correct during densification.

Method: Three key components: 1) Geo-guided prior distillation using confidence-gated monocular depth with scale-invariant losses, 2) Time-embedded Gaussian field with rotor-like rotation for temporal coherence, 3) Keyframe-constrained streaming with max-points budget for efficiency and stability.

Result: Achieves state-of-the-art results on EndoNeRF and StereoMIS-P1 datasets among monocular reconstruction baselines.

Conclusion: Endo-G²T successfully anchors geometry early in 4D Gaussian splatting while maintaining temporal consistency and efficiency in dynamic endoscopic scenes.

Abstract: Endoscopic (endo) video exhibits strong view-dependent effects such as specularities, wet reflections, and occlusions. Pure photometric supervision misaligns with geometry and triggers early geometric drift, where erroneous shapes are reinforced during densification and become hard to correct. We ask how to anchor geometry early for 4D Gaussian splatting (4DGS) while maintaining temporal consistency and efficiency in dynamic endoscopic scenes. Thus, we present Endo-G$^{2}$T, a geometry-guided and temporally aware training scheme for time-embedded 4DGS. First, geo-guided prior distillation converts confidence-gated monocular depth into supervision with scale-invariant depth and depth-gradient losses, using a warm-up-to-cap schedule to inject priors softly and avoid early overfitting. Second, a time-embedded Gaussian field represents dynamics in XYZT with a rotor-like rotation parameterization, yielding temporally coherent geometry with lightweight regularization that favors smooth motion and crisp opacity boundaries. Third, keyframe-constrained streaming improves efficiency and long-horizon stability through keyframe-focused optimization under a max-points budget, while non-keyframes advance with lightweight updates. Across EndoNeRF and StereoMIS-P1 datasets, Endo-G$^{2}$T achieves state-of-the-art results among monocular reconstruction baselines.

[170] Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning

Xin Gu, Haoji Zhang, Qihang Fan, Jingxuan Niu, Zhipeng Zhang, Libo Zhang, Guang Chen, Fan Chen, Longyin Wen, Sijie Zhu

Main category: cs.CV

TL;DR: STVG-o1 enables off-the-shelf multimodal large language models (MLLMs) to achieve state-of-the-art spatio-temporal video grounding performance without architectural changes, using bounding-box chain-of-thought and multi-dimensional reinforcement rewards.

Details

Motivation: Multimodal large language models (MLLMs) underperform on spatio-temporal video grounding (STVG) due to misaligned training objectives and weak fine-grained region-word alignment in standard visual encoders.

Method: Proposes a bounding-box chain-of-thought mechanism for explicit spatio-temporal reasoning and a multi-dimensional reinforcement reward function with format, consistency, temporal, spatial, and think rewards for geometry-aware supervision.

Result: Sets new state-of-the-art on HCSTVG-v1/v2 with 7.3% m_tIoU improvement over best task-specific method, matches specialized models on VidSTG, and surpasses all existing MLLM-based approaches by large margins.

Conclusion: Establishes MLLMs as viable and powerful backbones for precise spatio-temporal grounding with strong open-vocabulary generalization across datasets.

Abstract: Spatio-temporal video grounding (STVG) requires localizing a target object in untrimmed videos both temporally and spatially from natural language descriptions. Despite their strong language understanding, multimodal large language models (MLLMs) underperform on STVG due to misaligned training objectives and weak fine-grained region-word alignment in standard visual encoders. To address this, we propose STVG-o1, the first framework that enables off-the-shelf MLLMs to achieve state-of-the-art STVG performance without any architectural modifications. Our method introduces a bounding-box chain-of-thought mechanism that explicitly reasons about spatio-temporal locations in an intermediate step before producing the final prediction. We further design a multi-dimensional reinforcement reward function consisting of format, consistency, temporal, spatial, and think rewards, which provides geometry-aware supervision through reinforcement fine-tuning. Evaluated on HCSTVG-v1/v2 and VidSTG, STVG-o1 sets new state-of-the-art results on HCSTVG, outperforming the best task-specific method by 7.3% m_tIoU on HCSTVG-v1, matching specialized models on VidSTG, and surpassing all existing MLLM-based approaches by large margins. It also demonstrates strong open-vocabulary generalization across datasets, establishing MLLMs as viable and powerful backbones for precise spatio-temporal grounding. Our code and models will be released.

[171] Frequency-Aware Token Reduction for Efficient Vision Transformer

Dong-Jae Lee, Jiwan Hur, Jaehyun Choi, Jaemyung Yu, Junmo Kim

Main category: cs.CV

TL;DR: A frequency-aware token reduction strategy for Vision Transformers that partitions tokens into high-frequency and low-frequency components, selectively preserving high-frequency tokens while aggregating low-frequency ones to improve efficiency and mitigate rank collapsing.

Details

Motivation: Vision Transformers suffer from quadratic computational complexity with token length, and existing token reduction methods overlook frequency characteristics like rank collapsing and over-smoothing in self-attention.

Method: Partitions tokens into high-frequency and low-frequency tokens, selectively preserves high-frequency tokens, and aggregates low-frequency tokens into a compact direct current token to retain essential low-frequency components.

Result: Significantly improves accuracy while reducing computational overhead and mitigating rank collapsing and over-smoothing through extensive experiments and analysis.

Conclusion: The proposed frequency-aware token reduction strategy effectively addresses computational efficiency issues in Vision Transformers while preserving performance and mitigating frequency-related problems in self-attention.

Abstract: Vision Transformers have demonstrated exceptional performance across various computer vision tasks, yet their quadratic computational complexity concerning token length remains a significant challenge. To address this, token reduction methods have been widely explored. However, existing approaches often overlook the frequency characteristics of self-attention, such as rank collapsing and over-smoothing phenomenon. In this paper, we propose a frequency-aware token reduction strategy that improves computational efficiency while preserving performance by mitigating rank collapsing. Our method partitions tokens into high-frequency tokens and low-frequency tokens. high-frequency tokens are selectively preserved, while low-frequency tokens are aggregated into a compact direct current token to retain essential low-frequency components. Through extensive experiments and analysis, we demonstrate that our approach significantly improves accuracy while reducing computational overhead and mitigating rank collapsing and over smoothing. Furthermore, we analyze the previous methods, shedding light on their implicit frequency characteristics and limitations.

[172] DiverseVAR: Balancing Diversity and Quality of Next-Scale Visual Autoregressive Models

Mingue Park, Prin Phunyaphibarn, Phillip Y. Lee, Minhyuk Sung

Main category: cs.CV

TL;DR: DiverseVAR enhances diversity in text-conditioned visual autoregressive models without retraining by combining text-embedding noise injection with scale-travel latent refinement.

Details

Motivation: VAR models suffer from limited diversity, producing nearly identical images for simple prompts, while current research focuses mainly on image quality rather than diversity.

Method: Two-stage approach: 1) Inject noise into text embeddings to increase diversity, 2) Use scale-travel latent refinement (resuming generation at intermediate stages via multi-scale autoencoder) to preserve image quality.

Result: Combining text-embedding noise injection with scale-travel refinement significantly enhances diversity while minimizing quality degradation, achieving a new Pareto frontier in diversity-quality trade-off.

Conclusion: DiverseVAR provides an effective test-time solution to improve VAR model diversity without computational overhead, addressing a critical limitation in current visual autoregressive models.

Abstract: We introduce DiverseVAR, a framework that enhances the diversity of text-conditioned visual autoregressive models (VAR) at test time without requiring retraining, fine-tuning, or substantial computational overhead. While VAR models have recently emerged as strong competitors to diffusion and flow models for image generation, they suffer from a critical limitation in diversity, often producing nearly identical images even for simple prompts. This issue has largely gone unnoticed amid the predominant focus on image quality. We address this limitation at test time in two stages. First, inspired by diversity enhancement techniques in diffusion models, we propose injecting noise into the text embedding. This introduces a trade-off between diversity and image quality: as diversity increases, the image quality sharply declines. To preserve quality, we propose scale-travel: a novel latent refinement technique inspired by time-travel strategies in diffusion models. Specifically, we use a multi-scale autoencoder to extract coarse-scale tokens that enable us to resume generation at intermediate stages. Extensive experiments show that combining text-embedding noise injection with our scale-travel refinement significantly enhances diversity while minimizing image-quality degradation, achieving a new Pareto frontier in the diversity-quality trade-off.

[173] Merge and Bound: Direct Manipulations on Weights for Class Incremental Learning

Taehoon Kim, Donghwan Jang, Bohyung Han

Main category: cs.CV

TL;DR: Merge-and-Bound (M&B) is a novel Class Incremental Learning approach that directly optimizes model weights through inter-task and intra-task weight merging with bounded updates to prevent catastrophic forgetting.

Details

Motivation: To address catastrophic forgetting in Class Incremental Learning by directly manipulating model weights in parameter space rather than relying on architectural changes or revised learning objectives.

Method: Two-stage weight merging: inter-task merging averages weights from previous models, and intra-task merging combines parameters within current stage. Bounded update technique minimizes cumulative updates while preserving previous knowledge.

Result: Superior performance compared to state-of-the-art methods on standard CIL benchmarks, demonstrating effective knowledge retention and learning of new tasks.

Conclusion: M&B provides an effective weight manipulation approach for CIL that can be seamlessly integrated into existing methods without architectural modifications, offering a practical solution to catastrophic forgetting.

Abstract: We present a novel training approach, named Merge-and-Bound (M&B) for Class Incremental Learning (CIL), which directly manipulates model weights in the parameter space for optimization. Our algorithm involves two types of weight merging: inter-task weight merging and intra-task weight merging. Inter-task weight merging unifies previous models by averaging the weights of models from all previous stages. On the other hand, intra-task weight merging facilitates the learning of current task by combining the model parameters within current stage. For reliable weight merging, we also propose a bounded update technique that aims to optimize the target model with minimal cumulative updates and preserve knowledge from previous tasks; this strategy reveals that it is possible to effectively obtain new models near old ones, reducing catastrophic forgetting. M&B is seamlessly integrated into existing CIL methods without modifying architecture components or revising learning objectives. We extensively evaluate our algorithm on standard CIL benchmarks and demonstrate superior performance compared to state-of-the-art methods.

[174] E-M3RF: An Equivariant Multimodal 3D Re-assembly Framework

Adeela Islam, Stefano Fiorini, Manuel Lecha, Theodore Tsesmelis, Stuart James, Pietro Morerio, Alessio Del Bue

Main category: cs.CV

TL;DR: E-M3RF is an equivariant multimodal 3D reassembly framework that uses both geometric and color features to reassemble fractured fragments, outperforming existing methods by reducing rotation error by 23.1% and translation error by 13.2%.

Details

Motivation: Traditional 3D reassembly methods rely primarily on geometric features, which struggle when geometry is insufficient or ambiguous (e.g., small, eroded, or symmetric fragments). Additionally, existing solutions lack physical constraints to prevent overlapping assemblies.

Method: Uses SE(3) flow matching to predict transformations for reassembly. Combines rotation-equivariant geometric features from 3D point positions with color features encoded by a transformer, creating multimodal representations from both geometry and color data.

Result: On the RePAIR dataset, E-M3RF reduces rotation error by 23.1%, translation error by 13.2%, and Chamfer Distance by 18.4% compared to competing methods. Validated on four datasets including synthetic (Breaking Bad, Fantastic Breaks) and real-world cultural heritage datasets (RePAIR, Presious).

Conclusion: E-M3RF effectively addresses limitations of geometry-only approaches by incorporating multimodal features (geometry + color) and physical constraints, achieving superior performance in 3D reassembly tasks, particularly for challenging cases where geometric features alone are insufficient.

Abstract: 3D reassembly is a fundamental geometric problem, and in recent years it has increasingly been challenged by deep learning methods rather than classical optimization. While learning approaches have shown promising results, most still rely primarily on geometric features to assemble a whole from its parts. As a result, methods struggle when geometry alone is insufficient or ambiguous, for example, for small, eroded, or symmetric fragments. Additionally, solutions do not impose physical constraints that explicitly prevent overlapping assemblies. To address these limitations, we introduce E-M3RF, an equivariant multimodal 3D reassembly framework that takes as input the point clouds, containing both point positions and colors of fractured fragments, and predicts the transformations required to reassemble them using SE(3) flow matching. Each fragment is represented by both geometric and color features: i) 3D point positions are encoded as rotationconsistent geometric features using a rotation-equivariant encoder, ii) the colors at each 3D point are encoded with a transformer. The two feature sets are then combined to form a multimodal representation. We experimented on four datasets: two synthetic datasets, Breaking Bad and Fantastic Breaks, and two real-world cultural heritage datasets, RePAIR and Presious, demonstrating that E-M3RF on the RePAIR dataset reduces rotation error by 23.1% and translation error by 13.2%, while Chamfer Distance decreases by 18.4% compared to competing methods.

[175] MobileI2V: Fast and High-Resolution Image-to-Video on Mobile Devices

Shuai Zhang, Bao Tang, Siyuan Yu, Yueting Zhu, Jingfeng Yao, Ya Zou, Shanglin Yuan, Li Yu, Wenyu Liu, Xinggang Wang

Main category: cs.CV

TL;DR: MobileI2V is a 270M lightweight diffusion model for real-time 720p image-to-video generation on mobile devices, achieving 10x speed-up through step distillation and mobile optimizations.

Details

Motivation: The substantial computational complexity and slow generation speed of diffusion models pose challenges for real-time, high-resolution video generation on resource-constrained mobile devices.

Method: Proposed linear hybrid architecture denoiser, time-step distillation strategy reducing sampling steps from 20+ to 2, and mobile-specific attention optimizations for 2x speed-up.

Result: Enables fast 720p image-to-video generation on mobile devices with quality comparable to existing models, achieving generation speed of less than 100ms per frame under one-step conditions.

Conclusion: MobileI2V demonstrates the feasibility of real-time high-quality video generation on mobile devices through efficient model design and optimization techniques.

Abstract: Recently, video generation has witnessed rapid advancements, drawing increasing attention to image-to-video (I2V) synthesis on mobile devices. However, the substantial computational complexity and slow generation speed of diffusion models pose significant challenges for real-time, high-resolution video generation on resource-constrained mobile devices. In this work, we propose MobileI2V, a 270M lightweight diffusion model for real-time image-to-video generation on mobile devices. The core lies in: (1) We analyzed the performance of linear attention modules and softmax attention modules on mobile devices, and proposed a linear hybrid architecture denoiser that balances generation efficiency and quality. (2) We design a time-step distillation strategy that compresses the I2V sampling steps from more than 20 to only two without significant quality loss, resulting in a 10-fold increase in generation speed. (3) We apply mobile-specific attention optimizations that yield a 2-fold speed-up for attention operations during on-device inference. MobileI2V enables, for the first time, fast 720p image-to-video generation on mobile devices, with quality comparable to existing models. Under one-step conditions, the generation speed of each frame of 720p video is less than 100 ms. Our code is available at: https://github.com/hustvl/MobileI2V.

[176] Multimodal Robust Prompt Distillation for 3D Point Cloud Models

Xiang Gu, Liming Lu, Xu Zheng, Anan Du, Yongbin Zhou, Shuchao Pang

Main category: cs.CV

TL;DR: MRPD is an efficient teacher-student framework that uses multimodal knowledge distillation to create robust 3D point cloud models against adversarial attacks, with no inference overhead.

Details

Motivation: Existing defense methods for 3D point cloud models suffer from high computational overhead and poor generalization across different attack types, limiting their practical deployment in security-sensitive applications.

Method: Proposes Multimodal Robust Prompt Distillation (MRPD) - a teacher-student framework that learns lightweight prompts by aligning student model features with robust embeddings from three teachers: vision model (depth projections), high-performance 3D model, and text encoder, guided by confidence-gated mechanism.

Result: MRPD substantially outperforms state-of-the-art defense methods against white-box and black-box attacks, while achieving better performance on clean data, with no additional computational cost at inference.

Conclusion: Presents a practical paradigm for building robust 3D vision systems by efficiently harnessing multimodal knowledge through distillation during training only.

Abstract: Adversarial attacks pose a significant threat to learning-based 3D point cloud models, critically undermining their reliability in security-sensitive applications. Existing defense methods often suffer from (1) high computational overhead and (2) poor generalization ability across diverse attack types. To bridge these gaps, we propose a novel yet efficient teacher-student framework, namely Multimodal Robust Prompt Distillation (MRPD) for distilling robust 3D point cloud model. It learns lightweight prompts by aligning student point cloud model’s features with robust embeddings from three distinct teachers: a vision model processing depth projections, a high-performance 3D model, and a text encoder. To ensure a reliable knowledge transfer, this distillation is guided by a confidence-gated mechanism which dynamically balances the contribution of all input modalities. Notably, since the distillation is all during the training stage, there is no additional computational cost at inference. Extensive experiments demonstrate that MRPD substantially outperforms state-of-the-art defense methods against a wide range of white-box and black-box attacks, while even achieving better performance on clean data. Our work presents a new, practical paradigm for building robust 3D vision systems by efficiently harnessing multimodal knowledge.

[177] CanKD: Cross-Attention-based Non-local operation for Feature-based Knowledge Distillation

Shizhe Sun, Wataru Ohyama

Main category: cs.CV

TL;DR: Cross-Attention-based Non-local Knowledge Distillation (CanKD) uses cross-attention mechanisms for feature-based knowledge distillation, enabling dynamic pixel-wise knowledge transfer between teacher and student models.

Details

Motivation: Traditional self-attention distillation methods align teacher and student features independently, lacking comprehensive pixel-wise relationship capture between models.

Method: Introduces cross-attention mechanism where each student pixel dynamically considers all teacher pixels, implemented as an additional loss function without architectural changes.

Result: Outperforms state-of-the-art feature and hybrid distillation methods on object detection and image segmentation tasks.

Conclusion: CanKD demonstrates superior performance and represents a new paradigm for attention-guided knowledge distillation in computer vision.

Abstract: We propose Cross-Attention-based Non-local Knowledge Distillation (CanKD), a novel feature-based knowledge distillation framework that leverages cross-attention mechanisms to enhance the knowledge transfer process. Unlike traditional self-attention-based distillation methods that align teacher and student feature maps independently, CanKD enables each pixel in the student feature map to dynamically consider all pixels in the teacher feature map. This non-local knowledge transfer more thoroughly captures pixel-wise relationships, improving feature representation learning. Our method introduces only an additional loss function to achieve superior performance compared with existing attention-guided distillation methods. Extensive experiments on object detection and image segmentation tasks demonstrate that CanKD outperforms state-of-the-art feature and hybrid distillation methods. These experimental results highlight CanKD’s potential as a new paradigm for attention-guided distillation in computer vision tasks. Code is available at https://github.com/tori-hotaru/CanKD

[178] Generalized Design Choices for Deepfake Detectors

Lorenzo Pellegrini, Serafino Pandolfini, Davide Maltoni, Matteo Ferrara, Marco Prati, Marco Ramilli

Main category: cs.CV

TL;DR: Systematic investigation of design choices in deepfake detection reveals that implementation details like data preprocessing and augmentation significantly impact performance more than core architecture, leading to architecture-agnostic best practices.

Details

Motivation: Deepfake detection effectiveness depends heavily on implementation details rather than core design, making fair comparisons difficult and obscuring true performance factors.

Method: Systematic investigation isolating the impact of individual design choices related to training, inference, and incremental updates in deepfake detection models.

Result: Identified a set of design choices that consistently improve deepfake detection and achieve state-of-the-art performance on the AI-GenBench benchmark.

Conclusion: Established robust, architecture-agnostic best practices for future deepfake detection systems by understanding how different implementation factors influence accuracy and generalization.

Abstract: The effectiveness of deepfake detection methods often depends less on their core design and more on implementation details such as data preprocessing, augmentation strategies, and optimization techniques. These factors make it difficult to fairly compare detectors and to understand which factors truly contribute to their performance. To address this, we systematically investigate how different design choices influence the accuracy and generalization capabilities of deepfake detection models, focusing on aspects related to training, inference, and incremental updates. By isolating the impact of individual factors, we aim to establish robust, architecture-agnostic best practices for the design and development of future deepfake detection systems. Our experiments identify a set of design choices that consistently improve deepfake detection and enable state-of-the-art performance on the AI-GenBench benchmark.

[179] Self-Paced Learning for Images of Antinuclear Antibodies

Yiyang Jiang, Guangwu Qian, Jiaxin Wu, Qi Huang, Qing Li, Yongkang Wu, Xiao-Yong Wei

Main category: cs.CV

TL;DR: A novel multi-instance multi-label (MIML) framework for automated ANA detection that uses unaltered microscope images without manual preprocessing, achieving state-of-the-art performance with up to +7.0% F1-Macro and +12.6% mAP gains.

Details

Motivation: Manual ANA testing is slow, labor-intensive, and requires extensive training. Automation is challenging due to over 100 coexisting antibody types and complex fluorescent patterns, requiring MIML learning approaches.

Method: Framework with three components: instance sampler (suppresses low-confidence instances), probabilistic pseudo-label dispatcher (adaptive label assignment), and self-paced weight learning rate coefficients (adjusts training based on label observations).

Result: Achieved up to +7.0% F1-Macro and +12.6% mAP gains on ANA dataset over prior methods. Ranked top-2 on public medical MIML benchmarks, reducing Hamming loss by 18.2% and one-error by 26.9%.

Conclusion: The proposed framework effectively handles MIML complexities in ANA detection, supports end-to-end optimization, and sets new state-of-the-art results for automated autoimmune disorder diagnosis.

Abstract: Antinuclear antibody (ANA) testing is a crucial method for diagnosing autoimmune disorders, including lupus, Sjögren’s syndrome, and scleroderma. Despite its importance, manual ANA detection is slow, labor-intensive, and demands years of training. ANA detection is complicated by over 100 coexisting antibody types, resulting in vast fluorescent pattern combinations. Although machine learning and deep learning have enabled automation, ANA detection in real-world clinical settings presents unique challenges as it involves multi-instance, multi-label (MIML) learning. In this paper, a novel framework for ANA detection is proposed that handles the complexities of MIML tasks using unaltered microscope images without manual preprocessing. Inspired by human labeling logic, it identifies consistent ANA sub-regions and assigns aggregated labels accordingly. These steps are implemented using three task-specific components: an instance sampler, a probabilistic pseudo-label dispatcher, and self-paced weight learning rate coefficients. The instance sampler suppresses low-confidence instances by modeling pattern confidence, while the dispatcher adaptively assigns labels based on instance distinguishability. Self-paced learning adjusts training according to empirical label observations. Our framework overcomes limitations of traditional MIML methods and supports end-to-end optimization. Extensive experiments on one ANA dataset and three public medical MIML benchmarks demonstrate the superiority of our framework. On the ANA dataset, our model achieves up to +7.0% F1-Macro and +12.6% mAP gains over the best prior method, setting new state-of-the-art results. It also ranks top-2 across all key metrics on public datasets, reducing Hamming loss and one-error by up to 18.2% and 26.9%, respectively. The source code can be accessed at https://github.com/fletcherjiang/ANA-SelfPacedLearning.

[180] Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan Liu, Dunjie Lu, Ruilin Luo, Chenxu Lv, Rui Men, Lingchen Meng, Xuancheng Ren, Xingzhang Ren, Sibo Song, Yuchong Sun, Jun Tang, Jianhong Tu, Jianqiang Wan, Peng Wang, Pengfei Wang, Qiuyue Wang, Yuxuan Wang, Tianbao Xie, Yiheng Xu, Haiyang Xu, Jin Xu, Zhibo Yang, Mingkun Yang, Jianxin Yang, An Yang, Bowen Yu, Fei Zhang, Hang Zhang, Xi Zhang, Bo Zheng, Humen Zhong, Jingren Zhou, Fan Zhou, Jing Zhou, Yuanzhi Zhu, Ke Zhu

Main category: cs.CV

TL;DR: Qwen3-VL is the most advanced vision-language model in the Qwen series, featuring native support for 256K token interleaved contexts, multiple model sizes (2B-235B), and superior performance across text, image, and video tasks with enhanced architectural upgrades.

Details

Motivation: To create a comprehensive vision-language model that bridges the gap between text, image, and video understanding while supporting long-context processing and diverse deployment scenarios through both dense and mixture-of-experts architectures.

Method: Introduces three key architectural upgrades: enhanced interleaved-MRoPE for spatial-temporal modeling, DeepStack integration for multi-level ViT features, and text-based time alignment for video. Supports both dense (2B-32B) and MoE (30B-235B) variants with native 256K token context windows.

Result: Achieves superior performance across multimodal benchmarks including MMMU, MathVista, and MathVision. Demonstrates stronger pure-text understanding than comparable text-only models, robust long-context comprehension, and advanced multimodal reasoning across single-image, multi-image, and video tasks.

Conclusion: Qwen3-VL serves as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence, offering superior performance under comparable token budgets and latency constraints across both dense and MoE architectures.

Abstract: We introduce Qwen3-VL, the most capable vision-language model in the Qwen series to date, achieving superior performance across a broad range of multimodal benchmarks. It natively supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. The model family includes both dense (2B/4B/8B/32B) and mixture-of-experts (30B-A3B/235B-A22B) variants to accommodate diverse latency-quality trade-offs. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs, enabling faithful retention, retrieval, and cross-referencing across long documents and videos; and (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks, demonstrating leading performance on comprehensive evaluations such as MMMU and visual-math benchmarks (e.g., MathVista and MathVision). Architecturally, we introduce three key upgrades: (i) an enhanced interleaved-MRoPE for stronger spatial-temporal modeling across images and video; (ii) DeepStack integration, which effectively leverages multi-level ViT features to tighten vision-language alignment; and (iii) text-based time alignment for video, evolving from T-RoPE to explicit textual timestamp alignment for more precise temporal grounding. Under comparable token budgets and latency constraints, Qwen3-VL achieves superior performance in both dense and Mixture-of-Experts (MoE) architectures. We envision Qwen3-VL serving as a foundational engine for image-grounded reasoning, agentic decision-making, and multimodal code intelligence in real-world workflows.

[181] EoS-FM: Can an Ensemble of Specialist Models act as a Generalist Feature Extractor?

Pierre Adorni, Minh-Tan Pham, Stéphane May, Sébastien Lefèvre

Main category: cs.CV

TL;DR: An Ensemble-of-Specialists framework for building efficient Remote Sensing Foundation Models that decomposes training into lightweight task-specific specialists, offering advantages in efficiency, interpretability, and extensibility.

Details

Motivation: Current foundation model strategies require prohibitive computational resources and large datasets, limiting accessibility and contradicting sustainable AI principles due to immense carbon footprints.

Method: Decomposes training into lightweight, task-specific ConvNeXtV2 specialists that can be frozen and reused, supporting federated training, pruning, and continuous specialist integration.

Result: The framework provides a scalable and efficient alternative to large foundation models, particularly suitable for collaborative and resource-constrained settings.

Conclusion: Sets a new direction for building scalable and efficient Remote Sensing Foundation Models through a modular Ensemble-of-Specialists approach.

Abstract: Recent advances in foundation models have shown great promise in domains such as natural language processing and computer vision, and similar efforts are now emerging in the Earth Observation community. These models aim to generalize across tasks with limited supervision, reducing the need for training separate models for each task. However, current strategies, which largely focus on scaling model size and dataset volume, require prohibitive computational and data resources, limiting accessibility to only a few large institutions. Moreover, this paradigm of ever-larger models stands in stark contrast with the principles of sustainable and environmentally responsible AI, as it leads to immense carbon footprints and resource inefficiency. In this work, we present a novel and efficient alternative: an Ensemble-of-Specialists framework for building Remote Sensing Foundation Models (RSFMs). Our method decomposes the training process into lightweight, task-specific ConvNeXtV2 specialists that can be frozen and reused. This modular approach offers strong advantages in efficiency, interpretability, and extensibility. Moreover, it naturally supports federated training, pruning, and continuous specialist integration, making it particularly well-suited for collaborative and resource-constrained settings. Our framework sets a new direction for building scalable and efficient RSFMs.

[182] Continual Error Correction on Low-Resource Devices

Kirill Paramonov, Mete Ozay, Aristeidis Mystakidis, Nikolaos Tsalikidis, Dimitrios Sotos, Anastasios Drosou, Dimitrios Tzovaras, Hyunjun Kim, Kiseok Chang, Sangdok Mo, Namwoong Kim, Woojong Yoo, Jijoong Moon, Umberto Michieli

Main category: cs.CV

TL;DR: A system for efficient AI error correction on resource-constrained devices using few-shot learning with server-side foundation model training and on-device prototype-based classification.

Details

Motivation: Address prediction errors in AI models on everyday devices where existing solutions lack efficient correction mechanisms, especially for resource-constrained environments.

Method: Combines server-side foundation model training with knowledge distillation to transfer features to device-compatible architectures, plus on-device prototype-based classification for efficient error correction through prototype updates.

Result: Achieved over 50% error correction in one-shot scenarios on Food-101 and Flowers-102 datasets with minimal forgetting (<0.02%) and negligible computational overhead.

Conclusion: The system proves practical for real-world scenarios through Android validation, enabling efficient AI error correction on resource-constrained devices.

Abstract: The proliferation of AI models in everyday devices has highlighted a critical challenge: prediction errors that degrade user experience. While existing solutions focus on error detection, they rarely provide efficient correction mechanisms, especially for resource-constrained devices. We present a novel system enabling users to correct AI misclassifications through few-shot learning, requiring minimal computational resources and storage. Our approach combines server-side foundation model training with on-device prototype-based classification, enabling efficient error correction through prototype updates rather than model retraining. The system consists of two key components: (1) a server-side pipeline that leverages knowledge distillation to transfer robust feature representations from foundation models to device-compatible architectures, and (2) a device-side mechanism that enables ultra-efficient error correction through prototype adaptation. We demonstrate our system’s effectiveness on both image classification and object detection tasks, achieving over 50% error correction in one-shot scenarios on Food-101 and Flowers-102 datasets while maintaining minimal forgetting (less than 0.02%) and negligible computational overhead. Our implementation, validated through an Android demonstration app, proves the system’s practicality in real-world scenarios.

[183] The Age-specific Alzheimer ’s Disease Prediction with Characteristic Constraints in Nonuniform Time Span

Xin Hong, Kaifeng Huang

Main category: cs.CV

TL;DR: This paper presents a novel method for generating sequential MRI images to predict Alzheimer’s disease progression, using quantitative metrics and age-scaling factors to improve accuracy.

Details

Motivation: Timely identification of Alzheimer's disease is crucial for personalized treatment, but existing methods struggle with irregular time intervals in input sequences and accurate representation of disease characteristics.

Method: Innovative sequential image generation methodology guided by quantitative metrics, with integration of age-scaling factor to produce age-specific MRI images for predicting advanced disease stages.

Result: Ablation study showed quantitative metrics significantly improve MRI image synthesis accuracy. Age-scaled pixel loss enhanced iterative generation. Structural Similarity Index reached 0.882 for long-term disease prognosis.

Conclusion: The proposed approach effectively generates sequential MRI images that maintain disease progression features, enabling improved prediction of Alzheimer’s disease advancement through quantitative guidance and age-specific scaling.

Abstract: Alzheimer’s disease is a debilitating disorder marked by a decline in cognitive function. Timely identification of the disease is essential for the development of personalized treatment strategies that aim to mitigate its progression. The application of generated images for the prediction of Alzheimer’s disease poses challenges, particularly in accurately representing the disease’s characteristics when input sequences are captured at irregular time intervals. This study presents an innovative methodology for sequential image generation, guided by quantitative metrics, to maintain the essential features indicative of disease progression. Furthermore, an age-scaling factor is integrated into the process to produce age-specific MRI images, facilitating the prediction of advanced stages of the disease. The results obtained from the ablation study suggest that the inclusion of quantitative metrics significantly improves the accuracy of MRI image synthesis. Furthermore, the application of age-scaled pixel loss contributed to the enhanced iterative generation of MRI images. In terms of long-term disease prognosis, the Structural Similarity Index reached a peak value of 0.882, indicating a substantial degree of similarity in the synthesized images.

[184] Attention-Guided Patch-Wise Sparse Adversarial Attacks on Vision-Language-Action Models

Naifu Zhang, Wei Tao, Xi Xiao, Qianpu Sun, Yuxin Zheng, Wentao Mo, Peiqiang Wang, Nan Zhang

Main category: cs.CV

TL;DR: ADVLA is an efficient adversarial attack framework for Vision-Language-Action models that applies perturbations directly in feature space, achieving high success rates with minimal visible patches and low computational cost.

Details

Motivation: Existing adversarial attack methods for VLA models require expensive end-to-end training and generate noticeable perturbation patches, limiting their practical applicability.

Method: ADVLA applies adversarial perturbations on features projected from visual encoder to textual feature space, using attention guidance and three strategies: sensitivity enhancement, sparsity enforcement, and perturbation concentration.

Result: Under L∞=4/255 constraint, ADVLA with Top-K masking modifies <10% of patches while achieving ~100% attack success rate. Perturbations are concentrated on critical regions, nearly imperceptible, and take only ~0.06 seconds per iteration.

Conclusion: ADVLA effectively weakens VLA model action predictions under low-amplitude and sparse conditions, avoiding high training costs and conspicuous perturbations of traditional methods, demonstrating practical value for feature space attacks.

Abstract: In recent years, Vision-Language-Action (VLA) models in embodied intelligence have developed rapidly. However, existing adversarial attack methods require costly end-to-end training and often generate noticeable perturbation patches. To address these limitations, we propose ADVLA, a framework that directly applies adversarial perturbations on features projected from the visual encoder into the textual feature space. ADVLA efficiently disrupts downstream action predictions under low-amplitude constraints, and attention guidance allows the perturbations to be both focused and sparse. We introduce three strategies that enhance sensitivity, enforce sparsity, and concentrate perturbations. Experiments demonstrate that under an $L_{\infty}=4/255$ constraint, ADVLA combined with Top-K masking modifies less than 10% of the patches while achieving an attack success rate of nearly 100%. The perturbations are concentrated on critical regions, remain almost imperceptible in the overall image, and a single-step iteration takes only about 0.06 seconds, significantly outperforming conventional patch-based attacks. In summary, ADVLA effectively weakens downstream action predictions of VLA models under low-amplitude and locally sparse conditions, avoiding the high training costs and conspicuous perturbations of traditional patch attacks, and demonstrates unique effectiveness and practical value for attacking VLA feature spaces.

[185] Video Generation Models Are Good Latent Reward Models

Xiaoyue Mi, Wenqing Yu, Jiesong Lian, Shibo Jie, Ruizhe Zhong, Zijun Liu, Guozhen Zhang, Zixiang Zhou, Zhiyong Xu, Yuan Zhou, Qinglin Lu, Fan Tang

Main category: cs.CV

TL;DR: PRFL is a latent-space reward feedback learning framework for video generation that avoids VAE decoding, enabling efficient optimization throughout the denoising process while improving human preference alignment.

Details

Motivation: Existing video reward models require pixel-space inputs, leading to high memory overhead, slow training, and limited optimization to late-stage visual quality rather than fundamental motion dynamics.

Method: Leverages pre-trained video generation models as reward models in noisy latent space, enabling preference optimization throughout the full denoising chain without VAE decoding.

Result: PRFL significantly improves alignment with human preferences while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.

Conclusion: Video generation models are naturally suited for latent-space reward modeling, enabling efficient and effective preference optimization throughout the entire denoising process.

Abstract: Reward feedback learning (ReFL) has proven effective for aligning image generation with human preferences. However, its extension to video generation faces significant challenges. Existing video reward models rely on vision-language models designed for pixel-space inputs, confining ReFL optimization to near-complete denoising steps after computationally expensive VAE decoding. This pixel-space approach incurs substantial memory overhead and increased training time, and its late-stage optimization lacks early-stage supervision, refining only visual quality rather than fundamental motion dynamics and structural coherence. In this work, we show that pre-trained video generation models are naturally suited for reward modeling in the noisy latent space, as they are explicitly designed to process noisy latent representations at arbitrary timesteps and inherently preserve temporal information through their sequential modeling capabilities. Accordingly, we propose Process Reward Feedback Learning~(PRFL), a framework that conducts preference optimization entirely in latent space, enabling efficient gradient backpropagation throughout the full denoising chain without VAE decoding. Extensive experiments demonstrate that PRFL significantly improves alignment with human preferences, while achieving substantial reductions in memory consumption and training time compared to RGB ReFL.

[186] UAVLight: A Benchmark for Illumination-Robust 3D Reconstruction in Unmanned Aerial Vehicle (UAV) Scenes

Kang Du, Xue Liao, Junpeng Xia, Chaozheng Guo, Yi Gu, Yirui Guan, Duotun Wang, ShengHuang, Zeyu Wang

Main category: cs.CV

TL;DR: UAVLight is a new benchmark for illumination-robust 3D reconstruction that captures scenes at multiple times of day along repeatable flight paths, providing natural lighting variation while maintaining consistent geometry and viewpoints.

Details

Motivation: Illumination inconsistency in multi-view 3D reconstruction causes geometry drift, color inconsistency, and shadow imprinting. Existing datasets either lack meaningful illumination diversity or have confounding geometric/semantic changes, making it difficult to study lighting robustness in isolation.

Method: The benchmark captures each scene along repeatable, geo-referenced flight paths at multiple fixed times of day, producing natural lighting variation under consistent geometry, calibration, and viewpoints.

Result: UAVLight provides controlled-yet-real illumination variation for outdoor 3D reconstruction, enabling standardized evaluation across different lighting conditions.

Conclusion: UAVLight offers a reliable foundation for developing and benchmarking reconstruction methods that are consistent, faithful, and relightable in real outdoor environments, addressing the critical challenge of illumination inconsistency in UAV-based reconstruction.

Abstract: Illumination inconsistency is a fundamental challenge in multi-view 3D reconstruction. Variations in sunlight direction, cloud cover, and shadows break the constant-lighting assumption underlying both classical multi-view stereo (MVS) and structure from motion (SfM) pipelines and recent neural rendering methods, leading to geometry drift, color inconsistency, and shadow imprinting. This issue is especially critical in UAV-based reconstruction, where long flight durations and outdoor environments make lighting changes unavoidable. However, existing datasets either restrict capture to short time windows, thus lacking meaningful illumination diversity, or span months and seasons, where geometric and semantic changes confound the isolated study of lighting robustness. We introduce UAVLight, a controlled-yet-real benchmark for illumination-robust 3D reconstruction. Each scene is captured along repeatable, geo-referenced flight paths at multiple fixed times of day, producing natural lighting variation under consistent geometry, calibration, and viewpoints. With standardized evaluation protocols across lighting conditions, UAVLight provides a reliable foundation for developing and benchmarking reconstruction methods that are consistent, faithful, and relightable in real outdoor environments.

[187] Enhanced Landmark Detection Model in Pelvic Fluoroscopy using 2D/3D Registration Loss

Chou Mo, Yehyun Suh, J. Ryan Martin, Daniel Moyer

Main category: cs.CV

TL;DR: A novel framework combining 2D/3D landmark registration with U-Net training improves pelvic landmark detection accuracy under variable patient pose conditions in fluoroscopy.

Details

Motivation: Current pelvic landmark detection methods assume fixed Antero-Posterior views, but real intra-operative imaging often has variable patient orientation due to repositioning of imaging units or anatomical structures.

Method: Proposed framework incorporates 2D/3D landmark registration into U-Net training, comparing baseline U-Net, U-Net with Pose Estimation Loss, and U-Net fine-tuned with Pose Estimation Loss under realistic variable pose conditions.

Result: The framework analyzes performance differences in landmark detection accuracy between the three approaches when patient pose is variable.

Conclusion: Incorporating pose estimation into landmark detection training improves robustness to variable patient positioning in intra-operative pelvic fluoroscopy.

Abstract: Automated landmark detection offers an efficient approach for medical professionals to understand patient anatomic structure and positioning using intra-operative imaging. While current detection methods for pelvic fluoroscopy demonstrate promising accuracy, most assume a fixed Antero-Posterior view of the pelvis. However, orientation often deviates from this standard view, either due to repositioning of the imaging unit or of the target structure itself. To address this limitation, we propose a novel framework that incorporates 2D/3D landmark registration into the training of a U-Net landmark prediction model. We analyze the performance difference by comparing landmark detection accuracy between the baseline U-Net, U-Net trained with Pose Estimation Loss, and U-Net fine-tuned with Pose Estimation Loss under realistic intra-operative conditions where patient pose is variable.

[188] Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy

Teng Hu, Zhentao Yu, Guozhen Zhang, Zihan Su, Zhengguang Zhou, Youliang Zhang, Yuan Zhou, Qinglin Lu, Ran Yi

Main category: cs.CV

TL;DR: Harmony is a novel framework that addresses audio-visual synchronization challenges in joint diffusion models through cross-task synergy training, global-local decoupled interaction, and synchronization-enhanced CFG.

Details

Motivation: Existing open-source models struggle with robust audio-video alignment due to correspondence drift, inefficient attention mechanisms, and intra-modal bias in CFG, which hinders fine-grained synchronization.

Method: Proposes three key components: Cross-Task Synergy training to mitigate drift, Global-Local Decoupled Interaction Module for efficient temporal alignment, and Synchronization-Enhanced CFG (SyncCFG) to amplify alignment signals.

Result: Extensive experiments show Harmony establishes new state-of-the-art performance, significantly outperforming existing methods in both generation fidelity and fine-grained audio-visual synchronization.

Conclusion: Harmony successfully overcomes fundamental challenges in joint diffusion processes for audio-visual content synthesis, achieving superior synchronization through its mechanistic enforcement approach.

Abstract: The synthesis of synchronized audio-visual content is a key challenge in generative AI, with open-source models facing challenges in robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditionality but not cross-modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm to mitigate drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. Then, we design a Global-Local Decoupled Interaction Module for efficient and precise temporal-style alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.

[189] Deep Learning-Based Multiclass Classification of Oral Lesions with Stratified Augmentation

Joy Naoum, Revana Salama, Ali Hamdi

Main category: cs.CV

TL;DR: Deep learning model for multiclass classification of 16 oral lesions using data augmentation and oversampling to address imbalanced datasets, achieving 83.33% accuracy for early oral cancer detection.

Details

Motivation: Oral cancer is often diagnosed late due to visual similarity between benign, precancerous, and malignant lesions. Early computer-aided diagnosis systems can improve clinical outcomes.

Method: Combines stratified data splitting with advanced data augmentation and oversampling techniques to handle limited and imbalanced datasets for multiclass classification of 16 oral lesions.

Result: Achieved 83.33% accuracy, 89.12% precision, and 77.31% recall, demonstrating superiority over state-of-the-art methods with notable minority class classification performance.

Conclusion: The framework shows promise as a first step toward trustworthy computer-aided diagnostic systems for early oral cancer detection in clinical settings, effectively demonstrating the value of oversampling and augmentation strategies.

Abstract: Oral cancer is highly common across the globe and is mostly diagnosed during the later stages due to the close visual similarity to benign, precancerous, and malignant lesions in the oral cavity. Implementing computer aided diagnosis systems early on has the potential to greatly improve clinical outcomes. This research intends to use deep learning to build a multiclass classifier for sixteen different oral lesions. To overcome the challenges of limited and imbalanced datasets, the proposed technique combines stratified data splitting and advanced data augmentation and oversampling to perform the classification. The experimental results, which achieved 83.33 percent accuracy, 89.12 percent precision, and 77.31 percent recall, demonstrate the superiority of the suggested model over state of the art methods now in use. The suggested model effectively conveys the effectiveness of oversampling and augmentation strategies in situations where the minority class classification performance is noteworthy. As a first step toward trustworthy computer aided diagnostic systems for the early detection of oral cancer in clinical settings, the suggested framework shows promise.

[190] CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness

Zhihang Liu, Chen-Wei Xie, Bin Wen, Feiwu Yu, Jixuan Chen, Pandeng Li, Boqiang Zhang, Nianzu Yang, Yinglu Li, Zuan Gao, Yun Zheng, Hongtao Xie

Main category: cs.CV

TL;DR: CAPability is a comprehensive multi-view benchmark for evaluating visual captioning across 12 dimensions spanning six critical views, addressing limitations of outdated benchmarks and incomplete visual element coverage in existing evaluations.

Details

Motivation: Traditional visual captioning benchmarks have become outdated with modern MLLMs, as brief ground-truth sentences and traditional metrics fail to assess detailed captions effectively. Recent benchmarks remain limited to vague-view or object-view analyses with incomplete visual element coverage.

Method: Introduced CAPability benchmark with nearly 11K human-annotated images and videos with visual element annotations. Uses precision and hit metrics to assess correctness and thoroughness of captions. Also introduces heuristic metric ‘know but cannot tell’ (K¬T) by converting annotations to QA pairs to measure performance gap between QA and caption capabilities.

Result: The benchmark provides stable assessment of caption correctness and thoroughness across 12 dimensions. Identifies significant performance gap between QA and caption capabilities through the K¬T metric, revealing MLLMs’ strengths and weaknesses across various captioning dimensions.

Conclusion: CAPability enables holistic analysis of MLLMs’ captioning abilities, identifying their strengths and weaknesses across various dimensions, which guides future research to enhance specific aspects of their capabilities.

Abstract: Visual captioning benchmarks have become outdated with the emergence of modern multimodal large language models (MLLMs), as the brief ground-truth sentences and traditional metrics fail to assess detailed captions effectively. While recent benchmarks attempt to address this by focusing on keyword extraction or object-centric evaluation, they remain limited to vague-view or object-view analyses and incomplete visual element coverage. In this paper, we introduce CAPability, a comprehensive multi-view benchmark for evaluating visual captioning across 12 dimensions spanning six critical views. We curate nearly 11K human-annotated images and videos with visual element annotations to evaluate the generated captions. CAPability stably assesses both the correctness and thoroughness of captions with \textit{precision} and \textit{hit} metrics. By converting annotations to QA pairs, we further introduce a heuristic metric, \textit{know but cannot tell} ($K\bar{T}$), indicating a significant performance gap between QA and caption capabilities. Our work provides a holistic analysis of MLLMs’ captioning abilities, as we identify their strengths and weaknesses across various dimensions, guiding future research to enhance specific aspects of their capabilities.

[191] MoGAN: Improving Motion Quality in Video Diffusion via Few-Step Motion Adversarial Post-Training

Haotian Xue, Qi Chen, Zhonghao Wang, Xun Huang, Eli Shechtman, Jinrong Xie, Yongxin Chen

Main category: cs.CV

TL;DR: MoGAN is a motion-centric post-training framework that improves motion realism in video diffusion models without reward models or human preference data, achieving significant motion quality improvements while maintaining visual fidelity.

Details

Motivation: Video diffusion models achieve strong frame-level fidelity but struggle with motion coherence, dynamics and realism, often producing jitter, ghosting, or implausible dynamics. The standard denoising MSE objective provides no direct supervision on temporal consistency.

Method: Built atop a 3-step distilled video diffusion model, MoGAN trains a DiT-based optical-flow discriminator to differentiate real from generated motion, combined with a distribution-matching regularizer to preserve visual fidelity.

Result: On VBench, MoGAN boosts motion score by +7.3% over the 50-step teacher and +13.3% over the 3-step DMD model. On VideoJAM-Bench, MoGAN improves motion score by +7.4% over the teacher and +8.8% over DMD, while maintaining comparable or better aesthetic and image-quality scores. Human study shows preference for MoGAN’s motion quality (52% vs. 38% for teacher; 56% vs. 29% for DMD).

Conclusion: MoGAN delivers significantly more realistic motion without sacrificing visual fidelity or efficiency, offering a practical path toward fast, high-quality video generation.

Abstract: Video diffusion models achieve strong frame-level fidelity but still struggle with motion coherence, dynamics and realism, often producing jitter, ghosting, or implausible dynamics. A key limitation is that the standard denoising MSE objective provides no direct supervision on temporal consistency, allowing models to achieve low loss while still generating poor motion. We propose MoGAN, a motion-centric post-training framework that improves motion realism without reward models or human preference data. Built atop a 3-step distilled video diffusion model, we train a DiT-based optical-flow discriminator to differentiate real from generated motion, combined with a distribution-matching regularizer to preserve visual fidelity. With experiments on Wan2.1-T2V-1.3B, MoGAN substantially improves motion quality across benchmarks. On VBench, MoGAN boosts motion score by +7.3% over the 50-step teacher and +13.3% over the 3-step DMD model. On VideoJAM-Bench, MoGAN improves motion score by +7.4% over the teacher and +8.8% over DMD, while maintaining comparable or even better aesthetic and image-quality scores. A human study further confirms that MoGAN is preferred for motion quality (52% vs. 38% for the teacher; 56% vs. 29% for DMD). Overall, MoGAN delivers significantly more realistic motion without sacrificing visual fidelity or efficiency, offering a practical path toward fast, high-quality video generation. Project webpage is: https://xavihart.github.io/mogan.

[192] LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?

Maoyuan Ye, Haibin He, Qihuang Zhong, Jing Zhang, Juhua Liu, Bo Du

Main category: cs.CV

TL;DR: LogicOCR is a benchmark for evaluating LMMs’ complex logical reasoning on text-rich images, with 2780 questions across generated and real-world images. It reveals LMMs’ limitations in multimodal reasoning and proposes TextCue, a training-free method that improves performance by enhancing text cue perception.

Details

Motivation: To address the underexplored area of complex logical reasoning performance of Large Multimodal Models (LMMs) on text-rich images, as current advances focus more on general reasoning and OCR capabilities.

Method: Created LogicOCR benchmark with two subsets: LogicOCR-Gen (1100 multi-choice questions on generated images using GPT-Image-1) and LogicOCR-Real (1680 free-form questions on real-world images). Proposed TextCue method that uses attention maps and text segmentation to identify and enlarge important text regions.

Result: Evaluation revealed LMMs still lag in multimodal reasoning compared to text-only inputs, showing they haven’t fully bridged visual reading with reasoning. TextCue method achieved 1.8% accuracy gain over LLaVA-OV-1.5-8B under Chain-of-Thought setting.

Conclusion: LMMs have significant room for improvement in complex logical reasoning on text-rich images. The proposed TextCue method effectively enhances text cue perception without additional training, and the LogicOCR benchmark provides a valuable tool for future research in this area.

Abstract: Recent advances in Large Multimodal Models (LMMs) have revolutionized their reasoning and Optical Character Recognition (OCR) capabilities. However, their complex logical reasoning performance on text-rich images remains underexplored. To bridge this gap, we introduce LogicOCR, a benchmark comprising 2780 questions with two subsets, i.e., LogicOCR-Gen with 1100 multi-choice questions on generated images, and LogicOCR-Real with 1680 meticulously designed free-form questions on real-world images. For constructing LogicOCR-Gen, we first curate a text corpus from the Chinese National Civil Servant Examination, and customize an automatic pipeline to steer GPT-Image-1 to generate images with varied layouts and fonts, ensuring contextual relevance and visual realism. Then, the generated images are manually verified. We evaluate a range of representative LMMs under Chain-of-Thought (CoT) and direct-answer settings. Our multi-dimensional analysis reveals key insights, such as the impact of test-time scaling, input modality differences, and sensitivity to visual-text orientation. Notably, LMMs still lag in multimodal reasoning compared to text-only inputs, indicating that they have not fully bridged visual reading with reasoning. Moreover, we propose TextCue, a training-free method that enhances LMMs’ perception of image regions containing important text cues for solving questions. We leverage LMMs’ attention maps and an off-the-shelf text segmentation specialist to determine the region, which is then cropped and enlarged to augment the original image. Experiments show its effectiveness, e.g., a 1.8% accuracy gain over LLaVA-OV-1.5-8B under the CoT setting. Our benchmark is available at https://github.com/MiliLab/LogicOCR.

[193] ReSAM: Refine, Requery, and Reinforce: Self-Prompting Point-Supervised Segmentation for Remote Sensing Images

M. Naseer Subhani

Main category: cs.CV

TL;DR: A self-prompting framework that adapts SAM to remote sensing imagery using only sparse point annotations, achieving better performance than pretrained SAM and other point-supervised methods.

Details

Motivation: SAM performs poorly on remote sensing imagery due to domain shift and lack of dense annotations, creating a need for efficient adaptation methods that don't require full-mask supervision.

Method: Uses a Refine-Requery-Reinforce loop: generates coarse masks from points (Refine), improves them with self-constructed box prompts (Requery), and aligns embeddings across iterations to reduce bias (Reinforce).

Result: Consistently outperforms pretrained SAM and recent point-supervised methods on three RSI benchmarks (WHU, HRSID, NWPU VHR-10).

Conclusion: Self-prompting and semantic alignment provide an efficient path for scalable point-level adaptation of foundation segmentation models to remote sensing applications.

Abstract: Interactive segmentation models such as the Segment Anything Model (SAM) have demonstrated remarkable generalization on natural images, but perform suboptimally on remote sensing imagery (RSI) due to severe domain shift and the scarcity of dense annotations. To address this, we propose a self-prompting, point-supervised framework that adapts SAM to RSIs using only sparse point annotations. Our method employs a Refine-Requery-Reinforce loop, where coarse pseudo-masks are generated from initial points (Refine), improved with self-constructed box prompts (Requery), and embeddings are aligned across iterations to reduce confirmation bias (Reinforce). Without relying on full-mask supervision, our approach progressively enhances SAM’s segmentation quality and domain robustness through self-guided prompt adaptation . We evaluate our proposed method on three RSI benchmark datasets, including WHU, HRSID, and NWPU VHR-10, showing that our method consistently surpasses pretrained SAM and recent point-supervised segmentation methods. Our results demonstrate that self-prompting and semantic alignment provide an efficient path towards scalable, point-level adaptation of foundation segmentation models for remote sensing applications.

[194] Active Learning for GCN-based Action Recognition

Hichem Sahbi

Main category: cs.CV

TL;DR: A label-efficient GCN model for skeleton-based action recognition that uses adversarial exemplar selection and bidirectional GCN architectures to reduce dependency on large labeled datasets.

Details

Motivation: GCNs for skeleton-based action recognition often require large labeled datasets, which are scarce in practical settings, creating a need for more label-efficient approaches.

Method: 1) Novel adversarial acquisition function to select informative exemplars balancing representativeness, diversity, and uncertainty; 2) Bidirectional and stable GCN architectures for better mapping between ambient and latent spaces.

Result: Extensive evaluations on two challenging benchmarks show significant improvements over prior work in label-efficient skeleton-based action recognition.

Conclusion: The proposed label-efficient GCN model effectively addresses the data scarcity problem in skeleton-based action recognition through intelligent exemplar selection and enhanced network architectures.

Abstract: Despite the notable success of graph convolutional networks (GCNs) in skeleton-based action recognition, their performance often depends on large volumes of labeled data, which are frequently scarce in practical settings. To address this limitation, we propose a novel label-efficient GCN model. Our work makes two primary contributions. First, we develop a novel acquisition function that employs an adversarial strategy to identify a compact set of informative exemplars for labeling. This selection process balances representativeness, diversity, and uncertainty. Second, we introduce bidirectional and stable GCN architectures. These enhanced networks facilitate a more effective mapping between the ambient and latent data spaces, enabling a better understanding of the learned exemplar distribution. Extensive evaluations on two challenging skeleton-based action recognition benchmarks reveal significant improvements achieved by our label-efficient GCNs compared to prior work.

[195] CaFlow: Enhancing Long-Term Action Quality Assessment with Causal Counterfactual Flow

Ruisheng Han, Kanglei Zhou, Shuang Chen, Amir Atapour-Abarghouei, Hubert P. H. Shum

Main category: cs.CV

TL;DR: CaFlow is a unified framework for long-term Action Quality Assessment that combines counterfactual de-confounding with bidirectional temporal modeling to improve robustness and representation coherence.

Details

Motivation: Long-term AQA is challenging due to extended temporal dynamics and contextual confounders. Existing approaches suffer from spurious correlations and unstable long-term representations due to unidirectional modeling and costly annotations.

Method: Proposes CaFlow with two modules: Causal Counterfactual Regularization (CCR) for self-supervised disentanglement of causal/confounding features, and BiT-Flow for bidirectional temporal modeling with cycle-consistency constraints.

Result: Extensive experiments on multiple long-term AQA benchmarks show state-of-the-art performance.

Conclusion: CaFlow effectively addresses long-term AQA challenges by integrating causal robustness with bidirectional temporal modeling, achieving superior performance on benchmark datasets.

Abstract: Action Quality Assessment (AQA) predicts fine-grained execution scores from action videos and is widely applied in sports, rehabilitation, and skill evaluation. Long-term AQA, as in figure skating or rhythmic gymnastics, is especially challenging since it requires modeling extended temporal dynamics while remaining robust to contextual confounders. Existing approaches either depend on costly annotations or rely on unidirectional temporal modeling, making them vulnerable to spurious correlations and unstable long-term representations. To this end, we propose CaFlow, a unified framework that integrates counterfactual de-confounding with bidirectional time-conditioned flow. The Causal Counterfactual Regularization (CCR) module disentangles causal and confounding features in a self-supervised manner and enforces causal robustness through counterfactual interventions, while the BiT-Flow module models forward and backward dynamics with a cycle-consistency constraint to produce smoother and more coherent representations. Extensive experiments on multiple long-term AQA benchmarks demonstrate that CaFlow achieves state-of-the-art performance. Code is available at https://github.com/Harrison21/CaFlow

[196] Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following

Tianyi Xiong, Yi Ge, Ming Li, Zuolong Zhang, Pranav Kulkarni, Kaishen Wang, Qi He, Zeying Zhu, Chenxi Liu, Ruibo Chen, Tong Zheng, Yanshuo Chen, Xiyao Wang, Renrui Zhang, Wenhu Chen, Heng Huang

Main category: cs.CV

TL;DR: Multi-Crit is a benchmark for evaluating multimodal models’ ability to follow diverse, fine-grained evaluation criteria, revealing that current models struggle with pluralistic criterion adherence, especially in open-ended tasks.

Details

Motivation: Large multimodal models are increasingly used as judges in evaluation systems, but their capacity to follow diverse, fine-grained evaluation criteria remains underexplored, creating a need for systematic assessment.

Method: Developed Multi-Crit benchmark through rigorous data curation pipeline with challenging response pairs and multi-criterion human annotations, introducing three novel metrics for pluralistic adherence, criterion-switching flexibility, and conflict recognition.

Result: Analysis of 25 LMMs shows proprietary models struggle with consistent pluralistic criteria adherence (especially in open-ended evaluation), open-source models lag further in flexibility, and critic fine-tuning improves visual grounding but fails to generalize to pluralistic judgment.

Conclusion: Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation by systematically probing the limits of current multimodal judges and highlighting key challenges in criterion-level judgment.

Abstract: Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction following and consistency with human preferences. However, their ability to follow diverse, fine-grained evaluation criteria remains underexplored. We develop Multi-Crit, a benchmark for evaluating multimodal judges on their capacity to follow pluralistic criteria and produce reliable criterion-level judgments. Covering both open-ended generation and verifiable reasoning tasks, Multi-Crit is built through a rigorous data curation pipeline that gathers challenging response pairs with multi-criterion human annotations. It further introduces three novel metrics for systematically assessing pluralistic adherence, criterion-switching flexibility, and the ability to recognize criterion-level preference conflicts. Comprehensive analysis of 25 LMMs reveals that 1) proprietary models still struggle to maintain consistent adherence to pluralistic criteria–especially in open-ended evaluation; 2) open-source models lag further behind in flexibly following diverse criteria; and 3) critic fine-tuning with holistic judgment signals enhances visual grounding but fails to generalize to pluralistic criterion-level judgment. Additional analyses on reasoning fine-tuning, test-time scaling, and boundary consistency between open-source and proprietary models further probe the limits of current multimodal judges. As a pioneering study, Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation.

[197] Revolutionizing Glioma Segmentation & Grading Using 3D MRI - Guided Hybrid Deep Learning Models

Pandiyaraju V, Sreya Mynampati, Abishek Karthik, Poovarasan L, D. Saraswathi

Main category: cs.CV

TL;DR: Hybrid deep learning model combining U-Net segmentation with DenseNet-VGG classification using multihead attention achieves 98% Dice coefficient for tumor segmentation and 99% accuracy for glioma classification.

Details

Motivation: Early and accurate diagnosis of gliomas is crucial due to their high mortality rate, requiring improved automated diagnostic tools for therapeutic intervention.

Method: Developed hybrid framework with U-Net for 3D MRI tumor segmentation and hybrid DenseNet-VGG classifier with multihead attention and spatial-channel attention mechanisms, using preprocessing steps including normalization, resampling, and data augmentation.

Result: Achieved 98% Dice coefficient for tumor segmentation and 99% classification accuracy, outperforming traditional CNN models and attention-free methods.

Conclusion: The framework shows great potential for timely and reliable glioma diagnosis and grading, enabling better treatment planning through enhanced interpretability and accuracy.

Abstract: Gliomas are brain tumor types that have a high mortality rate which means early and accurate diagnosis is important for therapeutic intervention for the tumors. To address this difficulty, the proposed research will develop a hybrid deep learning model which integrates U-Net based segmentation and a hybrid DenseNet-VGG classification network with multihead attention and spatial-channel attention capabilities. The segmentation model will precisely demarcate the tumors in a 3D volume of MRI data guided by spatial and contextual information. The classification network which combines a branch of both DenseNet and VGG, will incorporate the demarcated tumor on which features with attention mechanisms would be focused on clinically relevant features. High-dimensional 3D MRI data could successfully be utilized in the model through preprocessing steps which are normalization, resampling, and data augmentation. Through a variety of measures the framework is evaluated: measures of performance in segmentation are Dice coefficient and Mean Intersection over Union (IoU) and measures of performance in classification are accuracy precision, recall, and F1-score. The hybrid framework that has been proposed has demonstrated through physical testing that it has the capability of obtaining a Dice coefficient of 98% in tumor segmentation, and 99% on classification accuracy, outperforming traditional CNN models and attention-free methods. Utilizing multi-head attention mechanisms enhances notions of priority in aspects of the tumor that are clinically significant, and enhances interpretability and accuracy. The results suggest a great potential of the framework in facilitating the timely and reliable diagnosis and grading of glioma by clinicians is promising, allowing for better planning of patient treatment.

[198] Seeing without Pixels: Perception from Camera Trajectories

Zihui Xue, Kristen Grauman, Dima Damen, Andrew Zisserman, Tengda Han

Main category: cs.CV

TL;DR: Camera trajectories alone can reveal video content through contrastive learning with language embeddings, enabling robust video understanding without pixel data.

Details

Motivation: To investigate whether camera movement patterns alone can reveal video content, challenging the assumption that pixel-level visual information is necessary for video understanding.

Method: Proposed CamFormer - a contrastive learning framework that projects camera pose trajectories into joint embedding space aligned with natural language descriptions.

Result: Camera trajectories are surprisingly informative about video content, enabling tasks like cross-modal alignment, classification, and temporal analysis across different pose estimation methods.

Conclusion: Camera trajectory serves as a lightweight, robust, and versatile modality for perceiving video content, demonstrating that ‘how you move’ reveals ‘what you are doing or observing’.

Abstract: Can one perceive a video’s content without seeing its pixels, just from the camera trajectory-the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. Towards this end, we propose a contrastive learning framework to train CamFormer, a dedicated encoder that projects camera pose trajectories into a joint embedding space, aligning them with natural language. We find that, contrary to its apparent simplicity, the camera trajectory is a remarkably informative signal to uncover video content. In other words, “how you move” can indeed reveal “what you are doing” (egocentric) or “observing” (exocentric). We demonstrate the versatility of our learned CamFormer embeddings on a diverse suite of downstream tasks, ranging from cross-modal alignment to classification and temporal analysis. Importantly, our representations are robust across diverse camera pose estimation methods, including both high-fidelity multi-sensored and standard RGB-only estimators. Our findings establish camera trajectory as a lightweight, robust, and versatile modality for perceiving video content.

[199] UniChange: Unifying Change Detection with Multimodal Large Language Model

Xu Zhang, Danyang Li, Xiaohang Dong, Tianhao Wu, Hualong Yu, Jianye Wang, Qicheng Li, Xiang Li

Main category: cs.CV

TL;DR: UniChange is the first MLLM-based unified change detection model that integrates both binary change detection (BCD) and semantic change detection (SCD) tasks using special tokens and text prompts, achieving state-of-the-art performance across multiple benchmarks.

Details

Motivation: Current change detection models have limited knowledge acquisition from single-type annotated data and cannot leverage diverse BCD and SCD datasets simultaneously, leading to poor generalization and limited versatility.

Method: Leverages MLLMs’ language priors and unification capabilities with three special tokens ([T1], [T2], [CHANGE]) and text prompts to guide change category identification, eliminating predefined classification heads and enabling knowledge acquisition from multi-source datasets with conflicting class definitions.

Result: Achieved SOTA performance on four benchmarks: WHU-CD (90.41 IoU), S2Looking (53.04 IoU), LEVIR-CD+ (78.87 IoU), and SECOND (57.62 IoU), surpassing all previous methods.

Conclusion: UniChange successfully unifies BCD and SCD tasks through MLLM integration, demonstrating superior generalization and versatility while effectively handling multi-source datasets with conflicting class definitions.

Abstract: Change detection (CD) is a fundamental task for monitoring and analyzing land cover dynamics. While recent high performance models and high quality datasets have significantly advanced the field, a critical limitation persists. Current models typically acquire limited knowledge from single-type annotated data and cannot concurrently leverage diverse binary change detection (BCD) and semantic change detection (SCD) datasets. This constraint leads to poor generalization and limited versatility. The recent advancements in Multimodal Large Language Models (MLLMs) introduce new possibilities for a unified CD framework. We leverage the language priors and unification capabilities of MLLMs to develop UniChange, the first MLLM-based unified change detection model. UniChange integrates generative language abilities with specialized CD functionalities. Our model successfully unifies both BCD and SCD tasks through the introduction of three special tokens: [T1], [T2], and [CHANGE]. Furthermore, UniChange utilizes text prompts to guide the identification of change categories, eliminating the reliance on predefined classification heads. This design allows UniChange to effectively acquire knowledge from multi-source datasets, even when their class definitions conflict. Experiments on four public benchmarks (WHU-CD, S2Looking, LEVIR-CD+, and SECOND) demonstrate SOTA performance, achieving IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively, surpassing all previous methods. The code is available at https://github.com/Erxucomeon/UniChange.

[200] Canvas-to-Image: Compositional Image Generation with Multimodal Controls

Yusuf Dalva, Guocheng Gordon Qian, Maya Goldenberg, Tsai-Shien Chen, Kfir Aberman, Sergey Tulyakov, Pinar Yanardag, Kuan-Chieh Jackson Wang

Main category: cs.CV

TL;DR: Canvas-to-Image is a unified framework that integrates multiple control signals (text, subject references, spatial arrangements, poses, layouts) into a single canvas interface for high-fidelity image generation with better compositional control.

Details

Motivation: Current diffusion models struggle with simultaneous multi-modal control when users specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations together.

Method: Encode diverse control signals into a single composite canvas image and use Multi-Task Canvas Training to optimize diffusion models for joint understanding of heterogeneous controls within a unified learning paradigm.

Result: Significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.

Conclusion: Canvas-to-Image enables faithful reflection of user intent through integrated visual-spatial reasoning across multiple control modalities, generalizing well to multi-control scenarios during inference.

Abstract: While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent. Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task-specific heuristics, and it generalizes well to multi-control scenarios during inference. Extensive experiments show that Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.

[201] Think Visually, Reason Textually: Vision-Language Synergy in ARC

Beichen Zhang, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang

Main category: cs.CV

TL;DR: The paper introduces a vision-language synergy approach for abstract reasoning in ARC-AGI tasks, achieving 4.33% improvement over text-only baselines by leveraging complementary strengths of vision and language across different reasoning stages.

Details

Motivation: Current foundation models fail at inferring structured transformation rules from minimal examples, which is a key hallmark of human intelligence. Existing methods treat ARC-AGI as purely textual reasoning, overlooking humans' reliance on visual abstraction.

Method: Two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR) decomposes ARC-AGI into modality-aligned subtasks, and (2) Modality-Switch Self-Correction (MSSC) uses vision to verify text-based reasoning for error correction.

Result: The approach yields up to 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks.

Conclusion: Unifying visual abstraction with linguistic reasoning is crucial for achieving generalizable, human-like intelligence in future foundation models.

Abstract: Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models. Source code is released at https://github.com/InternLM/ARC-VL.

[202] TimeViper: A Hybrid Mamba-Transformer Vision-Language Model for Efficient Long Video Understanding

Boshen Xu, Zihan Xiao, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Jian Luan, Qin Jin

Main category: cs.CV

TL;DR: TimeViper is a hybrid Mamba-Transformer model for long video understanding that processes 10,000+ frames through vision token compression and transfer mechanisms.

Details

Motivation: Long video understanding requires efficient architectures and effective temporal context handling, with current models struggling with vision token redundancy in extended sequences.

Method: Uses hybrid Mamba-Transformer backbone with TransV module to transfer and compress vision tokens into instruction tokens while maintaining multimodal understanding capabilities.

Result: Achieves competitive performance with state-of-the-art models while processing hour-long videos exceeding 10,000 frames, with analysis of attention behaviors in hybrid architectures.

Conclusion: Represents initial progress in developing, interpreting, and compressing hybrid Mamba-Transformer models for long video understanding tasks.

Abstract: We introduce TimeViper, a hybrid vision-language model designed to tackle challenges of long video understanding. Processing long videos demands both an efficient model architecture and an effective mechanism for handling extended temporal contexts. To this end, TimeViper adopts a hybrid Mamba-Transformer backbone that combines the efficiency of state-space models with the expressivity of attention mechanisms. Through this hybrid design, we reveal the vision-to-text information aggregation phenomenon, where information progressively flows from vision tokens to text tokens across increasing LLM depth, resulting in severe vision token redundancy. Motivated by this observation, we propose TransV, a token information transfer module that transfers and compresses vision tokens into instruction tokens while maintaining multimodal understanding capabilities. This design enables TimeViper to process hour-long videos exceeding 10,000 frames. Extensive experiments across multiple benchmarks demonstrate that TimeViper competes with state-of-the-art models while extending frame numbers. We further analyze attention behaviors of both Mamba and Transformer layers, offering new insights into hybrid model interpretability. This work represents an initial step towards developing, interpreting, and compressing hybrid Mamba-Transformer architectures.

[203] ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition

Mengqi Xue, Qihan Huang, Haofei Zhang, Jingwen Hu, Jie Song, Mingli Song, Canghong Jin

Main category: cs.CV

TL;DR: ProtoPFormer addresses the ‘distraction’ problem in transformer-based prototype networks by introducing global and local prototypes that mutually guide each other to focus on foreground objects and improve interpretability.

Details

Motivation: When applying ProtoPNet to vision transformers (ViTs), prototypes get distracted by background features and pay less attention to foreground objects, impairing interpretability due to ViT's long-range dependency modeling.

Method: Proposes global and local prototypes that work together - global prototypes provide object-level guidance to help local prototypes focus on foreground, while local prototypes capture specific visual parts with explicit supervision.

Result: The method achieves superior performance and visualization results over state-of-the-art prototype-based baselines, with global and local prototypes mutually correcting each other for transparent decision-making.

Conclusion: ProtoPFormer effectively applies prototype-based methods to ViTs by leveraging global and local prototypes that jointly reason decisions from whole and local perspectives, improving both performance and interpretability.

Abstract: Prototypical part network (ProtoPNet) has drawn wide attention and boosted many follow-up studies due to its self-explanatory property for explainable artificial intelligence (XAI). However, when directly applying ProtoPNet on vision transformer (ViT) backbones, learned prototypes have a “distraction” problem: they have a relatively high probability of being activated by the background and pay less attention to the foreground. The powerful capability of modeling long-term dependency makes the transformer-based ProtoPNet hard to focus on prototypical parts, thus severely impairing its inherent interpretability. This paper proposes prototypical part transformer (ProtoPFormer) for appropriately and effectively applying the prototype-based method with ViTs for interpretable image recognition. The proposed method introduces global and local prototypes for capturing and highlighting the representative holistic and partial features of targets according to the architectural characteristics of ViTs. The global prototypes are adopted to provide the global view of objects to guide local prototypes to concentrate on the foreground while eliminating the influence of the background. Afterwards, local prototypes are explicitly supervised to concentrate on their respective prototypical visual parts, increasing the overall interpretability. Extensive experiments demonstrate that our proposed global and local prototypes can mutually correct each other and jointly make final decisions, which faithfully and transparently reason the decision-making processes associatively from the whole and local perspectives, respectively. Moreover, ProtoPFormer consistently achieves superior performance and visualization results over the state-of-the-art (SOTA) prototype-based baselines. Our code has been released at https://github.com/zju-vipa/ProtoPFormer.

[204] LTD: Low Temperature Distillation for Gradient Masking-free Adversarial Training

Erh-Chung Chen, Che-Rung Lee

Main category: cs.CV

TL;DR: LTD (Low-Temperature Distillation) improves adversarial robustness by refining one-hot labels using low teacher temperature while keeping student temperature fixed, addressing data ambiguity and gradient masking issues.

Details

Motivation: One-hot label representations in image classification are imprecise due to real-world data ambiguity where samples exhibit characteristics of multiple classes, leading to model vulnerabilities against adversarial attacks.

Method: Introduces Low-Temperature Distillation (LTD) that uses a relatively low temperature in the teacher model while maintaining a fixed temperature for the student model during both training and inference, refining label representations.

Result: Achieved robust accuracy of 58.19% on CIFAR-10, 31.13% on CIFAR-100, and 42.08% on ImageNet without additional data, showing improved adversarial robustness.

Conclusion: LTD effectively addresses data distribution assumptions, enhances model robustness, and avoids gradient masking problems in defensive distillation, demonstrating superior performance when combined with existing frameworks.

Abstract: Adversarial training is a widely adopted strategy to bolster the robustness of neural network models against adversarial attacks. This paper revisits the fundamental assumptions underlying image classification and suggests that representing data as one-hot labels is a key factor that leads to vulnerabilities. However, in real-world datasets, data ambiguity often arises, with samples exhibiting characteristics of multiple classes, rendering one-hot label representations imprecise. To address this, we introduce a novel approach, Low-Temperature Distillation (LTD), designed to refine label representations. Unlike previous approaches, LTD incorporates a relatively low temperature in the teacher model, while maintaining a fixed temperature for the student model during both training and inference. This strategy not only refines assumptions about data distribution but also strengthens model robustness and avoids the gradient masking problem commonly encountered in defensive distillation. Experimental results demonstrate the efficacy of the proposed method when combined with existing frameworks, achieving robust accuracy rates of 58.19%, 31.13%, and 42.08% on the CIFAR-10, CIFAR-100, and ImageNet datasets, respectively, without the need for additional data.

[205] AMLP: Adjustable Masking Lesion Patches for Self-Supervised Medical Image Segmentation

Xiangtao Wang, Ruizhi Wang, Thomas Lukasiewicz, Zhenghua Xu

Main category: cs.CV

TL;DR: AMLP is a self-supervised medical image segmentation framework that addresses challenges in applying masked image modeling to medical images through adjustable masking strategies and specialized loss functions.

Details

Motivation: Direct application of masked image modeling (MIM) to medical images fails due to their complexity, distinct contour features, and conventional fixed masking ratios that limit learnable information.

Method: Proposes Adjustable Masking Lesion Patches (AMLP) with Masked Patch Selection, Relative Reconstruction Loss, Category Consistency Loss, and Adjustable Masking Ratio to identify lesion patches and improve reconstruction.

Result: Extensive experiments on two medical segmentation datasets show superior performance compared to state-of-the-art self-supervised methods.

Conclusion: AMLP effectively addresses masked modeling challenges for medical images and captures accurate lesion details crucial for segmentation tasks.

Abstract: Self-supervised masked image modeling (MIM) methods have shown promising performances on analyzing natural images. However, directly applying such methods to medical image segmentation tasks still cannot achieve satisfactory results. The challenges arise from the facts that (i) medical images are inherently more complex compared to natural images, and the subjects in medical images often exhibit more distinct contour features; (ii) moreover, the conventional high and fixed masking ratio in MIM is likely to mask the background, limiting the scope of learnable information. To address these problems, we propose a new self-supervised medical image segmentation framework, called Adjustable Masking Lesion Patches (AMLP), which employs Masked Patch Selection~(MPS) strategy to identify patches with high probabilities of containing lesions to help model achieve precise lesion reconstruction. To improve the categorization of patches in MPS, we further introduce Relative Reconstruction Loss (RRL) to better learn hard-to-reconstruct lesion patches. Then, Category Consistency Loss (CCL) is proposed to refine patch categorization based on reconstruction difficulty, enhancing difference between lesions and backgrounds. Moreover, an Adjustable Masking Ratio (AMR) strategy is proposed to gradually increase the masking ratio over training to expand~~the scope of learnable mutual information. Extensive~~experiments on two medical segmentation datasets demonstrate the superior performances of the proposed AMLP w.r.t. the SOTA self-supervised methods; the results prove that AMLP effectively addresses the challenges of applying masked modeling to medical images and capturing accurate lesion details that are crucial for segmentation tasks.

[206] Restoration-Oriented Video Frame Interpolation with Region-Distinguishable Priors from SAM

Yan Han, Xiaogang Xu, Yingqi Lin, Jiafei Wu, Zhe Liu, Ming-Hsuan Yang

Main category: cs.CV

TL;DR: The paper introduces Region-Distinguishable Priors (RDPs) using SAM2 segmentation to improve motion estimation in Video Frame Interpolation by distinguishing regions before motion estimation, achieving better intermediate frame synthesis.

Details

Motivation: Existing VFI methods struggle with motion estimation accuracy due to ambiguity in identifying corresponding areas between frames. Enhancing accuracy by distinguishing different regions before motion estimation is crucial.

Method: Uses SAM2 segmentation to create RDPs as spatial-varying Gaussian mixtures, integrated via Hierarchical Region-aware Feature Fusion Module (HRFFM) with RDP-guided Feature Normalization in a residual learning manner.

Result: HRFFM consistently enhances VFI performance across various scenes by making features exhibit similar representations for matched regions in neighboring frames.

Conclusion: The proposed RDP and HRFFM approach effectively improves motion estimation accuracy and intermediate frame synthesis in VFI methods.

Abstract: In existing restoration-oriented Video Frame Interpolation (VFI) approaches, the motion estimation between neighboring frames plays a crucial role. However, the estimation accuracy in existing methods remains a challenge, primarily due to the inherent ambiguity in identifying corresponding areas in adjacent frames for interpolation. Therefore, enhancing accuracy by distinguishing different regions before motion estimation is of utmost importance. In this paper, we introduce a novel solution involving the utilization of open-world segmentation models, e.g., SAM2 (Segment Anything Model2) for frames, to derive Region-Distinguishable Priors (RDPs) in different frames. These RDPs are represented as spatial-varying Gaussian mixtures, distinguishing an arbitrary number of areas with a unified modality. RDPs can be integrated into existing motion-based VFI methods to enhance features for motion estimation, facilitated by our designed play-and-plug Hierarchical Region-aware Feature Fusion Module (HRFFM). HRFFM incorporates RDP into various hierarchical stages of VFI’s encoder, using RDP-guided Feature Normalization (RDPFN) in a residual learning manner. With HRFFM and RDP, the features within VFI’s encoder exhibit similar representations for matched regions in neighboring frames, thus improving the synthesis of intermediate frames. Extensive experiments demonstrate that HRFFM consistently enhances VFI performance across various scenes.

[207] A Simple Framework Towards Vision-based Traffic Signal Control with Microscopic Simulation

Pan He, Quanyi Li, Xiaoyong Yuan, Bolei Zhou

Main category: cs.CV

TL;DR: Vision-based traffic signal control using computer vision for end-to-end learning, introducing TrafficDojo simulation framework integrating SUMO and MetaDrive for comprehensive evaluation.

Details

Motivation: Traditional traffic signal control relies on heuristics and predefined features, while vision-based approaches offer less dependency on these and enable end-to-end learning and optimization.

Method: Developed TrafficDojo simulation framework that integrates SUMO’s microscopic traffic flow with MetaDrive’s 3D driving simulator, establishing baseline algorithms including traditional and RL approaches.

Result: Created a versatile traffic environment for analyzing and evaluating traffic signal controllers across diverse conditions and scenarios.

Conclusion: This work provides insights for designing vision-based traffic signal control approaches and opens new research opportunities in the field.

Abstract: Traffic signal control (TSC) is crucial for reducing traffic congestion leading to smoother traffic flow, reduced idle time, and mitigated CO2 emissions. In this paper, we explore the computer vision approach for TSC that modulates on-road traffic flows through visual observation. Unlike traditional feature-based approaches, vision-based methods depend much less on heuristics and predefined features, bringing promising potentials for end-to-end learning and optimization of traffic signals. Thus, we introduce a simple traffic simulation framework called TrafficDojo towards vision-based TSC and its benchmark by integrating the microscopic traffic flow provided in SUMO into the 3D driving simulator MetaDrive. This proposed framework offers a versatile traffic environment for in-depth analysis and comprehensive evaluation of traffic signal controllers across diverse traffic conditions and scenarios. We establish and compare baseline algorithms including both traditional and Reinforcement Learning (RL) approaches. This work sheds light on the design and development of vision-based TSC approaches and opens up new research opportunities

[208] SOAP: Enhancing Spatio-Temporal Relation and Motion Information Capturing for Few-Shot Action Recognition

Wenbo Huang, Jinghui Zhang, Xuwei Qian, Zhen Wu, Meng Wang, Lei Zhang

Main category: cs.CV

TL;DR: SOAP is a plug-and-play architecture for few-shot action recognition that enhances spatio-temporal relations and motion information capture through frame tuples, achieving state-of-the-art performance on multiple benchmarks.

Details

Motivation: High frame-rate videos improve expression but reduce spatio-temporal relation density, requiring large datasets. Few-shot scenarios are common in real-world applications, but current methods separate spatial and temporal features and capture motion information inadequately from narrow perspectives.

Method: Proposes SOAP architecture that considers temporal connections between feature channels and spatio-temporal relations. Uses frame tuples with multiple frames to capture comprehensive motion information, combining tuples of diverse frame counts for broader perspective.

Result: SOAP-Net achieves new state-of-the-art performance on SthSthV2, Kinetics, UCF101, and HMDB51 benchmarks. Extensive evaluations demonstrate competitiveness, pluggability, generalization, and robustness.

Conclusion: The proposed SOAP architecture effectively addresses limitations in current few-shot action recognition methods by better integrating spatio-temporal features and capturing comprehensive motion information through frame tuples.

Abstract: High frame-rate (HFR) videos of action recognition improve fine-grained expression while reducing the spatio-temporal relation and motion information density. Thus, large amounts of video samples are continuously required for traditional data-driven training. However, samples are not always sufficient in real-world scenarios, promoting few-shot action recognition (FSAR) research. We observe that most recent FSAR works build spatio-temporal relation of video samples via temporal alignment after spatial feature extraction, cutting apart spatial and temporal features within samples. They also capture motion information via narrow perspectives between adjacent frames without considering density, leading to insufficient motion information capturing. Therefore, we propose a novel plug-and-play architecture for FSAR called Spatio-tempOral frAme tuPle enhancer (SOAP) in this paper. The model we designed with such architecture refers to SOAP-Net. Temporal connections between different feature channels and spatio-temporal relation of features are considered instead of simple feature extraction. Comprehensive motion information is also captured, using frame tuples with multiple frames containing more motion information than adjacent frames. Combining frame tuples of diverse frame counts further provides a broader perspective. SOAP-Net achieves new state-of-the-art performance across well-known benchmarks such as SthSthV2, Kinetics, UCF101, and HMDB51. Extensive empirical evaluations underscore the competitiveness, pluggability, generalization, and robustness of SOAP. The code is released at https://github.com/wenbohuang1002/SOAP.

[209] Activator: GLU Activation Function as the Core Component of a Vision Transformer

Abdullah Nazhat Abdullah, Tarkan Aydin

Main category: cs.CV

TL;DR: This paper proposes replacing the MLP and attention mechanisms in transformers with a gated linear unit (GLU) architecture to reduce computational costs while maintaining competitive performance.

Details

Motivation: Transformers have achieved great success in NLP and CV, but their reliance on computationally expensive scaled dot product attention with softmax requires large compute capabilities for training and inference.

Method: Substitute the traditional MLP and attention mechanism in transformers with an architecture based on incorporating a gated linear unit (GLU) activation function structure.

Result: Experimental assessments show the proposed GLU-based modification offers competitive performance compared to baseline architectures while achieving targeted reductions in computational complexity.

Conclusion: GLU-based MLPs provide a more efficient but capable alternative to traditional MLP and attention mechanisms as core components in transformer architecture design.

Abstract: The transformer architecture has driven many successes in a variety of tasks within the field of deep learning, in particular the recent advances in natural language processing (NLP) culminating with large language models (LLM). Adding to that success, transformer architecture has found widespread interest from computer vision (CV) researchers and practitioners, allowing for many advancements in vision-related tasks and opening the door for multitask and multi-modal deep learning architectures that share the same principle of operation. One drawback to these architectures is their reliance on the scaled dot product attention mechanism with the softmax activation function, which is computationally expensive and requires large compute capabilities for both training and inference. This paper investigates substituting the MLP and attention mechanism usually adopted for transformer architecture with an architecture based on incorporating a gated linear unit (GLU) activation function structure with the aim of reducing the computational cost. The equalized experimental assessments conducted in this work show that the proposed modification with the targeted reductions in computational complexity offers competitive performance compared to the selected baseline architectures. The results are significantly in support of the aims of this work, in which the focus was to extensively utilize GLU-based MLPs, establishing a more efficient but capable alternative to the traditional MLP and the attention mechanism as the core component in the design of transformer architectures.

[210] A Gray-box Attack against Latent Diffusion Model-based Image Editing by Posterior Collapse

Zhongliang Guo, Chun Tong Lei, Lei Fang, Shuai Zhao, Yifei Qian, Jingyu Lin, Zeyu Wang, Cunjian Chen, Ognjen Arandjelović, Chun Pong Lau

Main category: cs.CV

TL;DR: PCA is a novel framework that protects images from unauthorized manipulation in LDMs by exploiting posterior collapse phenomena in VAEs, requiring only VAE encoder access and achieving prompt-invariant protection with high efficiency.

Details

Motivation: To address limitations of existing adversarial protection methods that rely heavily on model-specific knowledge and have high computational costs, while preventing data misappropriation and IP infringement in LDMs.

Method: Identifies diffusion and concentration collapse phenomena in VAE inference, designs unified loss function to achieve both collapse types through parameter adjustment, operates on VAE encoder before text conditioning.

Result: Outperforms existing techniques in protection effectiveness, computational efficiency (runtime and VRAM), and generalization across VAE-based LDM variants; requires only 4% of LDM parameters.

Conclusion: PCA provides an efficient, transferable solution for image protection in LDMs by leveraging VAE posterior collapse, significantly reducing model dependency while maintaining strong protection against unauthorized manipulation.

Abstract: Recent advancements in Latent Diffusion Models (LDMs) have revolutionized image synthesis and manipulation, raising significant concerns about data misappropriation and intellectual property infringement. While adversarial attacks have been extensively explored as a protective measure against such misuse of generative AI, current approaches are severely limited by their heavy reliance on model-specific knowledge and substantial computational costs. Drawing inspiration from the posterior collapse phenomenon observed in VAE training, we propose the Posterior Collapse Attack (PCA), a novel framework for protecting images from unauthorized manipulation. Through comprehensive theoretical analysis and empirical validation, we identify two distinct collapse phenomena during VAE inference: diffusion collapse and concentration collapse. Based on this discovery, we design a unified loss function that can flexibly achieve both types of collapse through parameter adjustment, each corresponding to different protection objectives in preventing image manipulation. Our method significantly reduces dependence on model-specific knowledge by requiring access to only the VAE encoder, which constitutes less than 4% of LDM parameters. Notably, PCA achieves prompt-invariant protection by operating on the VAE encoder before text conditioning occurs, eliminating the need for empty prompt optimization required by existing methods. This minimal requirement enables PCA to maintain adequate transferability across various VAE-based LDM architectures while effectively preventing unauthorized image editing. Extensive experiments show PCA outperforms existing techniques in protection effectiveness, computational efficiency (runtime and VRAM), and generalization across VAE-based LDM variants. Our code is available at https://github.com/ZhongliangGuo/PosteriorCollapseAttack.

[211] Interactive Occlusion Boundary Estimation through Exploitation of Synthetic Data

Lintao Xu, Chaohui Wang

Main category: cs.CV

TL;DR: MS³PE is a multi-scribble-guided deep learning framework for interactive occlusion boundary estimation, featuring intuitive multi-scribble interactions and a 3-encoding-path network with multi-scale strip convolutions. The paper also introduces synthetic data generation tool Mesh2OB and new benchmarks OB-FUTURE and OB-LIGM.

Details

Motivation: To address the challenge of occlusion boundary estimation in scene understanding, particularly the lack of systematic interactive methods and scarcity of well-annotated real-world data for training and evaluation.

Method: Proposed MS³PE framework with multi-scribble interaction mechanism and 3-encoding-path network enhanced with multi-scale strip convolutions. Developed Mesh2OB tool for automated generation of precise ground-truth occlusion boundaries from 3D scenes with self-occlusions handled.

Result: MS³PE outperforms adapted baselines from seven state-of-the-art interactive segmentation methods. Created OB-FUTURE synthetic benchmark for generalizable training without domain adaptation, and OB-LIGM real-world benchmark with 120 high-resolution annotated images.

Conclusion: The work establishes the first systematic study of interactive occlusion boundary estimation, providing effective tools, frameworks, and benchmarks that advance the field and enable better scene understanding through improved occlusion boundary detection.

Abstract: Occlusion boundaries (OBs) geometrically localize occlusion events in 2D images and provide critical cues for scene understanding. In this paper, we present the first systematic study of Interactive Occlusion Boundary Estimation (IOBE), introducing MS\textsuperscript{3}PE, a novel multi-scribble-guided deep-learning framework that advances IOBE through two key innovations: (1) an intuitive multi-scribble interaction mechanism, and (2) a 3-encoding-path network enhanced with multi-scale strip convolutions. Our MS\textsuperscript{3}PE surpasses adapted baselines from seven state-of-the-art interactive segmentation methods, and demonstrates strong potential for OB benchmark construction through our real-user experiment. Besides, to address the scarcity of well-annotated real-world data, we propose using synthetic data for training IOBE models, and developed Mesh2OB, the first automated tool for generating precise ground-truth OBs from 3D scenes with self-occlusions explicitly handled, enabling creation of the OB-FUTURE synthetic benchmark that facilitates generalizable training without domain adaptation. Finally, we introduce OB-LIGM, a high-quality real-world benchmark comprising 120 meticulously annotated high-resolution images advancing evaluation standards in OB research. Source code and resources are available at https://github.com/xul-ops/IOBE.

[212] Open Vocabulary Monocular 3D Object Detection

Jin Yao, Hao Gu, Xuweiyi Chen, Jiayun Wang, Zezhou Cheng

Main category: cs.CV

TL;DR: Open-vocabulary monocular 3D detection from single RGB images using pretrained foundation models to overcome 3D annotation scarcity and semantic ambiguities.

Details

Motivation: Existing 3D detectors rely on costly sensors (LiDAR) or multi-view setups, and are limited to closed vocabularies, restricting real-world applicability.

Method: Integrates pretrained 2D and 3D vision foundation models to reduce dependence on 3D supervision, with a novel evaluation metric to address missing labels and semantic ambiguities.

Result: Achieves state-of-the-art results in both zero-shot 3D detection of novel categories and in-domain detection on seen classes.

Conclusion: Provides a strong baseline and establishes a reliable benchmark for open-vocabulary monocular 3D detection research.

Abstract: We propose and study open-vocabulary monocular 3D detection, a novel task that aims to detect objects of any categores in metric 3D space from a single RGB image. Existing 3D object detectors either rely on costly sensors such as LiDAR or multi-view setups, or remain confined to closed vocabularies settings with limited categories, restricting their applicability. We identify two key challenges in this new setting. First, the scarcity of 3D bounding box annotations limits the ability to train generalizable models. To reduce dependence on 3D supervision, we propose a framework that effectively integrates pretrained 2D and 3D vision foundation models. Second, missing labels and semantic ambiguities (\eg, table vs. desk) in existing datasets hinder reliable evaluation. To address this, we design a novel metric that captures model performance while mitigating annotation issues. Our approach achieves state-of-the-art results in zero-shot 3D detection of novel categories as well as in-domain detection on seen classes. We hope our method provides a strong baseline and our evaluation protocol establishes a reliable benchmark for future research.

[213] Active Negative Loss: A Robust Framework for Learning with Noisy Labels

Xichen Ye, Yifan Wu, Yiqi Wang, Xiaoqiang Li, Weizhong Zhang, Yifan Chen

Main category: cs.CV

TL;DR: The paper introduces Normalized Negative Loss Functions (NNLFs) to replace MAE in the Active Passive Loss framework, creating Active Negative Loss (ANL) for better robustness against label noise in deep learning.

Details

Motivation: Existing noise-robust loss functions like APL with MAE pay equal attention to clean and noisy samples, slowing convergence and making training difficult in large-scale datasets with noisy labels.

Method: Proposed NNLFs as passive loss functions in APL framework, creating ANL. Also introduced entropy-based regularization for non-symmetric noise scenarios to handle label imbalance.

Result: Extensive experiments show ANL with NNLFs achieves better or comparable performance to state-of-the-art methods across various label noise types and in image segmentation tasks.

Conclusion: The proposed ANL framework with NNLFs effectively addresses MAE’s limitations by focusing more on memorized clean samples, providing improved robustness against label noise in deep learning.

Abstract: Deep supervised learning has achieved remarkable success across a wide range of tasks, yet it remains susceptible to overfitting when confronted with noisy labels. To address this issue, noise-robust loss functions offer an effective solution for enhancing learning in the presence of label noise. In this work, we systematically investigate the limitation of the recently proposed Active Passive Loss (APL), which employs Mean Absolute Error (MAE) as its passive loss function. Despite the robustness brought by MAE, one of its key drawbacks is that it pays equal attention to clean and noisy samples; this feature slows down convergence and potentially makes training difficult, particularly in large-scale datasets. To overcome these challenges, we introduce a novel loss function class, termed Normalized Negative Loss Functions (NNLFs), which serve as passive loss functions within the APL framework. NNLFs effectively address the limitations of MAE by concentrating more on memorized clean samples. By replacing MAE in APL with our proposed NNLFs, we enhance APL and present a new framework called Active Negative Loss (ANL). Moreover, in non-symmetric noise scenarios, we propose an entropy-based regularization technique to mitigate the vulnerability to the label imbalance. Extensive experiments demonstrate that the new loss functions adopted by our ANL framework can achieve better or comparable performance to state-of-the-art methods across various label noise types and in image segmentation tasks. The source code is available at: https://github.com/Virusdoll/Active-Negative-Loss.

[214] Unsupervised Segmentation by Diffusing, Walking and Cutting

Daniela Ivanova, Marco Aversa, Paul Henderson, John Williamson

Main category: cs.CV

TL;DR: Unsupervised image segmentation using self-attention from pre-trained diffusion models, achieving state-of-the-art zero-shot performance without training.

Details

Motivation: To leverage the rich semantic relationships captured in pre-trained text-to-image diffusion models' self-attention layers for unsupervised segmentation, avoiding the need for additional training.

Method: Construct adjacency matrices from self-attention layers between image patches and recursively partition using Normalised Cuts, treating self-attention probabilities as transition matrices for random walks.

Result: Achieves state-of-the-art results on COCO-Stuff-27 and Cityscapes for zero-shot unsupervised segmentation, surpassing all existing methods.

Conclusion: Pre-trained diffusion models’ self-attention layers provide powerful semantic features for unsupervised segmentation, with the random walk interpretation effectively capturing long-range relationships between image patches.

Abstract: We propose an unsupervised image segmentation method using features from pre-trained text-to-image diffusion models. Inspired by classic spectral clustering approaches, we construct adjacency matrices from self-attention layers between image patches and recursively partition using Normalised Cuts. A key insight is that self-attention probability distributions, which capture semantic relations between patches, can be interpreted as a transition matrix for random walks across the image. We leverage this by first using Random Walk Normalized Cuts directly on these self-attention activations to partition the image, minimizing transition probabilities between clusters while maximizing coherence within clusters. Applied recursively, this yields a hierarchical segmentation that reflects the rich semantics in the pre-trained attention layers, without any additional training. Next, we explore other ways to build the NCuts adjacency matrix from features, and how we can use the random walk interpretation of self-attention to capture long-range relationships. Finally, we propose an approach to automatically determine the NCut cost criterion, avoiding the need to tune this manually. We quantitatively analyse the effect incorporating different features, a constant versus dynamic NCut threshold, and incorporating multi-node paths when constructing the NCuts adjacency matrix. We show that our approach surpasses all existing methods for zero-shot unsupervised segmentation, achieving state-of-the-art results on COCO-Stuff-27 and Cityscapes.

Abhinav Pratap, Sushant Kumar, Suchinton Chakravarty

Main category: cs.CV

TL;DR: Evaluation of four real-time object detection algorithms (YOLO, SSD, Faster R-CNN, Mask R-CNN) for indoor navigation assistance for visually impaired individuals, analyzing accuracy, speed, and adaptability trade-offs.

Details

Motivation: Address the need for accurate and efficient object detection in assistive technologies for visually impaired individuals to enhance indoor navigation solutions and promote accessibility.

Method: Evaluate four real-time object detection algorithms using the Indoor Objects Detection dataset, analyzing detection accuracy, processing speed, and adaptability to indoor environments.

Result: Findings highlight trade-offs between precision and efficiency in object detection algorithms, providing insights for selecting optimal algorithms for real-time assistive navigation.

Conclusion: This research advances adaptive machine learning applications for enhancing indoor navigation solutions for the visually impaired, contributing to improved accessibility through optimized algorithm selection.

Abstract: This study addresses the need for accurate and efficient object detection in assistive technologies for visually impaired individuals. We evaluate four real-time object detection algorithms YOLO, SSD, Faster R-CNN, and Mask R-CNN within the context of indoor navigation assistance. Using the Indoor Objects Detection dataset, we analyze detection accuracy, processing speed, and adaptability to indoor environments. Our findings highlight the trade-offs between precision and efficiency, offering insights into selecting optimal algorithms for realtime assistive navigation. This research advances adaptive machine learning applications, enhancing indoor navigation solutions for the visually impaired and promoting accessibility.

[216] Gen-3Diffusion: Realistic Image-to-3D Generation via 2D & 3D Diffusion Synergy

Yuxuan Xue, Xianghui Xie, Riccardo Marin, Gerard Pons-Moll

Main category: cs.CV

TL;DR: Gen-3Diffusion is a novel method that synergizes 2D and 3D diffusion models to generate realistic 3D objects and avatars from single RGB images, addressing the challenge of 3D consistency in multi-view generation.

Details

Motivation: Single image 3D generation is ill-posed and existing 2D diffusion models lack 3D consistency in multi-view generation, requiring a solution that combines the generalization of 2D models with the 3D consistency of 3D models.

Method: Proposes Gen-3Diffusion that synchronizes pre-trained 2D and 3D diffusion models through an elegantly designed training and sampling process, enabling mutual enhancement between 2D generalization and 3D consistency.

Result: Generates realistic 3D objects and avatars with high-fidelity geometry and texture, demonstrating strong generalization to diverse clothing and compositional shapes through extensive experiments.

Conclusion: The synergy between 2D and 3D diffusion models effectively addresses the challenges of single image 3D generation, producing high-quality 3D content with both generalization capability and multi-view consistency.

Abstract: Creating realistic 3D objects and clothed avatars from a single RGB image is an attractive yet challenging problem. Due to its ill-posed nature, recent works leverage powerful prior from 2D diffusion models pretrained on large datasets. Although 2D diffusion models demonstrate strong generalization capability, they cannot guarantee the generated multi-view images are 3D consistent. In this paper, we propose Gen-3Diffusion: Realistic Image-to-3D Generation via 2D & 3D Diffusion Synergy. We leverage a pre-trained 2D diffusion model and a 3D diffusion model via our elegantly designed process that synchronizes two diffusion models at both training and sampling time. The synergy between the 2D and 3D diffusion models brings two major advantages: 1) 2D helps 3D in generalization: the pretrained 2D model has strong generalization ability to unseen images, providing strong shape priors for the 3D diffusion model; 2) 3D helps 2D in multi-view consistency: the 3D diffusion model enhances the 3D consistency of 2D multi-view sampling process, resulting in more accurate multi-view generation. We validate our idea through extensive experiments in image-based objects and clothed avatar generation tasks. Results show that our method generates realistic 3D objects and avatars with high-fidelity geometry and texture. Extensive ablations also validate our design choices and demonstrate the strong generalization ability to diverse clothing and compositional shapes. Our code and pretrained models will be publicly released on https://yuxuan-xue.com/gen-3diffusion.

[217] Without Paired Labeled Data: End-to-End Self-Supervised Learning for Drone-view Geo-Localization

Zhongwei Chen, Zhao-Xu Yang, Hai-Jun Rong, Guoqi Li

Main category: cs.CV

TL;DR: DMNIL is a self-supervised method for drone-view geo-localization that uses dynamic memory and neighborhood information learning to overcome the need for pre-paired drone-satellite images, achieving state-of-the-art performance without supervised training.

Details

Motivation: Existing drone-view geo-localization methods require expensive pre-paired drone-satellite images and lack transferability to new regions, limiting practical deployment in open-world scenarios.

Method: Uses clustering for pseudo-labels and dual-path contrastive learning. Includes DHML module for memory-driven feature consistency and ICEL module for neighborhood-driven cross-view alignment, plus pseudo-label enhancement for training stability.

Result: Outperforms existing self-supervised methods and surpasses several state-of-the-art supervised methods on three public benchmark datasets.

Conclusion: DMNIL provides an effective self-supervised solution for drone geo-localization that eliminates the need for costly paired annotations while achieving competitive performance.

Abstract: Drone-view Geo-Localization (DVGL) aims to achieve accurate localization of drones by retrieving the most relevant GPS-tagged satellite images. However, most existing methods heavily rely on strictly pre-paired drone-satellite images for supervised learning. When the target region shifts, new paired samples are typically required to adapt to the distribution changes. The high cost of annotation and the limited transferability of these methods significantly hinder the practical deployment of DVGL in open-world scenarios. To address these limitations, we propose a novel end-to-end self-supervised learning method with a shallow backbone network, called the dynamic memory-driven and neighborhood information learning (DMNIL) method. It employs a clustering algorithm to generate pseudo-labels and adopts a dual-path contrastive learning framework to learn discriminative intra-view representations. Furthermore, DMNIL incorporates two core modules, including the dynamic hierarchical memory learning (DHML) module and the information consistency evolution learning (ICEL) module. The DHML module combines short-term and long-term memory to enhance intra-view feature consistency and discriminability. Meanwhile, the ICEL module utilizes a neighborhood-driven dynamic constraint mechanism to systematically capture implicit cross-view semantic correlations, consequently improving cross-view feature alignment. To further stabilize and strengthen the self-supervised training process, a pseudo-label enhancement strategy is introduced to enhance the quality of pseudo supervision. Extensive experiments on three public benchmark datasets demonstrate that the proposed method consistently outperforms existing self-supervised methods and even surpasses several state-of-the-art supervised methods. Our code is available at https://github.com/ISChenawei/DMNIL.

[218] Systematic Evaluation and Guidelines for Segment Anything Model in Surgical Video Analysis

Cheng Yuan, Jian Jiang, Kunyi Yang, Lv Wu, Rui Wang, Zi Meng, Haonan Ping, Ziyu Xu, Yifan Zhou, Wanli Song, Hesheng Wang, Yueming Jin, Qi Dou, Yutong Ban

Main category: cs.CV

TL;DR: First comprehensive evaluation of SAM2’s zero-shot capability for surgical video segmentation across 9 datasets and 17 surgery types, showing notable adaptability in structured scenarios but performance gaps in dynamic surgical conditions.

Details

Motivation: Surgical video segmentation is critical for AI in surgery but limited by annotated data. SAM2 offers potential for zero-shot surgical segmentation, but its applicability in complex surgical environments remains unexplored.

Method: Comprehensive evaluation of SAM2’s zero-shot capability across 9 surgical datasets covering laparoscopic, endoscopic, and robotic procedures. Analysis includes various prompting strategies (points, boxes, masks) and finetuning approaches, plus robustness testing to surgical challenges.

Result: SAM2 demonstrates notable zero-shot adaptability in structured scenarios like instrument segmentation, multi-organ segmentation, and scene segmentation, but performance varies under dynamic surgical conditions with gaps in handling temporal coherence and domain-specific artifacts.

Conclusion: Results highlight future pathways to adaptive data-efficient solutions for surgical data science, showing SAM2’s potential while identifying areas needing improvement for complex surgical environments.

Abstract: Surgical video segmentation is critical for AI to interpret spatial-temporal dynamics in surgery, yet model performance is constrained by limited annotated data. The SAM2 model, pretrained on natural videos, offers potential for zero-shot surgical segmentation, but its applicability in complex surgical environments, with challenges like tissue deformation and instrument variability, remains unexplored. We present the first comprehensive evaluation of the zero-shot capability of SAM2 in 9 surgical datasets (17 surgery types), covering laparoscopic, endoscopic, and robotic procedures. We analyze various prompting (points, boxes, mask) and {finetuning (dense, sparse) strategies}, robustness to surgical challenges, and generalization across procedures and anatomies. Key findings reveal that while SAM2 demonstrates notable zero-shot adaptability in structured scenarios (e.g., instrument segmentation, {multi-organ segmentation}, and scene segmentation), its performance varies under dynamic surgical conditions, highlighting gaps in handling temporal coherence and domain-specific artifacts. These results highlight future pathways to adaptive data-efficient solutions for the surgical data science field.

[219] LASER: Lip Landmark Assisted Speaker Detection for Robustness

Le Thien Phuc Nguyen, Zhuoran Yu, Yong Jae Lee

Main category: cs.CV

TL;DR: LASER improves Active Speaker Detection by incorporating lip landmarks during training to focus on speech-relevant regions, achieving state-of-the-art performance and strong robustness to background noise.

Details

Motivation: Existing ASD models often misclassify non-speaking instances when lip movements and audio are unsynchronized, while humans naturally rely on lip-audio synchronization.

Method: Extracts visual features and encodes 2D lip landmarks into dense maps, with an auxiliary consistency loss to align lip-aware and face-only predictions, eliminating need for landmark detectors at test time.

Result: Outperforms state-of-the-art models across in-domain and out-of-domain benchmarks. On high-noise subset of LASER-bench, improves mAP by 3.3 and 4.3 points over LoCoNet and TalkNet respectively.

Conclusion: LASER demonstrates strong resilience to real-world acoustic challenges and provides robust active speaker detection through explicit lip landmark guidance during training.

Abstract: Active Speaker Detection (ASD) aims to identify who is speaking in complex visual scenes. While humans naturally rely on lip-audio synchronization, existing ASD models often misclassify non-speaking instances when lip movements and audio are unsynchronized. To address this, we propose Lip landmark Assisted Speaker dEtection for Robustness (LASER), which explicitly incorporates lip landmarks during training to guide the model’s attention to speech-relevant regions. Given a face track, LASER extracts visual features and encodes 2D lip landmarks into dense maps. To handle failure cases such as low resolution or occlusion, we introduce an auxiliary consistency loss that aligns lip-aware and face-only predictions, removing the need for landmark detectors at test time. LASER outperforms state-of-the-art models across both in-domain and out-of-domain benchmarks. To further evaluate robustness in realistic conditions, we introduce LASER-bench, a curated dataset of modern video clips with varying levels of background noise. On the high-noise subset, LASER improves mAP by 3.3 and 4.3 points over LoCoNet and TalkNet, respectively, demonstrating strong resilience to real-world acoustic challenges.

[220] Bayesian Neural Networks for One-to-Many Mapping in Image Enhancement

Guoxi Huang, Qirui Yang, Ruirui Lin, Zipeng Qi, David Bull, Nantheera Anantrasirichai

Main category: cs.CV

TL;DR: Proposes Bayesian Enhancement Model (BEM) using Bayesian Neural Networks to handle one-to-many mapping in image enhancement tasks, with a BNN-DNN framework for fast inference.

Details

Motivation: Degraded images can correspond to multiple plausible target images due to dynamic photography conditions, creating a one-to-many mapping problem in image enhancement tasks like low-light and underwater enhancement.

Method: Bayesian Enhancement Model (BEM) with BNN-DNN framework: BNN models one-to-many mapping in low-dimensional space, followed by DNN that refines fine-grained image details for fast inference.

Result: Extensive experiments on multiple low-light and underwater image enhancement benchmarks demonstrate the method’s effectiveness.

Conclusion: The proposed BEM successfully addresses the one-to-many mapping problem in image enhancement by capturing data uncertainty through Bayesian Neural Networks while maintaining fast inference.

Abstract: In image enhancement tasks, such as low-light and underwater image enhancement, a degraded image can correspond to multiple plausible target images due to dynamic photography conditions. This naturally results in a one-to-many mapping problem. To address this, we propose a Bayesian Enhancement Model (BEM) that incorporates Bayesian Neural Networks (BNNs) to capture data uncertainty and produce diverse outputs. To enable fast inference, we introduce a BNN-DNN framework: a BNN is first employed to model the one-to-many mapping in a low-dimensional space, followed by a Deterministic Neural Network (DNN) that refines fine-grained image details. Extensive experiments on multiple low-light and underwater image enhancement benchmarks demonstrate the effectiveness of our method.

[221] Towards Consistent and Controllable Image Synthesis for Face Editing

Mengting Wei, Tuomas Varanka, Yante Li, Xingxun Jiang, Huai-Qian Khor, Guoying Zhao

Main category: cs.CV

TL;DR: RigFace: A diffusion-based face editing method using Stable-Diffusion and 3D face models to control lighting, expression, and pose while preserving identity.

Details

Motivation: Diffusion models face challenges in controlling specific attributes and preserving identity consistency in face editing tasks, while GAN-based methods have limitations in attribute control.

Method: Uses Spatial Attribute Encoder for decoupled background/pose/expression/lighting conditions, FaceFusion for identity feature transfer, and Attribute Rigger to inject conditions into SD denoising UNet.

Result: Achieves comparable or superior performance in identity preservation and photorealism compared to existing face editing models.

Conclusion: RigFace successfully addresses identity preservation and attribute control challenges in diffusion-based face editing through disentangled factor control.

Abstract: Face editing methods, essential for tasks like virtual avatars, digital human synthesis and identity preservation, have traditionally been built upon GAN-based techniques, while recent focus has shifted to diffusion-based models due to their success in image reconstruction. However, diffusion models still face challenges in controlling specific attributes and preserving the consistency of other unchanged attributes especially the identity characteristics. To address these issues and facilitate more convenient editing of face images, we propose a novel approach that leverages the power of Stable-Diffusion (SD) models and crude 3D face models to control the lighting, facial expression and head pose of a portrait photo. We observe that this task essentially involves the combinations of target background, identity and face attributes aimed to edit. We strive to sufficiently disentangle the control of these factors to enable consistency of face editing. Specifically, our method, coined as RigFace, contains: 1) A Spatial Attribute Encoder that provides presise and decoupled conditions of background, pose, expression and lighting; 2) A high-consistency FaceFusion method that transfers identity features from the Identity Encoder to the denoising UNet of a pre-trained SD model; 3) An Attribute Rigger that injects those conditions into the denoising UNet. Our model achieves comparable or even superior performance in both identity preservation and photorealism compared to existing face editing models.

[222] OuroMamba: A Data-Free Quantization Framework for Vision Mamba

Akshat Ramachandran, Mingyu Lee, Huan Xu, Souvik Kundu, Tushar Krishna

Main category: cs.CV

TL;DR: OuroMamba is the first data-free post-training quantization method for vision Mamba-based models that addresses challenges in generating meaningful synthetic data and handling dynamic outlier variations through a two-stage framework.

Details

Motivation: Vision Mamba-based models face two key challenges for data-free quantization: recurrent state transitions limit long-range interactions leading to weak synthetic data, and dynamic outlier variations across time-steps make static PTQ techniques ineffective.

Method: Two-stage framework: (1) OuroMamba-Gen uses contrastive learning on patch-level features from neighborhood interactions in latent state space to generate semantically rich synthetic data, (2) OuroMamba-Quant employs mixed-precision quantization with lightweight dynamic outlier detection using threshold-based outlier channel selection updated every time-step.

Result: OuroMamba surpasses existing data-driven PTQ techniques, achieving state-of-the-art performance across diverse quantization settings and vision/generative tasks, with practical latency speedup of up to 2.36x through efficient GPU kernels.

Conclusion: OuroMamba successfully enables data-free quantization for vision Mamba models by addressing their unique challenges through innovative synthetic data generation and dynamic quantization techniques, demonstrating superior performance over data-dependent methods.

Abstract: We present OuroMamba, the first data-free post-training quantization (DFQ) method for vision Mamba-based models (VMMs). We identify two key challenges in enabling DFQ for VMMs, (1) VMM’s recurrent state transitions restricts capturing of long-range interactions and leads to semantically weak synthetic data, (2) VMM activations exhibit dynamic outlier variations across time-steps, rendering existing static PTQ techniques ineffective. To address these challenges, OuroMamba presents a two-stage framework: (1) OuroMamba-Gen to generate semantically rich and meaningful synthetic data. It applies contrastive learning on patch level VMM features generated through neighborhood interactions in the latent state space, (2) OuroMamba-Quant to employ mixed-precision quantization with lightweight dynamic outlier detection during inference. In specific, we present a thresholding based outlier channel selection strategy for activations that gets updated every time-step. Extensive experiments across vision and generative tasks show that our data-free OuroMamba surpasses existing data-driven PTQ techniques, achieving state-of-the-art performance across diverse quantization settings. Additionally, we implement efficient GPU kernels to achieve practical latency speedup of up to 2.36x. Code and synthetic dataset are available here: https://github.com/georgia-tech-synergy-lab/ICCV-OuroMamba

[223] RobustMerge: Parameter-Efficient Model Merging for MLLMs with Direction Robustness

Fanhu Zeng, Haiyang Guo, Fei Zhu, Li Shen, Hao Tang

Main category: cs.CV

TL;DR: RobustMerge is a training-free parameter-efficient merging method that maintains direction robustness through complementary parameter adaptation, enabling effective merging of parameter-efficient tuned models without data leakage.

Details

Motivation: With the expansion of data and model sizes, parameter-efficient tuning has become common practice, but existing merging methods designed for full fine-tuning fail when applied to efficiently tuned models. There's a need for efficient merging methods that can handle parameter-efficient modules.

Method: The method analyzes low-rank decomposition and identifies direction robustness as crucial. It prunes parameters and scales coefficients from inter-parameter relations for singular values to maintain direction stability, and performs cross-task normalization to enhance generalization to unseen tasks.

Result: Experiments on a diverse multimodal task benchmark demonstrate outstanding performance and generalizability. The method effectively merges parameter-efficient tuned models while maintaining robustness and preventing task interference.

Conclusion: RobustMerge provides an effective training-free solution for merging parameter-efficient tuned models, addressing the gap in efficient merging methods and enabling multi-task capabilities without data leakage.

Abstract: Fine-tuning pre-trained models with custom data leads to numerous expert models on specific tasks. Merging models into one universal model to empower multi-task ability refraining from data leakage has gained popularity. With the expansion in data and model size, parameter-efficient tuning becomes the common practice for obtaining task-specific models efficiently. However, few methods are dedicated to efficient merging, and existing methods designed for full fine-tuning merging fail under efficient merging. To address the issue, we analyze from low-rank decomposition and reveal that direction robustness during merging is crucial for merging efficient modules. We furthermore uncover that compensating for the gap between stark singular values contributes to direction robustness. Therefore, we propose RobustMerge, a training-free parameter-efficient merging method with complementary parameter adaptation to maintain direction robustness. Specifically, we (1) prune parameters and scale coefficients from inter-parameter relation for singular values to maintain direction stability away from task interference, and (2) perform cross-task normalization to enhance unseen task generalization. We establish a benchmark consisting of diverse multimodal tasks, on which we conduct experiments to certify the outstanding performance and generalizability of our method. Additional studies and extensive analyses further showcase the effectiveness. Code is available at https://github.com/AuroraZengfh/RobustMerge.

[224] Class-Independent Increment: An Efficient Approach for Multi-label Class-Incremental Learning

Chenhao Ding, Songlin Dong, Zhengdong Zhou, Jizhou Han, Qiang Wang, Yuhang He, Yihong Gong

Main category: cs.CV

TL;DR: Proposes CLIN, a novel multi-label class-incremental learning method that uses class-specific tokens and embeddings to address feature confusion and catastrophic forgetting in multi-label scenarios.

Details

Motivation: Real-world applications often involve multi-label classification (e.g., image retrieval, medical imaging), but current class-incremental learning research mainly focuses on single-label tasks, creating a gap for practical multi-label scenarios.

Method: Develops class-independent increment (CLIN) with CINet that extracts multiple class-level embeddings using class-specific tokens, plus two novel loss functions to optimize token learning and distinguish between new/old classes.

Result: Extensive experiments on MS-COCO and PASCAL VOC datasets show improved recognition performance and reduced forgetting across various multi-label class-incremental learning tasks.

Conclusion: CLIN effectively addresses feature confusion and catastrophic forgetting in multi-label class-incremental learning, demonstrating superior performance compared to existing methods.

Abstract: Current research on class-incremental learning primarily focuses on single-label classification tasks. However, real-world applications often involve multi-label scenarios, such as image retrieval and medical imaging. Therefore, this paper focuses on the challenging yet practical multi-label class-incremental learning (MLCIL) problem. In addition to the challenge of catastrophic forgetting, MLCIL encounters issues related to feature confusion, encompassing inter-session and intra-feature confusion. To address these problems, we propose a novel MLCIL approach called class-independent increment (CLIN). Specifically, in contrast to existing methods that extract image-level features, we propose a class-independent incremental network (CINet) to extract multiple class-level embeddings for multi-label samples. It learns and preserves the knowledge of different classes by constructing class-specific tokens. On this basis, we develop two novel loss functions, optimizing the learning of class-specific tokens and class-level embeddings, respectively. These losses aim to distinguish between new and old classes, further alleviating the problem of feature confusion. Extensive experiments on MS-COCO and PASCAL VOC datasets demonstrate the effectiveness of our method for improving recognition performance and mitigating forgetting on various MLCIL tasks.

[225] From Limited Labels to Open Domains:An Efficient Learning Method for Drone-view Geo-Localization

Zhongwei Chen, Zhao-Xu Yang, Hai-Jun Rong, Jiawei Lang, Guoqi Li

Main category: cs.CV

TL;DR: CDIKTNet is a novel cross-domain invariant knowledge transfer network for drone-view geo-localization that combines limited supervision with unsupervised learning to address feature confusion and enable effective cross-domain adaptation without extensive retraining.

Details

Motivation: Traditional supervised methods require paired training data and struggle with cross-view correlations from unpaired data, while unsupervised methods suffer from feature confusion due to geographical similarity and spatial continuity, leading to unreliable pseudo-labels.

Method: Proposes CDIKTNet with two sub-networks: CDIS learns cross-view structural and spatial invariance from limited paired data, and CDTS uses dual-path contrastive learning to optimize subspaces while maintaining shared feature space consistency.

Result: CDIKTNet achieves state-of-the-art performance under full supervision compared to supervised methods, and surpasses existing unsupervised methods in both few-shot and cross-domain initialization scenarios.

Conclusion: The proposed framework effectively addresses feature confusion in drone-view geo-localization by combining limited supervision with unsupervised learning, enabling robust cross-domain adaptation without extensive retraining requirements.

Abstract: Traditional supervised drone-view geo-localization (DVGL) methods heavily depend on paired training data and encounter difficulties in learning cross-view correlations from unpaired data. Moreover, when deployed in a new domain, these methods require obtaining the new paired data and subsequent retraining for model adaptation, which significantly increases computational overhead. Existing unsupervised methods have enabled to generate pseudo-labels based on cross-view similarity to infer the pairing relationships. However, geographical similarity and spatial continuity often cause visually analogous features at different geographical locations. The feature confusion compromises the reliability of pseudo-label generation, where incorrect pseudo-labels drive negative optimization. Given these challenges inherent in both supervised and unsupervised DVGL methods, we propose a novel cross-domain invariant knowledge transfer network (CDIKTNet) with limited supervision, whose architecture consists of a cross-domain invariance sub-network (CDIS) and a cross-domain transfer sub-network (CDTS). This architecture facilitates a closed-loop framework for invariance feature learning and knowledge transfer. The CDIS is designed to learn cross-view structural and spatial invariance from a small amount of paired data that serves as prior knowledge. It endows the shared feature space of unpaired data with similar implicit cross-view correlations at initialization, which alleviates feature confusion. Based on this, the CDTS employs dual-path contrastive learning to further optimize each subspace while preserving consistency in a shared feature space. Extensive experiments demonstrate that CDIKTNet achieves state-of-the-art performance under full supervision compared with those supervised methods, and further surpasses existing unsupervised methods in both few-shot and cross-domain initialization.

[226] Force Prompting: Video Generation Models Can Learn and Generalize Physics-based Control Signals

Nate Gillman, Charles Herrmann, Michael Freeman, Daksh Aggarwal, Evan Luo, Deqing Sun, Chen Sun

Main category: cs.CV

TL;DR: The paper introduces force prompts as a control signal for video generation, enabling realistic physical interactions like poking and wind effects without 3D assets or physics simulators at inference.

Details

Motivation: To enable physically meaningful interactions in video generation that mimic real-world forces, which remains largely understudied compared to navigation tasks.

Method: Leverages video generation models adapted to follow physical force conditioning from Blender-synthesized videos, using visual and motion priors from pretrained models without 3D assets or physics simulators at inference.

Result: The method generates videos that simulate forces across diverse geometries, settings, and materials, outperforming existing methods on force adherence and physics realism with only 15k training examples.

Conclusion: Video generation models can generalize remarkably well to physical force conditioning from limited synthetic demonstrations, with visual diversity and specific text keywords being key to this generalization, bringing world models closer to real-world physics interactions.

Abstract: Recent advances in video generation models have sparked interest in world models capable of simulating realistic environments. While navigation has been well-explored, physically meaningful interactions that mimic real-world forces remain largely understudied. In this work, we investigate using physical forces as a control signal for video generation and propose force prompts which enable users to interact with images through both localized point forces, such as poking a plant, and global wind force fields, such as wind blowing on fabric. We demonstrate that these force prompts can enable videos to respond realistically to physical control signals by leveraging the visual and motion prior in the original pretrained model, without using any 3D asset or physics simulator at inference. The primary challenge of force prompting is the difficulty in obtaining high quality paired force-video training data, both in the real world due to the difficulty of obtaining force signals, and in synthetic data due to limitations in the visual quality and domain diversity of physics simulators. Our key finding is that video generation models can generalize remarkably well when adapted to follow physical force conditioning from videos synthesized by Blender, even with limited demonstrations of few objects. Our method can generate videos which simulate forces across diverse geometries, settings, and materials. We also try to understand the source of this generalization and perform ablations that reveal two key elements: visual diversity and the use of specific text keywords during training. Our approach is trained on only around 15k training examples for a single day on four A100 GPUs, and outperforms existing methods on force adherence and physics realism, bringing world models closer to real-world physics interactions. We release all datasets, code, weights, and interactive video demos at our project page.

[227] PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction

Ziqiao Meng, Qichao Wang, Zhiyang Dou, Zixing Song, Zhipeng Zhou, Irwin King, Peilin Zhao

Main category: cs.CV

TL;DR: PointNSP is a coarse-to-fine autoregressive framework for point cloud generation that overcomes limitations of traditional autoregressive models by using multi-scale factorization and next-scale prediction, achieving SOTA quality while being more efficient than diffusion-based approaches.

Details

Motivation: Autoregressive models for point cloud generation suffer from performance gaps due to artificial ordering constraints that undermine global structure capture like symmetry and long-range dependencies, unlike diffusion-based methods.

Method: Proposes PointNSP using level-of-detail principle with coarse-to-fine generation, preserving global structure at low resolutions and refining geometry through next-scale prediction paradigm, enabling rich intra-scale interactions without fixed orderings.

Result: Establishes SOTA generation quality on ShapeNet within autoregressive paradigm, surpasses diffusion baselines in parameter/training/inference efficiency, and shows even better performance with 8,192 points demonstrating scalability.

Conclusion: PointNSP successfully bridges the performance gap between autoregressive and diffusion-based point cloud generation through multi-scale factorization, achieving superior quality and efficiency while maintaining global structural properties.

Abstract: Autoregressive point cloud generation has long lagged behind diffusion-based approaches in quality. The performance gap stems from the fact that autoregressive models impose an artificial ordering on inherently unordered point sets, forcing shape generation to proceed as a sequence of local predictions. This sequential bias emphasizes short-range continuity but undermines the model’s capacity to capture long-range dependencies, hindering its ability to enforce global structural properties such as symmetry, consistent topology, and large-scale geometric regularities. Inspired by the level-of-detail (LOD) principle in shape modeling, we propose PointNSP, a coarse-to-fine generative framework that preserves global shape structure at low resolutions and progressively refines fine-grained geometry at higher scales through a next-scale prediction paradigm. This multi-scale factorization aligns the autoregressive objective with the permutation-invariant nature of point sets, enabling rich intra-scale interactions while avoiding brittle fixed orderings. Experiments on ShapeNet show that PointNSP establishes state-of-the-art (SOTA) generation quality for the first time within the autoregressive paradigm. In addition, it surpasses strong diffusion-based baselines in parameter, training, and inference efficiency. Finally, in dense generation with 8,192 points, PointNSP’s advantages become even more pronounced, underscoring its scalability potential.

[228] Filter Like You Test: Data-Driven Data Filtering for CLIP Pretraining

Mikey Shechter, Yair Carmon

Main category: cs.CV

TL;DR: FLYT is a data curation algorithm that learns to score training examples using gradient signals from downstream tasks, achieving state-of-the-art results on DataComp benchmarks.

Details

Motivation: To improve the quality of large-scale vision-language datasets by learning which data points are most useful for pretraining, rather than relying on simple heuristics or manual curation.

Method: FLYT trains a scoring model using gradient signals from downstream tasks, M-FLYT combines multiple scoring methods, and Soft Cap Sampling prevents over-representation through repetition penalties.

Result: Achieved 40.1% ImageNet zero-shot accuracy on DataComp medium scale (2% absolute improvement) and 37.7% average across 38 tasks (0.4% improvement over previous public-resource approaches).

Conclusion: FLYT provides an effective framework for data curation that significantly improves pretraining performance by learning example usefulness from downstream task signals.

Abstract: We introduce Filter Like You Test (FLYT), an algorithm for curating large-scale vision-language datasets that learns the usefulness of each data point as a pretraining example. FLYT trains a scoring model that learns to weigh each example’s features using gradient signals from downstream tasks training sets. Based on FLYT, we implement Mixing-FLYT (M-FLYT), which takes the per-example scores generated by different scoring methods as features, and learns to unify them into a single score. FLYT naturally produces a distribution over the training examples, which we leverage through Soft Cap Sampling (SCS), a strategy for obtaining a filtered pretraining dataset from per-example probabilities that samples examples while preventing over-representation through a repetition penalty. Using these methods, we achieve 40.1% ImageNet zero-shot accuracy on the DataComp medium scale filtering benchmark, a 2% absolute accuracy increase over all previous results and a 5.5% increase over results that - like us - use only public resources. Our approach also yields 37.7% on the average of 38 DataComp evaluation tasks, outperforming previous public-resource approaches by 0.4%.

[229] IVY-FAKE: A Unified Explainable Framework and Benchmark for Image and Video AIGC Detection

Changjiang Jiang, Wenhui Dong, Zhonghao Zhang, Chenyang Si, Fengchang Yu, Wei Peng, Xinbin Yuan, Yifei Bi, Ming Zhao, Zian Zhou, Caifeng Shan

Main category: cs.CV

TL;DR: Ivy-Fake is a large-scale multimodal benchmark for explainable AIGC detection, addressing limitations in current datasets and detectors through rich annotations and a reinforcement learning-based detector that provides detailed reasoning chains.

Details

Motivation: Current AIGC detection methods face two major limitations: lack of multidimensional explainable datasets with only binary annotations, and insufficient fine-grained interpretability in existing MLLM-based detectors that hinders reliable localization and explanation.

Method: Introduced Ivy-Fake benchmark with over 106K annotated training samples and 5K verified evaluation examples from multiple sources. Proposed Ivy-xDetector using reinforcement learning with Group Relative Policy Optimization (GRPO) to produce explainable reasoning chains.

Result: Extensive experiments show superiority, with performance improving from 86.88% to 96.32% on GenImage benchmark, surpassing prior state-of-the-art methods by a clear margin.

Conclusion: The proposed Ivy-Fake benchmark and Ivy-xDetector effectively address current limitations in AIGC detection by providing rich annotations and fine-grained interpretability, achieving robust performance across multiple detection benchmarks.

Abstract: The rapid development of Artificial Intelligence Generated Content (AIGC) techniques has enabled the creation of high-quality synthetic content, but it also raises significant security concerns. Current detection methods face two major limitations: (1) the lack of multidimensional explainable datasets for generated images and videos. Existing open-source datasets (e.g., WildFake, GenVideo) rely on oversimplified binary annotations, which restrict the explainability and trustworthiness of trained detectors. (2) Prior MLLM-based forgery detectors (e.g., FakeVLM) exhibit insufficiently fine-grained interpretability in their step-by-step reasoning, which hinders reliable localization and explanation. To address these challenges, we introduce Ivy-Fake, the first large-scale multimodal benchmark for explainable AIGC detection. It consists of over 106K richly annotated training samples (images and videos) and 5,000 manually verified evaluation examples, sourced from multiple generative models and real world datasets through a carefully designed pipeline to ensure both diversity and quality. Furthermore, we propose Ivy-xDetector, a reinforcement learning model based on Group Relative Policy Optimization (GRPO), capable of producing explainable reasoning chains and achieving robust performance across multiple synthetic content detection benchmarks. Extensive experiments demonstrate the superiority of our dataset and confirm the effectiveness of our approach. Notably, our method improves performance on GenImage from 86.88% to 96.32%, surpassing prior state-of-the-art methods by a clear margin.

[230] FlowTok: Flowing Seamlessly Across Text and Image Tokens

Ju He, Qihang Yu, Qihao Liu, Liang-Chieh Chen

Main category: cs.CV

TL;DR: FlowTok introduces a simple flow matching framework that directly evolves between text and image modalities using compact 1D token representations, eliminating complex conditioning mechanisms while achieving comparable performance with improved efficiency.

Details

Motivation: To bridge the gap between text and image modalities more efficiently by avoiding the conventional approach of using text as a conditioning signal that guides denoising from Gaussian noise, and instead enabling direct evolution between modalities.

Method: Projects both text and images into a shared latent space by encoding images into compact 1D token representations, using flow matching to directly evolve between modalities without complex conditioning or noise scheduling.

Result: Reduces latent space size by 3.3x at 256 resolution, achieves comparable performance to state-of-the-art models while being highly memory-efficient, requiring fewer training resources, and achieving faster sampling speeds.

Conclusion: FlowTok demonstrates that a minimal framework using compact 1D tokens and flow matching can effectively bridge text and image modalities with superior efficiency and comparable performance to conventional approaches.

Abstract: Bridging different modalities lies at the heart of cross-modality generation. While conventional approaches treat the text modality as a conditioning signal that gradually guides the denoising process from Gaussian noise to the target image modality, we explore a much simpler paradigm-directly evolving between text and image modalities through flow matching. This requires projecting both modalities into a shared latent space, which poses a significant challenge due to their inherently different representations: text is highly semantic and encoded as 1D tokens, whereas images are spatially redundant and represented as 2D latent embeddings. To address this, we introduce FlowTok, a minimal framework that seamlessly flows across text and images by encoding images into a compact 1D token representation. Compared to prior methods, this design reduces the latent space size by 3.3x at an image resolution of 256, eliminating the need for complex conditioning mechanisms or noise scheduling. Moreover, FlowTok naturally extends to image-to-text generation under the same formulation. With its streamlined architecture centered around compact 1D tokens, FlowTok is highly memory-efficient, requires significantly fewer training resources, and achieves much faster sampling speeds-all while delivering performance comparable to state-of-the-art models. Code is available at https://github.com/TACJu/FlowTok.

[231] Image as an IMU: Estimating Camera Motion from a Single Motion-Blurred Image

Jerred Chen, Ronald Clark

Main category: cs.CV

TL;DR: A novel framework that uses motion blur as a cue for camera pose estimation, predicting motion flow and depth from single blurred images to recover instantaneous camera velocity.

Details

Motivation: Fast camera motions in robotics and VR/AR cause severe motion blur that makes existing pose estimation methods fail, so treating blur as useful information rather than artifact is needed.

Method: Predict dense motion flow field and monocular depth map from single motion-blurred image, then solve linear least squares problem under small motion assumption to recover camera velocity.

Result: Achieves state-of-the-art angular and translational velocity estimates on real-world benchmarks, outperforming MASt3R and COLMAP methods.

Conclusion: Motion blur can be effectively leveraged as a rich cue for robust camera motion estimation, producing IMU-like measurements that capture fast aggressive movements.

Abstract: In many robotics and VR/AR applications, fast camera motions lead to a high level of motion blur, causing existing camera pose estimation methods to fail. In this work, we propose a novel framework that leverages motion blur as a rich cue for motion estimation rather than treating it as an unwanted artifact. Our approach works by predicting a dense motion flow field and a monocular depth map directly from a single motion-blurred image. We then recover the instantaneous camera velocity by solving a linear least squares problem under the small motion assumption. In essence, our method produces an IMU-like measurement that robustly captures fast and aggressive camera movements. To train our model, we construct a large-scale dataset with realistic synthetic motion blur derived from ScanNet++v2 and further refine our model by training end-to-end on real data using our fully differentiable pipeline. Extensive evaluations on real-world benchmarks demonstrate that our method achieves state-of-the-art angular and translational velocity estimates, outperforming current methods like MASt3R and COLMAP.

[232] Stream and Query-guided Feature Aggregation for Efficient and Effective 3D Occupancy Prediction

Seokha Moon, Janghyun Baek, Giseop Kim, Jinkyu Kim, Sunwook Choi

Main category: cs.CV

TL;DR: DuOcc introduces a dual aggregation strategy for 3D occupancy prediction that maintains dense voxel representations for spatial fidelity while achieving high efficiency through stream-based voxel aggregation and query-guided aggregation.

Details

Motivation: Address the trade-off in 3D occupancy prediction where dense voxel methods are accurate but computationally expensive, while sparse methods are efficient but lose spatial detail.

Method: Dual aggregation strategy with: (i) Stream-based Voxel Aggregation for recurrent accumulation and refinement of voxel features, and (ii) Query-guided Aggregation that selectively injects instance-level query features into dynamic object regions.

Result: Achieves state-of-the-art performance on Occ3D-nuScenes and SurroundOcc datasets in real-time settings, while reducing memory usage by over 40% compared to prior methods.

Conclusion: DuOcc successfully mitigates the accuracy-efficiency trade-off in 3D occupancy prediction through its dual aggregation approach, preserving spatial fidelity while maintaining computational efficiency.

Abstract: 3D occupancy prediction has become a key perception task in autonomous driving, as it enables comprehensive scene understanding. Recent methods enhance this understanding by incorporating spatiotemporal information through multi-frame fusion, but they suffer from a trade-off: dense voxel-based representations provide high accuracy at significant computational cost, whereas sparse representations improve efficiency but lose spatial detail. To mitigate this trade-off, we introduce DuOcc, which employs a dual aggregation strategy that retains dense voxel representations to preserve spatial fidelity while maintaining high efficiency. DuOcc consists of two key components: (i) Stream-based Voxel Aggregation, which recurrently accumulates voxel features over time and refines them to suppress warping-induced distortions, preserving a clear separation between occupied and free space. (ii) Query-guided Aggregation, which complements the limitations of voxel accumulation by selectively injecting instance-level query features into the voxel regions occupied by dynamic objects. Experiments on the widely used Occ3D-nuScenes and SurroundOcc datasets demonstrate that DuOcc achieves state-of-the-art performance in real-time settings, while reducing memory usage by over 40% compared to prior methods.

[233] Leveraging Contrast Information for Efficient Document Shadow Removal

Yifan Liu, Jiancheng Huang, Na Liu, Mingfu Yan, Yi Huang, Shifeng Chen

Main category: cs.CV

TL;DR: Proposes a document shadow removal method using contrast representation guidance without needing shadow masks, achieving state-of-the-art performance through coarse-to-fine refinement.

Details

Motivation: Existing document shadow removal methods rely on additional information like shadow masks, lack generalization, cause incomplete removal or content loss, and underutilize original document information.

Method: End-to-end method guided by contrast representation using coarse-to-fine refinement. Extracts contrast information to locate shadows without masks, integrates this into refined removal process for better network guidance and feature fusion.

Result: Extensive experiments show state-of-the-art performance in both qualitative and quantitative evaluations.

Conclusion: The proposed contrast-guided approach effectively removes document shadows without requiring additional masks, leveraging inherent document information for superior performance.

Abstract: Document shadows are a major obstacle in the digitization process. Due to the dense information in text and patterns covered by shadows, document shadow removal requires specialized methods. Existing document shadow removal methods, although showing some progress, still rely on additional information such as shadow masks or lack generalization and effectiveness across different shadow scenarios. This often results in incomplete shadow removal or loss of original document content and tones. Moreover, these methods tend to underutilize the information present in the original shadowed document image. In this paper, we refocus our approach on the document images themselves, which inherently contain rich information.We propose an end-to-end document shadow removal method guided by contrast representation, following a coarse-to-fine refinement approach. By extracting document contrast information, we can effectively and quickly locate shadow shapes and positions without the need for additional masks. This information is then integrated into the refined shadow removal process, providing better guidance for network-based removal and feature fusion. Extensive qualitative and quantitative experiments show that our method achieves state-of-the-art performance.

[234] Earth-Adapter: Bridge the Geospatial Domain Gaps with Mixture of Frequency Adaptation

Xiaoxing Hu, Ziyang Gong, Yupei Wang, Yuru Jia, Fei Lin, Dexiang Gao, Ke An, Jianhong Han, Zhuoran Sun, Gen Luo, Gen Luo, Xue Yang

Main category: cs.CV

TL;DR: Earth-Adapter is a novel Parameter-Efficient Fine-Tuning method specifically designed for remote sensing scenarios that uses frequency domain adaptation to handle artifacts, significantly outperforming existing PEFT methods.

Details

Motivation: Existing PEFT methods designed for natural imagery struggle with remote sensing scenarios due to their inability to handle artifact influences, which are particularly severe in RS image features.

Method: Introduces Mixture of Frequency Adaptation combining Mixture of Adapter (MoA) with Discrete Fourier Transformation (DFT) to decompose features into frequency components, separate artifacts, and dynamically combine features across frequency domains.

Result: Significantly improves performance by 9.0% mIoU in Domain Adaptation and 3.1% mIoU in Domain Generalization benchmarks compared to baseline methods.

Conclusion: Earth-Adapter effectively overcomes artifact disturbances in remote sensing scenarios through frequency domain adaptation, enhancing foundation models’ performance on RS tasks.

Abstract: Parameter-Efficient Fine-Tuning (PEFT) is a technique that allows us to adapt powerful Foundation Models (FMs) to diverse downstream tasks while preserving and unleashing their inherent capabilities. However, we have observed that existing PEFT methods, which are often designed with natural imagery in mind, struggle when applied to Remote Sensing (RS) scenarios. This is primarily due to their inability to handle artifact influences, a problem particularly severe in RS image features. To tackle this challenge, we introduce Earth-Adapter, the first PEFT method specifically designed for RS artifacts conquering. Earth-Adapter introduces a novel Mixture of Frequency Adaptation process that combines a Mixture of Adapter (MoA) with Discrete Fourier Transformation (DFT). By utilizing DFT, Earth-Adapter can decompose features into different frequency components, precisely separating artifacts from original features. The MoA then dynamically assigns weights to each adapter expert, allowing for the combination of features across various frequency domains. These simple-yet-effective approaches enable Earth-Adapter to more efficiently overcome the disturbances caused by artifacts than previous PEFT methods, significantly enhancing the FMs’ performance on RS scenarios. Experiments on Domain Adaptation (DA), and Domain Generalization (DG) semantic segmentation benchmarks showcase the Earth-Adapter’s effectiveness. Compared with baseline Rein, Earth-Adapter significantly improves 9.0% mIoU in DA and 3.1% mIoU in DG benchmarks. Our code will be released at https://github.com/VisionXLab/Earth-Adapter.

[235] Contrast-Prior Enhanced Duality for Mask-Free Shadow Removal

Jiyu Wu, Yifan Liu, Jiancheng Huang, Mingfu Yan, Shifeng Chen

Main category: cs.CV

TL;DR: A mask-free shadow removal method using adaptive gated attention and diffusion-based fusion to handle ambiguous contrast cues and restore fine details.

Details

Motivation: Existing methods rely on shadow masks that are hard to obtain in real scenarios. Intrinsic image cues like local contrast can guide removal but suffer from ambiguity in complex scenes where shadows are confused with low-reflectance objects and textures.

Method: Propose Adaptive Gated Dual-Branch Attention (AGBA) to dynamically filter contrast prior and disentangle shadow features. Introduce diffusion-based Frequency-Contrast Fusion Network (FCFN) using high-frequency and contrast cues to restore soft boundaries and fine details.

Result: Achieves state-of-the-art results among mask-free approaches and maintains competitive performance relative to mask-based methods in extensive experiments.

Conclusion: The proposed method effectively addresses shadow removal without requiring masks by adaptively handling contrast ambiguity and leveraging diffusion-based fusion for high-quality restoration.

Abstract: Existing shadow removal methods often rely on shadow masks, which are challenging to acquire in real-world scenarios. Exploring intrinsic image cues, such as local contrast information, presents a potential alternative for guiding shadow removal in the absence of explicit masks. However, the cue’s inherent ambiguity becomes a critical limitation in complex scenes, where it can fail to distinguish true shadows from low-reflectance objects and intricate background textures. To address this motivation, we propose the Adaptive Gated Dual-Branch Attention (AGBA) mechanism. AGBA dynamically filters and re-weighs the contrast prior to effectively disentangle shadow features from confounding visual elements. Furthermore, to tackle the persistent challenge of restoring soft shadow boundaries and fine-grained details, we introduce a diffusion-based Frequency-Contrast Fusion Network (FCFN) that leverages high-frequency and contrast cues to guide the generative process. Extensive experiments demonstrate that our method achieves state-of-the-art results among mask-free approaches while maintaining competitive performance relative to mask-based methods.

[236] Geometrically Regularized Transfer Learning with On-Manifold and Off-Manifold Perturbation

Hana Satou, Alan Mitkiy, Emma Collins, Finn Kingston

Main category: cs.CV

TL;DR: MAADA is a manifold-aware adversarial data augmentation framework that decomposes perturbations into on-manifold and off-manifold components to improve transfer learning under domain shift.

Details

Motivation: Address the fundamental challenge of domain shift in transfer learning by handling the divergence between source and target data manifolds more effectively.

Method: Decompose adversarial perturbations into on-manifold and off-manifold components, enforce on-manifold consistency, apply off-manifold regularization, and use geometry-aware alignment loss to minimize geodesic discrepancy between manifolds.

Result: Outperforms existing adversarial and adaptation methods on DomainNet, VisDA, and Office-Home datasets in both unsupervised and few-shot settings, demonstrating superior structural robustness and cross-domain generalization.

Conclusion: MAADA effectively addresses domain shift by leveraging manifold decomposition and geometric alignment, providing a robust framework for transfer learning that captures both semantic variation and model brittleness.

Abstract: Transfer learning under domain shift remains a fundamental challenge due to the divergence between source and target data manifolds. In this paper, we propose MAADA (Manifold-Aware Adversarial Data Augmentation), a novel framework that decomposes adversarial perturbations into on-manifold and off-manifold components to simultaneously capture semantic variation and model brittleness. We theoretically demonstrate that enforcing on-manifold consistency reduces hypothesis complexity and improves generalization, while off-manifold regularization smooths decision boundaries in low-density regions. Moreover, we introduce a geometry-aware alignment loss that minimizes geodesic discrepancy between source and target manifolds. Experiments on DomainNet, VisDA, and Office-Home show that MAADA consistently outperforms existing adversarial and adaptation methods in both unsupervised and few-shot settings, demonstrating superior structural robustness and cross-domain generalization.

[237] Disentangled Geometric Alignment with Adaptive Contrastive Perturbation for Reliable Domain Transfer

Emma Collins, Myungseo wong, Kim Yun, Finn Kingston, Hana Satou

Main category: cs.CV

TL;DR: GAMA++ improves geometry-aware domain adaptation by introducing latent space disentanglement and adaptive contrastive perturbation, achieving SOTA results on multiple benchmarks.

Details

Motivation: Current methods like GAMA suffer from insufficient disentanglement of task-relevant/irrelevant manifold dimensions and rigid perturbation schemes that ignore per-class alignment asymmetries.

Method: Proposes latent space disentanglement to isolate label-consistent manifold directions, adaptive contrastive perturbation strategy tailored to class-specific manifold curvature, and cross-domain contrastive consistency loss.

Result: Achieves state-of-the-art results on DomainNet, Office-Home, and VisDA benchmarks under standard and few-shot settings, with improvements in class-level alignment fidelity and boundary robustness.

Conclusion: GAMA++ sets a new standard for semantic geometry alignment in transfer learning by addressing key limitations in existing geometry-aware domain adaptation methods.

Abstract: Despite progress in geometry-aware domain adaptation, current methods such as GAMA still suffer from two unresolved issues: (1) insufficient disentanglement of task-relevant and task-irrelevant manifold dimensions, and (2) rigid perturbation schemes that ignore per-class alignment asymmetries. To address this, we propose GAMA++, a novel framework that introduces (i) latent space disentanglement to isolate label-consistent manifold directions from nuisance factors, and (ii) an adaptive contrastive perturbation strategy that tailors both on- and off-manifold exploration to class-specific manifold curvature and alignment discrepancy. We further propose a cross-domain contrastive consistency loss that encourages local semantic clusters to align while preserving intra-domain diversity. Our method achieves state-of-the-art results on DomainNet, Office-Home, and VisDA benchmarks under both standard and few-shot settings, with notable improvements in class-level alignment fidelity and boundary robustness. GAMA++ sets a new standard for semantic geometry alignment in transfer learning.

[238] WeatherDiffusion: Controllable Weather Editing in Intrinsic Space

Yixin Zhu, Zuoliang Zhu, Jian Yang, Miloš Hašan, Jin Xie, Beibei Wang

Main category: cs.CV

TL;DR: WeatherDiffusion is a diffusion-based framework for controllable weather editing using intrinsic maps, featuring an inverse renderer for estimating scene properties and a forward renderer for generating weather-modified images with text prompts.

Details

Motivation: Traditional pixel-space weather editing lacks controllability and spatial correspondence in large outdoor scenes, limiting applications in autonomous driving and weather robustness.

Method: Uses diffusion priors with two components: inverse renderer estimates material, geometry, and lighting as intrinsic maps; forward renderer uses these maps with weather text prompts. Introduces intrinsic map-aware attention and CLIP-space interpolation for fine-grained weather control.

Result: Outperforms state-of-the-art pixel-space editing, weather restoration, and rendering-based methods. Demonstrates improved controllability and spatial correspondence in large outdoor scenes.

Conclusion: WeatherDiffusion shows promise for downstream tasks like autonomous driving by enhancing detection and segmentation robustness in challenging weather conditions through controllable weather editing.

Abstract: We present WeatherDiffusion, a diffusion-based framework for controllable weather editing in intrinsic space. Our framework includes two components based on diffusion priors: an inverse renderer that estimates material properties, scene geometry, and lighting as intrinsic maps from an input image, and a forward renderer that utilizes these geometry and material maps along with a text prompt that describes specific weather conditions to generate a final image. The intrinsic maps enhance controllability compared to traditional pixel-space editing approaches.We propose an intrinsic map-aware attention mechanism that improves spatial correspondence and decomposition quality in large outdoor scenes. For forward rendering, we leverage CLIP-space interpolation of weather prompts to achieve fine-grained weather control. We also introduce a synthetic and a real-world dataset, containing 38k and 18k images under various weather conditions, each with intrinsic map annotations. WeatherDiffusion outperforms state-of-the-art pixel-space editing approaches, weather restoration methods, and rendering-based methods, showing promise for downstream tasks such as autonomous driving, enhancing the robustness of detection and segmentation in challenging weather scenarios.

[239] ISAC: Training-Free Instance-to-Semantic Attention Control for Improving Multi-Instance Generation

Sanghyun Jo, Wooyeol Lee, Ziseok Lee, Kyungsu Kim

Main category: cs.CV

TL;DR: ISAC is a training-free method that improves multi-object generation in diffusion models by using hierarchical attention control to separate instance formation from semantic assignment, addressing issues like incorrect instance counts and semantic leakage.

Details

Motivation: Text-to-image diffusion models struggle with multi-object scenes, producing wrong instance numbers and semantic leakage across objects due to vague instance boundaries in self-attention mechanisms.

Method: ISAC performs hierarchical attention control in two phases: Phase 1 clusters self-attention to establish instance layouts and repel overlaps; Phase 2 injects instance cues into cross-attention to create instance-aware semantic masks and decompose mixing semantics.

Result: ISAC achieves consistent improvements on T2I-CompBench, HRS-Bench, and IntraCompBench, with at least 50% improvement in multi-class accuracy and 7% in multi-instance accuracy on IntraCompBench, without fine-tuning or external models.

Conclusion: Hierarchical decoupling of instance formation and semantic assignment is key for robust multi-object generation, and ISAC also enhances layout-to-image controllers by refining coarse box layouts into dense instance masks.

Abstract: Text-to-image diffusion models have recently become highly capable, yet their behavior in multi-object scenes remains unreliable: models often produce an incorrect number of instances and exhibit semantics leaking across objects. We trace these failures to vague instance boundaries; self-attention already reveals instance layouts early in the denoising process, but existing approaches act only on semantic signals. We introduce $\textbf{ISAC}$ ($\textbf{I}$nstance-to-$\textbf{S}$emantic $\textbf{A}$ttention $\textbf{C}$ontrol), a training-free, model-agnostic objective that performs hierarchical attention control by first carving out instance layouts from self-attention and then binding semantics to these instances. In Phase 1, ISAC clusters self-attention into the number of instances and repels overlaps, establishing an instance-level structural hierarchy; in Phase 2, it injects these instance cues into cross-attention to obtain instance-aware semantic masks and decomposes mixing semantics by tying attributes within each instance. ISAC yields consistent gains on T2I-CompBench, HRS-Bench, and IntraCompBench, our new benchmark for intra-class compositions where failures are most frequent, with improvements of at least 50% in multi-class accuracy and 7% in multi-instance accuracy on IntraCompBench, without any fine-tuning or external models. Beyond text-to-image setups, ISAC also strengthens layout-to-image controllers under overlapping boxes by refining coarse box layouts into dense instance masks, indicating that hierarchical decoupling of instance formation and semantic assignment is a key principle for robust, controllable multi-object generation. Code will be released upon publication.

[240] Diffusion-Denoised Hyperspectral Gaussian Splatting

Sunil Kumar Narayanan, Lingjun Zhao, Lu Gan, Yongsheng Chen

Main category: cs.CV

TL;DR: DD-HGS enhances 3D Gaussian Splatting with wavelength-aware spherical harmonics, spectral loss, and diffusion denoising for efficient hyperspectral scene reconstruction, achieving state-of-the-art performance.

Details

Motivation: Current NeRF-based methods for hyperspectral imaging face limitations in training time and rendering speed, hindering practical agricultural applications for nutrient composition analysis.

Method: Proposed DD-HGS combines wavelength-aware spherical harmonics, Kullback-Leibler divergence-based spectral loss, and diffusion-based denoiser with 3D Gaussian Splatting for explicit hyperspectral reconstruction.

Result: Extensive evaluations on Hyper-NeRF dataset show DD-HGS achieves state-of-the-art performance in hyperspectral scene reconstruction across the entire spectral range.

Conclusion: DD-HGS provides an efficient and accurate solution for 3D hyperspectral reconstruction, overcoming previous limitations in training time and rendering speed.

Abstract: Hyperspectral imaging (HSI) has been widely used in agricultural applications for non-destructive estimation of plant nutrient composition and precise quantification of sample nutritional elements. Recently, 3D reconstruction methods, such as Neural Radiance Field (NeRF), have been used to create implicit neural representations of HSI scenes. This capability enables the rendering of hyperspectral channel compositions at every spatial location, thereby helping localize the target object’s nutrient composition both spatially and spectrally. However, it faces limitations in training time and rendering speed. In this paper, we propose Diffusion-Denoised Hyperspectral Gaussian Splatting (DD-HGS), which enhances the state-of-the-art 3D Gaussian Splatting (3DGS) method with wavelength-aware spherical harmonics, a Kullback-Leibler divergence-based spectral loss, and a diffusion-based denoiser to enable 3D explicit reconstruction of the hyperspectral scenes for the entire spectral range. We present extensive evaluations on diverse real-world hyperspectral scenes from the Hyper-NeRF dataset to show the effectiveness of our DD-HGS. The results demonstrate that DD-HGS achieves the new state-of-the-art performance compared to all the previously published methods. Project page: https://dragonpg2000.github.io/DDHGS-website/

[241] Adapting Segment Anything Model for Power Transmission Corridor Hazard Segmentation

Hang Chen, Maoyuan Ye, Peng Yang, Haibin He, Juhua Liu, Bo Du

Main category: cs.CV

TL;DR: ELE-SAM adapts Segment Anything Model (SAM) for power transmission corridor hazard segmentation by adding a Context-Aware Prompt Adapter and High-Fidelity Mask Decoder to handle fine structures in complex backgrounds, achieving significant performance improvements.

Details

Motivation: SAM struggles with fine-structured target objects in complex transmission corridor scenarios, limiting its effectiveness for power transmission safety applications.

Method: Developed Context-Aware Prompt Adapter for better prompt tokens using global-local features, and High-Fidelity Mask Decoder leveraging multi-granularity mask features at higher resolution. Also created ELE-40K dataset with 44,094 image-mask pairs.

Result: Outperforms baseline by 16.8% mIoU and 20.6% mBIoU on ELE-40K, and achieves 2.9% mIoU and 3.8% mBIoU improvements over SOTA on HQSeg-44K.

Conclusion: ELE-SAM effectively adapts SAM for power transmission corridor hazard segmentation, demonstrating superior performance on both domain-specific and generic high-quality segmentation tasks.

Abstract: Power transmission corridor hazard segmentation (PTCHS) aims to separate transmission equipment and surrounding hazards from complex background, conveying great significance to maintaining electric power transmission safety. Recently, the Segment Anything Model (SAM) has emerged as a foundational vision model and pushed the boundaries of segmentation tasks. However, SAM struggles to deal with the target objects in complex transmission corridor scenario, especially those with fine structure. In this paper, we propose ELE-SAM, adapting SAM for the PTCHS task. Technically, we develop a Context-Aware Prompt Adapter to achieve better prompt tokens via incorporating global-local features and focusing more on key regions. Subsequently, to tackle the hazard objects with fine structure in complex background, we design a High-Fidelity Mask Decoder by leveraging multi-granularity mask features and then scaling them to a higher resolution. Moreover, to train ELE-SAM and advance this field, we construct the ELE-40K benchmark, the first large-scale and real-world dataset for PTCHS including 44,094 image-mask pairs. Experimental results for ELE-40K demonstrate the superior performance that ELE-SAM outperforms the baseline model with the average 16.8% mIoU and 20.6% mBIoU performance improvement. Moreover, compared with the state-of-the-art method on HQSeg-44K, the average 2.9% mIoU and 3.8% mBIoU absolute improvements further validate the effectiveness of our method on high-quality generic object segmentation. The source code and dataset are available at https://github.com/Hhaizee/ELE-SAM.

[242] Not All Splits Are Equal: Rethinking Attribute Generalization Across Unrelated Categories

Liviu Nicolae Fircă, Antonio Bărbălau, Dan Oneata, Elena Burceanu

Main category: cs.CV

TL;DR: This paper evaluates whether models can generalize attribute knowledge across semantically and perceptually dissimilar categories, testing attribute prediction robustness when training and test categories are unrelated.

Details

Motivation: To understand if current models can abstract attributes and apply them to conceptually distant categories, beyond narrow taxonomic or visually similar domains.

Method: Introduced train-test split strategies that progressively reduce correlation between training and test sets using: LLM-driven semantic grouping, embedding similarity thresholding, embedding-based clustering, and supercategory-based partitioning.

Result: Results show a sharp performance drop as correlation between training and test categories decreases, indicating strong sensitivity to split design. Clustering yielded the most effective trade-off.

Conclusion: Current models have limitations in attribute reasoning across dissimilar categories, and clustering-based splits offer better benchmark construction for evaluating attribute generalization.

Abstract: Can models generalize attribute knowledge across semantically and perceptually dissimilar categories? While prior work has addressed attribute prediction within narrow taxonomic or visually similar domains, it remains unclear whether current models can abstract attributes and apply them to conceptually distant categories. This work presents the first explicit evaluation for the robustness of the attribute prediction task under such conditions, testing whether models can correctly infer shared attributes between unrelated object types: e.g., identifying that the attribute “has four legs” is common to both “dogs” and “chairs”. To enable this evaluation, we introduce train-test split strategies that progressively reduce correlation between training and test sets, based on: LLM-driven semantic grouping, embedding similarity thresholding, embedding-based clustering, and supercategory-based partitioning using ground-truth labels. Results show a sharp drop in performance as the correlation between training and test categories decreases, indicating strong sensitivity to split design. Among the evaluated methods, clustering yields the most effective trade-off, reducing hidden correlations while preserving learnability. These findings offer new insights into the limitations of current representations and inform future benchmark construction for attribute reasoning.

[243] Vision Remember: Recovering Visual Information in Efficient LVLM with Vision Feature Resampling

Ze Feng, Jiang-jiang Liu, Sen Yang, Lingyu Xiao, Zhibin Quan, Zhenhua Feng, Wankou Yang, Jingdong Wang

Main category: cs.CV

TL;DR: Vision Remember improves LVLM efficiency by resampling vision features across LLM decoder layers to recover fine-grained visual information lost in token compression, outperforming existing methods on multiple benchmarks.

Details

Motivation: Existing vision token compression methods lose crucial fine-grained visual information needed for tasks like OCR and chart understanding, creating a need for efficient visual information recovery.

Method: Proposes Vision Remember with two modules: Token-Feature Cross-Attention Layer for local cross-attention and multi-level fusion, and Token Bidirectional Self-Attention Layer for maintaining interaction between vision tokens and text-guided tokens.

Result: Outperforms TokenPacker by +2.7 and FastV by +5.7 across settings, and surpasses DeepStack by +3.9 and SVA Aggregator by +3.4 on the same baseline, showing strong generalization with various efficient vision projectors and LVLMs.

Conclusion: Vision Remember effectively recovers visual information while maintaining efficiency, demonstrating superior performance over existing methods and strong generalization capability.

Abstract: The computational expense of redundant vision tokens in Large Vision-Language Models (LVLMs) has led many existing methods to compress them via a vision projector. However, this compression may lose visual information that is crucial for tasks relying on fine-grained spatial relationships, such as OCR and Chart&Table Understanding. In this paper, we propose to resample original vision features across the LLM decoder layers to recover visual information and attain efficiency. Following this principle, we introduce Vision Remember, which includes two key modules: (1) Token-Feature Cross-Attention Layer and (2) Token Bidirectional Self-Attention Layer. In the Token bidirectional attention, we employ self-attention mechanism to maintain the bidirectional interaction between vision tokens and the text-guided token. In the Token-Feature interaction attention, we introduce local cross-attention to resample the visual feature and utilize the multi-level fusion to enrich the visual representation. We conduct comprehensive experiments on multiple visual understanding benchmarks and the results with the LLaVA-NeXT baseline show that Vision Remember outperforms TokenPacker by +2.7 and FastV by +5.7 across nearly all the settings. Compared with previous vision feature re-fusion methods, our approach also surpasses DeepStack by +3.9 and SVA Aggregator by +3.4 on the same baseline. The experimental results validate the generalization capability of the proposed method when combined with various efficient vision projectors and LVLMs.

[244] Modular, On-Site Solutions with Lightweight Anomaly Detection for Sustainable Nutrient Management in Agriculture

Abigail R. Cohen, Yuming Sun, Zhihao Qin, Harsh S. Muriki, Zihao Xiao, Yeonju Lee, Matthew Housley, Andrew F. Sharkey, Rhuanito S. Ferrarezi, Jing Li, Lu Gan, Yongsheng Chen

Main category: cs.CV

TL;DR: A tiered pipeline using multispectral imaging and autoencoders for early anomaly detection in crop nutrient management, with vision transformers outperforming random forests on nutrient estimation at higher energy costs.

Details

Motivation: Current nutrient management approaches require lengthy analyses preventing real-time optimization, and imaging-based phenotyping is computationally intensive for resource-constrained deployment.

Method: Hierarchical pipeline using autoencoder for anomaly detection, comparing vegetation index features with random forest vs. raw whole-image vision transformer for status estimation of fresh weight, dry mass, and tissue nutrients.

Result: 73% net detection of severely nutrient-deficient samples 9 days after transplanting at lower energy than wasted nitrogen; ViT outperformed RF on phosphorus and calcium estimation (R2 0.61 vs 0.58, 0.48 vs 0.35) but with higher energy cost.

Conclusion: The modular pipeline enables edge diagnostics and practical opportunities for agricultural sustainability by balancing efficiency-accuracy trade-offs in nutrient management.

Abstract: Efficient nutrient management is critical for crop growth and sustainable resource consumption (e.g., nitrogen, energy). Current approaches require lengthy analyses, preventing real-time optimization; similarly, imaging facilitates rapid phenotyping but can be computationally intensive, preventing deployment under resource constraints. This study proposes a flexible, tiered pipeline for anomaly detection and status estimation (fresh weight, dry mass, and tissue nutrients), including a comprehensive energy analysis of approaches that span the efficiency-accuracy spectrum. Using a nutrient depletion experiment with three treatments (T1-100%, T2-50%, and T3-25% fertilizer strength) and multispectral imaging (MSI), we developed a hierarchical pipeline using an autoencoder (AE) for early warning. Further, we compared two status estimation modules of different complexity for more detailed analysis: vegetation index (VI) features with machine learning (Random Forest, RF) and raw whole-image deep learning (Vision Transformer, ViT). Results demonstrated high-efficiency anomaly detection (73% net detection of T3 samples 9 days after transplanting) at substantially lower energy than embodied energy in wasted nitrogen. The state estimation modules show trade-offs, with ViT outperforming RF on phosphorus and calcium estimation (R2 0.61 vs. 0.58, 0.48 vs. 0.35) at higher energy cost. With our modular pipeline, this work opens opportunities for edge diagnostics and practical opportunities for agricultural sustainability.

[245] Dynamic Epsilon Scheduling: A Multi-Factor Adaptive Perturbation Budget for Adversarial Training

Alan Mitkiy, James Smith, Myungseo wong, Hana Satou, Hiroshi Tanaka, Emily Johnson

Main category: cs.CV

TL;DR: DES is a dynamic adversarial training framework that adaptively adjusts perturbation budgets per instance and iteration using gradient-based boundary distance, prediction confidence, and model uncertainty.

Details

Motivation: Existing adversarial training methods use fixed perturbation budgets that don't account for instance-specific robustness characteristics, limiting their effectiveness.

Method: Dynamic Epsilon Scheduling (DES) integrates three factors: gradient-based decision boundary distance, softmax entropy prediction confidence, and Monte Carlo dropout model uncertainty to adaptively adjust perturbation budgets.

Result: Experiments on CIFAR-10 and CIFAR-100 show DES consistently improves both adversarial robustness and standard accuracy compared to fixed-epsilon baselines and prior adaptive methods.

Conclusion: DES provides a novel instance-aware, data-driven adversarial training approach with theoretical insights into stability and convergence, opening new avenues for adaptive defense methods.

Abstract: Adversarial training is among the most effective strategies for defending deep neural networks against adversarial examples. A key limitation of existing adversarial training approaches lies in their reliance on a fixed perturbation budget, which fails to account for instance-specific robustness characteristics. While prior works such as IAAT and MMA introduce instance-level adaptations, they often rely on heuristic or static approximations of data robustness. In this paper, we propose Dynamic Epsilon Scheduling (DES), a novel framework that adaptively adjusts the adversarial perturbation budget per instance and per training iteration. DES integrates three key factors: (1) the distance to the decision boundary approximated via gradient-based proxies, (2) prediction confidence derived from softmax entropy, and (3) model uncertainty estimated via Monte Carlo dropout. By combining these cues into a unified scheduling strategy, DES tailors the perturbation budget dynamically to guide more effective adversarial learning. Experimental results on CIFAR-10 and CIFAR-100 show that our method consistently improves both adversarial robustness and standard accuracy compared to fixed-epsilon baselines and prior adaptive methods. Moreover, we provide theoretical insights into the stability and convergence of our scheduling policy. This work opens a new avenue for instance-aware, data-driven adversarial training methods.

[246] MetricHMSR:Metric Human Mesh and Scene Recovery from Monocular Images

Chentao Song, He Zhang, Haolei Yuan, Haozhe Lin, Jianhua Tao, Hongwen Zhang, Tao Yu

Main category: cs.CV

TL;DR: MetricHMSR is a unified framework for metric human mesh and scene recovery from monocular images that uses camera rays and Human Mixture-of-Experts to simultaneously estimate human pose and 3D position.

Details

Motivation: Existing approaches struggle with metric human pose and 3D position estimation due to unrealistic camera assumptions and metric perception challenges, requiring a unified solution.

Method: Incorporates camera rays to encode bounding box and intrinsic parameters, uses Human Mixture-of-Experts to route features to task-specific experts, and refines metric depth estimation for accurate 3D overlay.

Result: Achieves state-of-the-art performance on both human mesh and scene recovery tasks.

Conclusion: MetricHMSR provides an effective unified framework for simultaneous metric human pose and 3D position estimation from monocular images.

Abstract: We introduce MetricHMSR (Metric Human Mesh and Scene Recovery), a novel approach for metric human mesh and scene recovery from monocular images. Due to unrealistic assumptions in the camera model and inherent challenges in metric perception, existing approaches struggle to achieve human pose and metric 3D position estimation through a unified module. To address this limitation, MetricHMSR incorporates camera rays to comprehensively encode both the bounding box information and the intrinsic parameters of perspective projection. Then we proposed Human Mixture-of-Experts (MoE), the model dynamically routes image features and ray features to task-specific experts for specialized understanding of different data aspects, enabling a unified framework that simultaneously perceives the local pose and the global 3D position. Based on the results above, we further refine the existing monocular metric depth estimation method to achieve more accurate results, ultimately enabling the seamless overlay of humans and scenes in 3D space. Comprehensive experimental results demonstrate that the proposed method achieves state-of-the-art performance on both human mesh and scene recovery.

[247] Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding

Youze Wang, Zijun Chen, Ruoyu Chen, Shishen Gu, Wenbo Hu, Jiayang Liu, Yinpeng Dong, Hang Su, Jun Zhu, Meng Wang, Richang Hong

Main category: cs.CV

TL;DR: Trust-videoLLMs is the first comprehensive benchmark evaluating 23 videoLLMs across truthfulness, robustness, safety, fairness, and privacy, revealing significant limitations in dynamic scene comprehension and risk mitigation.

Details

Motivation: VideoLLMs face challenges with factual inaccuracies, harmful content, biases, hallucinations, and privacy risks that compromise reliability, creating a need for comprehensive trustworthiness evaluation beyond just accuracy metrics.

Method: Developed a framework with 30 tasks using adapted, synthetic, and annotated videos to assess spatiotemporal risks, temporal consistency, and cross-modal impact across five dimensions: truthfulness, robustness, safety, fairness, and privacy.

Result: Significant limitations found in dynamic scene comprehension, cross-modal perturbation resilience, and real-world risk mitigation. Proprietary models generally show superior credibility, but scaling doesn’t consistently improve performance. Open-source models occasionally outperform commercial ones.

Conclusion: There’s a critical need for enhanced training data diversity and robust multimodal alignment. Trust-videoLLMs provides an extensible toolkit for standardized trustworthiness assessments to bridge the gap between accuracy-focused benchmarks and demands for robustness, safety, fairness, and privacy.

Abstract: Recent advancements in multimodal large language models for video understanding (videoLLMs) have enhanced their capacity to process complex spatiotemporal data. However, challenges such as factual inaccuracies, harmful content, biases, hallucinations, and privacy risks compromise their reliability. This study introduces Trust-videoLLMs, a first comprehensive benchmark evaluating 23 state-of-the-art videoLLMs (5 commercial, 18 open-source) across five critical dimensions: truthfulness, robustness, safety, fairness, and privacy. Comprising 30 tasks with adapted, synthetic, and annotated videos, the framework assesses spatiotemporal risks, temporal consistency and cross-modal impact. Results reveal significant limitations in dynamic scene comprehension, cross-modal perturbation resilience and real-world risk mitigation. While open-source models occasionally outperform, proprietary models generally exhibit superior credibility, though scaling does not consistently improve performance. These findings underscore the need for enhanced training datat diversity and robust multimodal alignment. Trust-videoLLMs provides a publicly available, extensible toolkit for standardized trustworthiness assessments, addressing the critical gap between accuracy-focused benchmarks and demands for robustness, safety, fairness, and privacy.

[248] PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction

Ziqiao Meng, Qichao Wang, Zhiyang Dou, Zixing Song, Zhipeng Zhou, Irwin King, Peilin Zhao

Main category: cs.CV

TL;DR: PointNSP is a coarse-to-fine autoregressive framework for point cloud generation that overcomes traditional autoregressive limitations by using multi-scale factorization and next-scale prediction, achieving state-of-the-art quality while being more efficient than diffusion models.

Details

Motivation: Autoregressive point cloud generation has lagged behind diffusion models due to artificial ordering constraints that undermine global structural properties like symmetry and long-range dependencies.

Method: Proposes PointNSP, a coarse-to-fine generative framework using level-of-detail principles with next-scale prediction, enabling multi-scale factorization that preserves global structure at low resolutions and refines details progressively.

Result: Establishes SOTA generation quality on ShapeNet, surpasses diffusion baselines in parameter/training/inference efficiency, and shows superior scalability with 8,192 points.

Conclusion: PointNSP successfully bridges the performance gap between autoregressive and diffusion models for point cloud generation while offering better efficiency and scalability.

[249] Learning Normals of Noisy Points by Local Gradient-Aware Surface Filtering

Qing Li, Huifang Feng, Xun Gong, Yu-Shen Liu

Main category: cs.CV

TL;DR: A novel method for estimating normals from noisy point clouds using local gradient-aware surface filtering that projects noisy points onto underlying surfaces through implicit functions constrained by local gradients.

Details

Motivation: Existing normal estimation methods work well on clean data but struggle with noisy point clouds, relying on supervised priors and specific neighborhoods without effectively handling noise.

Method: Uses local gradient-aware surface filtering with implicit functions, distance measurement operators for global surface fitting, and incorporates local gradient consistency constraints to prevent over-smoothing and gradient degradation.

Result: Achieves state-of-the-art performance in normal estimation, surface reconstruction, and point cloud denoising across comprehensive experiments.

Conclusion: The proposed LGSF method effectively handles noisy point clouds through gradient-aware filtering and demonstrates superior performance in multiple 3D geometry processing tasks.

Abstract: Estimating normals for noisy point clouds is a persistent challenge in 3D geometry processing, particularly for end-to-end oriented normal estimation. Existing methods generally address relatively clean data and rely on supervised priors to fit local surfaces within specific neighborhoods. In this paper, we propose a novel approach for learning normals from noisy point clouds through local gradient-aware surface filtering. Our method projects noisy points onto the underlying surface by utilizing normals and distances derived from an implicit function constrained by local gradients. We start by introducing a distance measurement operator for global surface fitting on noisy data, which integrates projected distances along normals. Following this, we develop an implicit field-based filtering approach for surface point construction, adding projection constraints on these points during filtering. To address issues of over-smoothing and gradient degradation, we further incorporate local gradient consistency constraints, as well as local gradient orientation and aggregation. Comprehensive experiments on normal estimation, surface reconstruction, and point cloud denoising demonstrate the state-of-the-art performance of our method. The source code and trained models are available at https://github.com/LeoQLi/LGSF.

[250] Automated Neural Architecture Design for Industrial Defect Detection

Yuxi Liu, Yunfeng Ma, Yi Tang, Min Liu, Shuai Jiang, Yaonan Wang

Main category: cs.CV

TL;DR: AutoNAD is an automated neural architecture design framework for surface defect detection that jointly searches over convolutions, transformers, and MLPs to address intraclass differences and interclass similarity challenges.

Details

Motivation: Industrial surface defect detection faces challenges with diverse defect shapes and sizes, leading to intraclass differences and interclass similarity. Existing manual design methods require extensive trial and error and struggle to address both challenges effectively.

Method: Proposes AutoNAD framework that searches over convolutions, transformers, and MLPs using cross weight sharing strategy for efficient training and searchable multi-level feature aggregation module for multi-scale learning. Includes latency-aware prior for runtime efficiency.

Result: Validated on three industrial defect datasets and integrated into a defect imaging and detection platform. Code is publicly available.

Conclusion: AutoNAD effectively addresses surface defect detection challenges by automating neural architecture design, capturing both local variations and long-range context while ensuring runtime efficiency for industrial deployment.

Abstract: Industrial surface defect detection (SDD) is critical for ensuring product quality and manufacturing reliability. Due to the diverse shapes and sizes of surface defects, SDD faces two main challenges: intraclass difference and interclass similarity. Existing methods primarily utilize manually designed models, which require extensive trial and error and often struggle to address both challenges effectively. To overcome this, we propose AutoNAD, an automated neural architecture design framework for SDD that jointly searches over convolutions, transformers, and multi-layer perceptrons. This hybrid design enables the model to capture both fine-grained local variations and long-range semantic context, addressing the two key challenges while reducing the cost of manual network design. To support efficient training of such a diverse search space, AutoNAD introduces a cross weight sharing strategy, which accelerates supernet convergence and improves subnet performance. Additionally, a searchable multi-level feature aggregation module (MFAM) is integrated to enhance multi-scale feature learning. Beyond detection accuracy, runtime efficiency is essential for industrial deployment. To this end, AutoNAD incorporates a latency-aware prior to guide the selection of efficient architectures. The effectiveness of AutoNAD is validated on three industrial defect datasets and further applied within a defect imaging and detection platform. Code is available at https://github.com/Yuxi104/AutoNAD.

[251] ReasonAct: Progressive Training for Fine-Grained Video Reasoning in Small Models

Jiaxin Liu, Zhaolu Kang

Main category: cs.CV

TL;DR: ReasonAct enhances video reasoning in small models via three-stage training: text reasoning foundation, video fine-tuning, and temporal-aware RL with biomechanical sub-action decomposition.

Details

Motivation: Small multimodal models struggle with fine-grained temporal reasoning needed for video understanding, despite progress in vision-language tasks.

Method: Three-stage training: text-only reasoning foundation, video fine-tuning, temporal-aware RL with T-GRPO enhancement and biomechanical sub-action decomposition for graduated rewards.

Result: 3B-parameter model achieves 67.2% (HMDB51), 94.1% (UCF-101), 78.9% (Kinetics-400) accuracy, improving baselines by 17.9, 15.8, and 12.3 points respectively.

Conclusion: Progressive training enables smaller models to achieve competitive video reasoning performance while maintaining computational efficiency.

Abstract: While recent multimodal models have shown progress in vision-language tasks, small-scale variants still struggle with the fine-grained temporal reasoning required for video understanding. We introduce ReasonAct, a method that enhances video reasoning in smaller models through a three-stage training process: first building a foundation with text-only reasoning, then fine-tuning on video, and finally refining with temporal-aware reinforcement learning. We build upon Temporal Group Relative Policy Optimization (T-GRPO) by incorporating temporal consistency modeling into policy optimization. We also propose a biomechanically-motivated sub-action decomposition mechanism that provides graduated rewards for constituent action phases. Through experiments on HMDB51, UCF-101, and Kinetics-400, our 3B-parameter model achieves 67.2%, 94.1%, and 78.9% accuracy respectively, demonstrating improvements of 17.9, 15.8, and 12.3 points over baselines. Ablation studies validate that our progressive training enables smaller models to achieve competitive video reasoning performance while maintaining computational efficiency.

[252] SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation

Zhenjie Mao, Yuhuan Yang, Chaofan Ma, Dongsheng Jiang, Jiangchao Yao, Ya Zhang, Yanfeng Wang

Main category: cs.CV

TL;DR: SaFiRe is a novel framework for Referring Image Segmentation that handles ambiguous expressions through a two-phase cognitive process, achieving state-of-the-art performance on both standard and new benchmarks.

Details

Motivation: Current RIS methods focus on simple expressions and reduce the task to keyword matching, failing to handle referential ambiguity in real-world scenarios like object-distracting and category-implicit expressions.

Method: Proposes SaFiRe framework that mimics human two-phase cognition (global understanding then detail refinement), leveraging Mamba’s scan-then-update property for efficient multi-cycle refinement with linear complexity.

Result: Extensive experiments show SaFiRe outperforms state-of-the-art baselines on both standard datasets and the newly introduced aRefCOCO benchmark for ambiguous referring expressions.

Conclusion: SaFiRe effectively addresses referential ambiguity in RIS through a cognitive-inspired approach and demonstrates superior performance, highlighting the importance of handling complex real-world expressions.

Abstract: Referring Image Segmentation (RIS) aims to segment the target object in an image given a natural language expression. While recent methods leverage pre-trained vision backbones and more training corpus to achieve impressive results, they predominantly focus on simple expressions–short, clear noun phrases like “red car” or “left girl”. This simplification often reduces RIS to a key word/concept matching problem, limiting the model’s ability to handle referential ambiguity in expressions. In this work, we identify two challenging real-world scenarios: object-distracting expressions, which involve multiple entities with contextual cues, and category-implicit expressions, where the object class is not explicitly stated. To address the challenges, we propose a novel framework, SaFiRe, which mimics the human two-phase cognitive process–first forming a global understanding, then refining it through detail-oriented inspection. This is naturally supported by Mamba’s scan-then-update property, which aligns with our phased design and enables efficient multi-cycle refinement with linear complexity. We further introduce aRefCOCO, a new benchmark designed to evaluate RIS models under ambiguous referring expressions. Extensive experiments on both standard and proposed datasets demonstrate the superiority of SaFiRe over state-of-the-art baselines.

[253] MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning

Thanh-Dat Truong, Christophe Bobda, Nitin Agarwal, Khoa Luu

Main category: cs.CV

TL;DR: MANGO introduces an interpretable multimodal fusion approach using invertible cross-attention layers in normalizing flows, achieving state-of-the-art performance on multiple tasks.

Details

Motivation: Current multimodal fusion methods use Transformers' attention to implicitly learn correlations, failing to capture essential modality features and complex multimodal structures.

Method: Proposes Multimodal Attention-based Normalizing Flow (MANGO) with Invertible Cross-Attention (ICA) layers and three cross-attention mechanisms: MMCA, IMCA, and LICA for capturing multimodal correlations.

Result: Achieved state-of-the-art performance on semantic segmentation, image-to-image translation, and movie genre classification tasks.

Conclusion: MANGO provides explicit, interpretable, and tractable multimodal fusion learning that effectively captures complex multimodal correlations.

Abstract: Multimodal learning has gained much success in recent years. However, current multimodal fusion methods adopt the attention mechanism of Transformers to implicitly learn the underlying correlation of multimodal features. As a result, the multimodal model cannot capture the essential features of each modality, making it difficult to comprehend complex structures and correlations of multimodal inputs. This paper introduces a novel Multimodal Attention-based Normalizing Flow (MANGO) approach to developing explicit, interpretable, and tractable multimodal fusion learning. In particular, we propose a new Invertible Cross-Attention (ICA) layer to develop the Normalizing Flow-based Model for multimodal data. To efficiently capture the complex, underlying correlations in multimodal data in our proposed invertible cross-attention layer, we propose three new cross-attention mechanisms: Modality-to-Modality Cross-Attention (MMCA), Inter-Modality Cross-Attention (IMCA), and Learnable Inter-Modality Cross-Attention (LICA). Finally, we introduce a new Multimodal Attention-based Normalizing Flow to enable the scalability of our proposed method to high-dimensional multimodal data. Our experimental results on three different multimodal learning tasks, i.e., semantic segmentation, image-to-image translation, and movie genre classification, have illustrated the state-of-the-art (SoTA) performance of the proposed approach.

[254] Towards Spatially Consistent Image Generation: On Incorporating Intrinsic Scene Properties into Diffusion Models

Hyundo Lee, Suhyung Choi, Inwoo Hwang, Byoung-Tak Zhang

Main category: cs.CV

TL;DR: A method that co-generates images and intrinsic scene properties (depth, segmentation maps) to improve spatial consistency and realism in image generation, building on pre-trained Latent Diffusion Models.

Details

Motivation: Image generation models often produce spatially inconsistent and distorted images due to limited structural information. Leveraging intrinsic scene properties can provide rich information about underlying scene structure.

Method: Extract intrinsic scene properties from large datasets using pre-trained estimators, aggregate them into a single latent variable via autoencoder, and simultaneously denoise image and intrinsic domains in Latent Diffusion Models while sharing mutual information.

Result: The method corrects spatial inconsistencies, produces more natural scene layouts, and maintains fidelity and textual alignment compared to base models like Stable Diffusion.

Conclusion: Co-generating images with intrinsic scene properties enables implicit capture of scene structure, leading to more spatially consistent and realistic image generation without degrading quality.

Abstract: Image generation models trained on large datasets can synthesize high-quality images but often produce spatially inconsistent and distorted images due to limited information about the underlying structures and spatial layouts. In this work, we leverage intrinsic scene properties (e.g., depth, segmentation maps) that provide rich information about the underlying scene, unlike prior approaches that solely rely on image-text pairs or use intrinsics as conditional inputs. Our approach aims to co-generate both images and their corresponding intrinsics, enabling the model to implicitly capture the underlying scene structure and generate more spatially consistent and realistic images. Specifically, we first extract rich intrinsic scene properties from a large image dataset with pre-trained estimators, eliminating the need for additional scene information or explicit 3D representations. We then aggregate various intrinsic scene properties into a single latent variable using an autoencoder. Building upon pre-trained large-scale Latent Diffusion Models (LDMs), our method simultaneously denoises the image and intrinsic domains by carefully sharing mutual information so that the image and intrinsic reflect each other without degrading image quality. Experimental results demonstrate that our method corrects spatial inconsistencies and produces a more natural layout of scenes while maintaining the fidelity and textual alignment of the base model (e.g., Stable Diffusion).

[255] SARVLM: A Vision Language Foundation Model for Semantic Understanding and Target Recognition in SAR Imagery

Qiwei Ma, Zhiyu Wang, Wang Liu, Xukun Lu, Bin Deng, Puhong Duan, Xudong Kang, Shutao Li

Main category: cs.CV

TL;DR: SARVLM is the first vision-language foundation model for SAR imagery, trained on a large-scale dataset (SARVLM-1M) with domain transfer strategy to bridge natural and SAR imagery gaps, enabling superior multimodal understanding and zero-shot capabilities.

Details

Motivation: Existing SAR foundation models focus on low-level visual features but lack multimodal alignment and zero-shot target recognition capabilities, limiting semantic understanding of SAR imagery.

Method: Constructed SARVLM-1M dataset with 1M+ image-text pairs, proposed domain transfer training strategy to address natural-SAR imagery gap, and developed SARVLM model with SARCLIP and SARCap components using vision-language contrastive learning.

Result: SARVLM achieves superior performance in image-text retrieval, zero-shot classification, semantic localization, and imagery captioning, outperforming state-of-the-art VLMs in SAR semantic understanding.

Conclusion: SARVLM advances SAR semantic understanding by effectively bridging SAR imagery with textual descriptions through multimodal alignment, demonstrating strong zero-shot capabilities and feature extraction performance.

Abstract: Synthetic Aperture Radar (SAR) is a crucial imaging modality thanks to its all-weather capability. Although recent advances in self-supervised learning and masked image modeling (MIM) have enabled SAR foundation models, these methods largely emphasize low-level visual features and often overlook multimodal alignment and zero-shot target recognition in SAR imagery. To address this, we construct SARVLM-1M, a large-scale vision-language dataset with over one million image-text pairs aggregated from existing datasets. We further propose a domain transfer training strategy to mitigate the large gap between natural and SAR imagery. Building on this, we develop SARVLM, the first vision language foundation model (VLM) tailored to SAR, comprising SARCLIP and SARCap. SARVLM is trained with a vision-language contrastive objective under the proposed domain transfer strategy, bridging SAR imagery and textual descriptions. Extensive experiments on image text retrieval, zero-shot classification, semantic localization, and imagery captioning demonstrate that SARVLM delivers superior feature extraction and interpretation, outperforming state-of-the-art VLMs and advancing SAR semantic understanding. Code and datasets will be released soon.

[256] Directed-Tokens: A Robust Multi-Modality Alignment Approach to Large Language-Vision Models

Thanh-Dat Truong, Huu-Thien Tran, Tran Thai Son, Bhiksha Raj, Khoa Luu

Main category: cs.CV

TL;DR: A novel learning mechanism for large multimodal models that improves robustness and generalization through shuffling tasks and directed-token approach.

Details

Motivation: Current large multimodal models suffer from limitations in robustness and generalization due to alignment issues between visual and textual features.

Method: Introduces two shuffling tasks (reconstructing image order and text order) in pre-training and fine-tuning, plus a directed-token approach and Image-to-Response Guided loss.

Result: Consistently achieves state-of-the-art performance on academic task-oriented and instruction-following LMM benchmarks.

Conclusion: The proposed approach effectively improves reasoning capability, visual understanding, and cross-modality alignment in large multimodal models.

Abstract: Large multimodal models (LMMs) have gained impressive performance due to their outstanding capability in various understanding tasks. However, these models still suffer from some fundamental limitations related to robustness and generalization due to the alignment and correlation between visual and textual features. In this paper, we introduce a simple but efficient learning mechanism for improving the robust alignment between visual and textual modalities by solving shuffling problems. In particular, the proposed approach can improve reasoning capability, visual understanding, and cross-modality alignment by introducing two new tasks: reconstructing the image order and the text order into the LMM’s pre-training and fine-tuning phases. In addition, we propose a new directed-token approach to capture visual and textual knowledge, enabling the capability to reconstruct the correct order of visual inputs. Then, we introduce a new Image-to-Response Guided loss to further improve the visual understanding of the LMM in its responses. The proposed approach consistently achieves state-of-the-art (SoTA) performance compared with prior LMMs on academic task-oriented and instruction-following LMM benchmarks.

[257] Uncovering and Mitigating Destructive Multi-Embedding Attacks in Deepfake Proactive Forensics

Lixin Jia, Haiyang Sun, Zhiqing Guo, Yunfeng Diao, Dan Ma, Gaobo Yang

Main category: cs.CV

TL;DR: This paper identifies Multi-Embedding Attacks (MEA) as a vulnerability in deepfake proactive forensics, where subsequent watermark embeddings can destroy original forensic watermarks, and proposes Adversarial Interference Simulation (AIS) to enhance watermark resilience.

Details

Motivation: Existing deepfake proactive forensic methods rely on single watermark embedding assumptions, which are impractical in real-world scenarios where images may undergo multiple watermarking rounds, rendering original forensic mechanisms ineffective.

Method: Proposed Adversarial Interference Simulation (AIS) training paradigm that simulates MEA scenarios during fine-tuning, uses resilience-driven loss function to enforce learning of sparse and stable watermark representations without modifying network architecture.

Result: Extensive experiments show AIS significantly enhances robustness of various existing methods against MEA, enabling models to correctly extract original watermarks even after second embedding.

Conclusion: AIS provides a plug-and-play solution to address Multi-Embedding Attacks, making deepfake proactive forensics more practical and resilient in real-world scenarios with multiple watermarking operations.

Abstract: With the rapid evolution of deepfake technologies and the wide dissemination of digital media, personal privacy is facing increasingly serious security threats. Deepfake proactive forensics, which involves embedding imperceptible watermarks to enable reliable source tracking, serves as a crucial defense against these threats. Although existing methods show strong forensic ability, they rely on an idealized assumption of single watermark embedding, which proves impractical in real-world scenarios. In this paper, we formally define and demonstrate the existence of Multi-Embedding Attacks (MEA) for the first time. When a previously protected image undergoes additional rounds of watermark embedding, the original forensic watermark can be destroyed or removed, rendering the entire proactive forensic mechanism ineffective. To address this vulnerability, we propose a general training paradigm named Adversarial Interference Simulation (AIS). Rather than modifying the network architecture, AIS explicitly simulates MEA scenarios during fine-tuning and introduces a resilience-driven loss function to enforce the learning of sparse and stable watermark representations. Our method enables the model to maintain the ability to extract the original watermark correctly even after a second embedding. Extensive experiments demonstrate that our plug-and-play AIS training paradigm significantly enhances the robustness of various existing methods against MEA.

[258] DensiCrafter: Physically-Constrained Generation and Fabrication of Self-Supporting Hollow Structures

Shengqi Dang, Fu Chai, Jiaxin Li, Chao Yuan, Wei Ye, Nan Cao

Main category: cs.CV

TL;DR: DensiCrafter generates lightweight, self-supporting 3D hollow structures by optimizing density fields from coarse voxel grids, achieving up to 43% material reduction while maintaining geometric fidelity and manufacturability.

Details

Motivation: Current 3D generative models ignore physical constraints and manufacturability, particularly the need for lightweight and self-supporting structures suitable for 3D printing.

Method: Optimizes continuous density fields from coarse voxel grids using three differentiable, physically constrained loss terms and mass regularization, while preserving outer surfaces through restricted optimization domains.

Result: Achieves up to 43% material mass reduction in text-to-3D tasks, improves stability compared to baselines, and maintains high geometric fidelity. Real-world 3D-printing confirms reliable fabrication of self-supporting hollow designs.

Conclusion: DensiCrafter successfully bridges the gap between 3D generative models and manufacturability requirements, producing lightweight, self-supporting structures without architectural changes to existing models.

Abstract: The rise of 3D generative models has enabled automatic 3D geometry and texture synthesis from multimodal inputs (e.g., text or images). However, these methods often ignore physical constraints and manufacturability considerations. In this work, we address the challenge of producing 3D designs that are both lightweight and self-supporting. We present DensiCrafter, a framework for generating lightweight, self-supporting 3D hollow structures by optimizing the density field. Starting from coarse voxel grids produced by Trellis, we interpret these as continuous density fields to optimize and introduce three differentiable, physically constrained, and simulation-free loss terms. Additionally, a mass regularization penalizes unnecessary material, while a restricted optimization domain preserves the outer surface. Our method seamlessly integrates with pretrained Trellis-based models (e.g., Trellis, DSO) without any architectural changes. In extensive evaluations, we achieve up to 43% reduction in material mass on the text-to-3D task. Compared to state-of-the-art baselines, our method could improve the stability and maintain high geometric fidelity. Real-world 3D-printing experiments confirm that our hollow designs can be reliably fabricated and could be self-supporting.

[259] FastAvatar: Instant 3D Gaussian Splatting for Faces from Single Unconstrained Poses

Hao Liang, Zhixuan Ge, Soumendu Majee, Ashish Tiwari, G. M. Dilshan Godaliyadda, Ashok Veeraraghavan, Guha Balakrishnan

Main category: cs.CV

TL;DR: FastAvatar enables fast 3D face reconstruction from a single image using 3D Gaussian Splatting, achieving high-quality results in ~3 seconds with a hybrid prediction-optimization approach.

Details

Motivation: To create fast and robust single-image 3D face reconstruction that overcomes the slow speed of existing per-subject optimization methods while maintaining high fidelity and pose robustness.

Method: Two-stage design: feed-forward encoder-decoder predicts coarse geometry from pose-invariant identity embedding, followed by lightweight test-time refinement of appearance parameters for photorealistic rendering.

Result: Achieves state-of-the-art reconstruction quality (24.01 dB PSNR, 0.91 SSIM) while running 600x faster than existing methods, supporting novel-view synthesis and expression animation.

Conclusion: FastAvatar significantly broadens the applicability of 3DGS-based facial avatars by offering high fidelity, pose robustness, and rapid reconstruction in a practical timeframe.

Abstract: We present FastAvatar, a fast and robust algorithm for single-image 3D face reconstruction using 3D Gaussian Splatting (3DGS). Given a single input image from an arbitrary pose, FastAvatar recovers a high-quality, full-head 3DGS avatar in approximately 3 seconds on a single NVIDIA A100 GPU. We use a two-stage design: a feed-forward encoder-decoder predicts coarse face geometry by regressing Gaussian structure from a pose-invariant identity embedding, and a lightweight test-time refinement stage then optimizes the appearance parameters for photorealistic rendering. This hybrid strategy combines the speed and stability of direct prediction with the accuracy of optimization, enabling strong identity preservation even under extreme input poses. FastAvatar achieves state-of-the-art reconstruction quality (24.01 dB PSNR, 0.91 SSIM) while running over 600x faster than existing per-subject optimization methods (e.g., FlashAvatar, GaussianAvatars, GASP). Once reconstructed, our avatars support photorealistic novel-view synthesis and FLAME-guided expression animation, enabling controllable reenactment from a single image. By jointly offering high fidelity, robustness to pose, and rapid reconstruction, FastAvatar significantly broadens the applicability of 3DGS-based facial avatars.

[260] EvoEmpirBench: Dynamic Spatial Reasoning with Agent-ExpVer

Pukun Zhao, Longxiang Wang, Miaowei Wang, Chen Chen, Fanqing Zhou, Haojian Huang

Main category: cs.CV

TL;DR: The paper introduces two dynamic spatial reasoning benchmarks to evaluate models’ abilities in spatial understanding and adaptive planning under partial observability and dynamic changes.

Details

Motivation: Existing spatial reasoning benchmarks focus on static or globally observable environments, failing to capture challenges of long-horizon reasoning and memory utilization under partial observability and dynamic changes.

Method: Proposed two dynamic spatial benchmarks: locally observable maze navigation and match-2 elimination, with a subjective experience-based memory mechanism for cross-task experience transfer and validation.

Result: Experiments reveal key limitations of mainstream models in dynamic spatial reasoning and long-term memory.

Conclusion: The benchmarks provide a comprehensive platform for future methodological advances in dynamic spatial reasoning.

Abstract: Most existing spatial reasoning benchmarks focus on static or globally observable environments, failing to capture the challenges of long-horizon reasoning and memory utilization under partial observability and dynamic changes. We introduce two dynamic spatial benchmarks, locally observable maze navigation and match-2 elimination that systematically evaluate models’ abilities in spatial understanding and adaptive planning when local perception, environment feedback, and global objectives are tightly coupled. Each action triggers structural changes in the environment, requiring continuous update of cognition and strategy. We further propose a subjective experience-based memory mechanism for cross-task experience transfer and validation. Experiments show that our benchmarks reveal key limitations of mainstream models in dynamic spatial reasoning and long-term memory, providing a comprehensive platform for future methodological advances. Our code and data are available at https://anonymous.4open.science/r/EvoEmpirBench-143C/.

[261] Passive Dementia Screening via Facial Temporal Micro-Dynamics Analysis of In-the-Wild Talking-Head Video

Filippo Cenacchi, Longbing Cao, Mitchell McEwan, Deborah Richards

Main category: cs.CV

TL;DR: Language-free dementia screening using facial micro-dynamics analysis from short talking head videos, achieving high performance without speech or text.

Details

Motivation: Existing dementia screening methods rely on speech or scripted interviews, limiting scalability and requiring clinical intervention. Need for passive, language-free screening that works across devices and cultures.

Method: Analyze temporal facial kinematics (blink dynamics, mouth/jaw motions, gaze variability, head adjustments) by converting micro movements into interpretable time series, smoothing them, and summarizing into clip-level statistics based on activity mix distribution across motion streams.

Result: On YT DemTalk dataset (300 clips), achieved AUROC 0.953, AP 0.961, F1-score 0.851, accuracy 0.857. Gaze lability and mouth/jaw dynamics identified as most informative cues.

Conclusion: Facial micro-dynamics analysis enables effective language-free dementia screening from unscripted videos, offering scalable passive monitoring without clinical intervention.

Abstract: We target passive dementia screening from short camera-facing talking head video, developing a facial temporal micro dynamics analysis for language free detection of early neuro cognitive change. This enables unscripted, in the wild video analysis at scale to capture natural facial behaviors, transferrable across devices, topics, and cultures without active intervention by clinicians or researchers during recording. Most existing resources prioritize speech or scripted interviews, limiting use outside clinics and coupling predictions to language and transcription. In contrast, we identify and analyze whether temporal facial kinematics, including blink dynamics, small mouth jaw motions, gaze variability, and subtle head adjustments, are sufficient for dementia screening without speech or text. By stabilizing facial signals, we convert these micro movements into interpretable facial microdynamic time series, smooth them, and summarize short windows into compact clip level statistics for screening. Each window is encoded by its activity mix (the relative share of motion across streams), thus the predictor analyzes the distribution of motion across streams rather than its magnitude, making per channel effects transparent. We also introduce YT DemTalk, a new dataset curated from publicly available, in the wild camera facing videos. It contains 300 clips (150 with self reported dementia, 150 controls) to test our model and offer a first benchmarking of the corpus. On YT DemTalk, ablations identify gaze lability and mouth/jaw dynamics as the most informative cues, and light weighted shallow classifiers could attain a dementia prediction performance of (AUROC) 0.953, 0.961 Average Precision (AP), 0.851 F1-score, and 0.857 accuracy.

[262] Multi-scale Temporal Prediction via Incremental Generation and Multi-agent Collaboration

Zhitao Zeng, Guojian Yuan, Junyuan Mao, Yuxuan Wang, Xiaoshuang Jia, Yueming Jin

Main category: cs.CV

TL;DR: The paper introduces Multi-Scale Temporal Prediction (MSTP) task for scene understanding, proposes a benchmark with synchronized multi-scale annotations, and presents IG-MC method with incremental generation and multi-agent collaboration for temporal prediction.

Details

Motivation: Accurate temporal prediction bridges scene understanding and embodied AI, but current vision-language models struggle with predicting multiple fine-grained states at multiple temporal scales in both general and surgical scenes.

Method: Proposes Incremental Generation and Multi-agent Collaboration (IG-MC) with: 1) plug-and-play incremental generation module for continuous visual previews, and 2) decision-driven multi-agent collaboration framework with generation, initiation, and assessment agents for dynamic prediction cycles.

Result: The method addresses the challenge of maintaining synchronized decisions and generated visuals across expanding temporal scales, preventing performance degradation as look-ahead intervals lengthen.

Conclusion: The MSTP task formalization and IG-MC method provide a unified approach for multi-scale temporal prediction that balances global coherence and local fidelity across varying temporal and state scales.

Abstract: Accurate temporal prediction is the bridge between comprehensive scene understanding and embodied artificial intelligence. However, predicting multiple fine-grained states of a scene at multiple temporal scales is difficult for vision-language models. We formalize the Multi-Scale Temporal Prediction (MSTP) task in general and surgical scenes by decomposing multi-scale into two orthogonal dimensions: the temporal scale, forecasting states of humans and surgery at varying look-ahead intervals, and the state scale, modeling a hierarchy of states in general and surgical scenes. For example, in general scenes, states of contact relationships are finer-grained than states of spatial relationships. In surgical scenes, medium-level steps are finer-grained than high-level phases yet remain constrained by their encompassing phase. To support this unified task, we introduce the first MSTP Benchmark, featuring synchronized annotations across multiple state scales and temporal scales. We further propose a method, Incremental Generation and Multi-agent Collaboration (IG-MC), which integrates two key innovations. First, we present a plug-and-play incremental generation module that continuously synthesizes up-to-date visual previews at expanding temporal scales to inform multiple decision-making agents, keeping decisions and generated visuals synchronized and preventing performance degradation as look-ahead intervals lengthen. Second, we present a decision-driven multi-agent collaboration framework for multi-state prediction, comprising generation, initiation, and multi-state assessment agents that dynamically trigger and evaluate prediction cycles to balance global coherence and local fidelity.

[263] Cheating Stereo Matching in Full-scale: Physical Adversarial Attack against Binocular Depth Estimation in Autonomous Driving

Kangqiao Zhao, Shuo Huai, Xurui Song, Jun Luo

Main category: cs.CV

TL;DR: First texture-enabled physical adversarial attack against stereo matching models using 3D PAEs with global camouflage texture for enhanced stealth and effectiveness across stereo camera viewpoints.

Details

Motivation: Existing adversarial attacks on autonomous driving perception are mostly limited to 2D patches targeting monocular perception, leaving stereo-based binocular depth estimation largely unexplored for physical adversarial examples.

Method: Proposes 3D PAE with global camouflage texture, 3D stereo matching rendering module for real-world alignment in binocular vision, and novel merging attack that blends target into environment through fine-grained PAE optimization.

Result: Extensive evaluations show PAEs successfully fool stereo models into producing erroneous depth information with significantly enhanced stealth and lethality compared to existing hiding attacks.

Conclusion: The proposed method demonstrates effective physical adversarial attacks on stereo matching models, addressing the gap in existing research and showing practical vulnerability in autonomous driving perception systems.

Abstract: Though deep neural models adopted to realize the perception of autonomous driving have proven vulnerable to adversarial examples, known attacks often leverage 2D patches and target mostly monocular perception. Therefore, the effectiveness of Physical Adversarial Examples (PAEs) on stereo-based binocular depth estimation remains largely unexplored. To this end, we propose the first texture-enabled physical adversarial attack against stereo matching models in the context of autonomous driving. Our method employs a 3D PAE with global camouflage texture rather than a local 2D patch-based one, ensuring both visual consistency and attack effectiveness across different viewpoints of stereo cameras. To cope with the disparity effect of these cameras, we also propose a new 3D stereo matching rendering module that allows the PAE to be aligned with real-world positions and headings in binocular vision. We further propose a novel merging attack that seamlessly blends the target into the environment through fine-grained PAE optimization. It has significantly enhanced stealth and lethality upon existing hiding attacks that fail to get seamlessly merged into the background. Extensive evaluations show that our PAEs can successfully fool the stereo models into producing erroneous depth information.

[264] ControlEvents: Controllable Synthesis of Event Camera Datawith Foundational Prior from Image Diffusion Models

Yixuan Hu, Yuxuan Xue, Simon Klenk, Daniel Cremers, Gerard Pons-Moll

Main category: cs.CV

TL;DR: ControlEvents is a diffusion-based generative model that synthesizes high-quality event data using control signals like text labels, 2D skeletons, and 3D body poses, leveraging foundation models to reduce data labeling costs.

Details

Motivation: Event cameras offer bio-inspired advantages but face challenges in obtaining large-scale labeled ground-truth data, which is costly and difficult to acquire.

Method: Uses diffusion-based generative model with control signals (text labels, 2D skeletons, 3D poses) and leverages diffusion prior from foundation models like Stable Diffusion for minimal fine-tuning with limited labeled data.

Result: Successfully synthesizes event data for visual recognition, 2D skeleton estimation, and 3D body pose estimation, enhancing model performance in all tasks and enabling generation from unseen text labels.

Conclusion: The approach streamlines event data generation, significantly reduces labeling costs, and demonstrates powerful text-based generation capabilities inherited from foundation models.

Abstract: In recent years, event cameras have gained significant attention due to their bio-inspired properties, such as high temporal resolution and high dynamic range. However, obtaining large-scale labeled ground-truth data for event-based vision tasks remains challenging and costly. In this paper, we present ControlEvents, a diffusion-based generative model designed to synthesize high-quality event data guided by diverse control signals such as class text labels, 2D skeletons, and 3D body poses. Our key insight is to leverage the diffusion prior from foundation models, such as Stable Diffusion, enabling high-quality event data generation with minimal fine-tuning and limited labeled data. Our method streamlines the data generation process and significantly reduces the cost of producing labeled event datasets. We demonstrate the effectiveness of our approach by synthesizing event data for visual recognition, 2D skeleton estimation, and 3D body pose estimation. Our experiments show that the synthesized labeled event data enhances model performance in all tasks. Additionally, our approach can generate events based on unseen text labels during training, illustrating the powerful text-based generation capabilities inherited from foundation models.

[265] XYZCylinder: Towards Compatible Feed-Forward 3D Gaussian Splatting for Driving Scenes via Unified Cylinder Lifting Method

Haochen Yu, Qiankun Liu, Hongyuan Liu, Jianfei Jiang, Juntao Lyu, Jiansheng Chen, Huimin Ma

Main category: cs.CV

TL;DR: XYZCylinder is a novel 3D reconstruction method for complex driving scenes that uses unified cylinder lifting to handle varying camera configurations and improve reconstruction accuracy from sparse 360° views.

Details

Motivation: Existing feed-forward 3D reconstruction methods struggle with complex driving scenes due to fixed view transformations that are incompatible with varying camera configurations and difficulty learning from sparse 360° views with minimal overlap.

Method: Proposes Unified Cylinder Camera Modeling (UCCM) to explicitly model projection parameters for diverse camera setups, and a hybrid representation with Cylinder Plane Feature Group (CPFG) modules to lift 2D image features to 3D space.

Result: Achieves state-of-the-art performance across different evaluation settings and demonstrates remarkable zero-shot compatibility in new scenes with different camera configurations.

Conclusion: XYZCylinder effectively addresses the limitations of existing methods by providing camera-agnostic 3D reconstruction with improved accuracy and compatibility for complex driving scenes.

Abstract: Feed-forward paradigms for 3D reconstruction have become a focus of recent research, which learn implicit, fixed view transformations to generate a single scene representation. However, their application to complex driving scenes reveals significant limitations. Two core challenges are responsible for this performance gap. First, the reliance on a fixed view transformation hinders compatibility to varying camera configurations. Second, the inherent difficulty of learning complex driving scenes from sparse 360° views with minimal overlap compromises the final reconstruction fidelity. To handle these difficulties, we introduce XYZCylinder, a novel method built upon a unified cylinder lifting method that integrates camera modeling and feature lifting. To tackle the compatibility problem, we design a Unified Cylinder Camera Modeling (UCCM) strategy. This strategy explicitly models projection parameters to unify diverse camera setups, thus bypassing the need for learning viewpoint-dependent correspondences. To improve the reconstruction accuracy, we propose a hybrid representation with several dedicated modules based on newly designed Cylinder Plane Feature Group (CPFG) to lift 2D image features to 3D space. Extensive evaluations confirm that XYZCylinder not only achieves state-of-the-art performance under different evaluation settings but also demonstrates remarkable compatibility in entirely new scenes with different camera settings in a zero-shot manner. Project page: \href{https://yuyuyu223.github.io/XYZCYlinder-projectpage/}{here}

[266] ConceptGuard: Proactive Safety in Text-and-Image-to-Video Generation through Multimodal Risk Detection

Ruize Ma, Minghong Cai, Yilei Jiang, Jiaming Han, Yi Feng, Yingshui Tan, Xiaoyong Zhu, Bo Zhang, Bo Zheng, Xiangyu Yue

Main category: cs.CV

TL;DR: ConceptGuard is a unified safeguard framework that proactively detects and mitigates unsafe semantics in multimodal video generation by identifying latent safety risks and steering the generative process away from unsafe concepts.

Details

Motivation: Current video generative models with multimodal prompts introduce new safety risks from individual modalities or their interactions, while existing safety methods are often text-only, require prior risk knowledge, or operate as post-generation auditors.

Method: Two-stage framework: 1) Contrastive detection module projects fused image-text inputs into structured concept space to identify latent safety risks; 2) Semantic suppression mechanism intervenes in multimodal conditioning to steer generation away from unsafe concepts.

Result: Comprehensive experiments on ConceptRisk and T2VSafetyBench-TI2V benchmarks show ConceptGuard consistently outperforms existing baselines, achieving state-of-the-art results in both risk detection and safe video generation.

Conclusion: ConceptGuard provides an effective proactive safeguard for multimodal video generation, addressing compositional safety risks through structured concept detection and semantic suppression.

Abstract: Recent progress in video generative models has enabled the creation of high-quality videos from multimodal prompts that combine text and images. While these systems offer enhanced controllability, they also introduce new safety risks, as harmful content can emerge from individual modalities or their interaction. Existing safety methods are often text-only, require prior knowledge of the risk category, or operate as post-generation auditors, struggling to proactively mitigate such compositional, multimodal risks. To address this challenge, we present ConceptGuard, a unified safeguard framework for proactively detecting and mitigating unsafe semantics in multimodal video generation. ConceptGuard operates in two stages: First, a contrastive detection module identifies latent safety risks by projecting fused image-text inputs into a structured concept space; Second, a semantic suppression mechanism steers the generative process away from unsafe concepts by intervening in the prompt’s multimodal conditioning. To support the development and rigorous evaluation of this framework, we introduce two novel benchmarks: ConceptRisk, a large-scale dataset for training on multimodal risks, and T2VSafetyBench-TI2V, the first benchmark adapted from T2VSafetyBench for the Text-and-Image-to-Video (TI2V) safety setting. Comprehensive experiments on both benchmarks show that ConceptGuard consistently outperforms existing baselines, achieving state-of-the-art results in both risk detection and safe video generation. Our code is available at https://github.com/Ruize-Ma/ConceptGuard.

[267] VA-GS: Enhancing the Geometric Representation of Gaussian Splatting via View Alignment

Qing Li, Huifang Feng, Xun Gong, Yu-Shen Liu

Main category: cs.CV

TL;DR: A novel method that enhances 3D Gaussian Splatting for better surface reconstruction through view alignment, incorporating edge-aware rendering, visibility-aware photometric alignment, normal constraints, and deep feature embeddings.

Details

Motivation: 3D Gaussian Splatting is efficient for novel view synthesis but struggles with accurate surface reconstruction due to discrete Gaussians and image-only supervision leading to inaccurate geometry and inconsistent multi-view alignment.

Method: Proposes view alignment (VA) method with: edge-aware image cues in rendering loss, visibility-aware photometric alignment for cross-view consistency, normal-based constraints for spatial orientation, and deep image feature embeddings for robust geometry learning.

Result: Extensive experiments show state-of-the-art performance in both surface reconstruction and novel view synthesis on standard benchmarks.

Conclusion: The proposed VA-GS method successfully enhances geometric representation of 3D Gaussians, achieving superior surface reconstruction while maintaining high-quality novel view synthesis capabilities.

Abstract: 3D Gaussian Splatting has recently emerged as an efficient solution for high-quality and real-time novel view synthesis. However, its capability for accurate surface reconstruction remains underexplored. Due to the discrete and unstructured nature of Gaussians, supervision based solely on image rendering loss often leads to inaccurate geometry and inconsistent multi-view alignment. In this work, we propose a novel method that enhances the geometric representation of 3D Gaussians through view alignment (VA). Specifically, we incorporate edge-aware image cues into the rendering loss to improve surface boundary delineation. To enforce geometric consistency across views, we introduce a visibility-aware photometric alignment loss that models occlusions and encourages accurate spatial relationships among Gaussians. To further mitigate ambiguities caused by lighting variations, we incorporate normal-based constraints to refine the spatial orientation of Gaussians and improve local surface estimation. Additionally, we leverage deep image feature embeddings to enforce cross-view consistency, enhancing the robustness of the learned geometry under varying viewpoints and illumination. Extensive experiments on standard benchmarks demonstrate that our method achieves state-of-the-art performance in both surface reconstruction and novel view synthesis. The source code is available at https://github.com/LeoQLi/VA-GS.

[268] Decorrelation Speeds Up Vision Transformers

Kieran Carrigg, Rob van Gastel, Melda Yeghaian, Sander Dalm, Faysal Boughorbel, Marcel van Gerven

Main category: cs.CV

TL;DR: DBP-MAE integrates Decorrelated Backpropagation into MAE pre-training to reduce computational costs and accelerate convergence while maintaining performance in low-label data regimes.

Details

Motivation: MAE pre-training of vision transformers provides strong performance but has substantial computational costs, making it impractical for time- and resource-constrained industrial settings.

Method: Integrate Decorrelated Backpropagation (DBP) into MAE pre-training, applying it selectively to the encoder to iteratively reduce input correlations at each layer for faster convergence without stability loss.

Result: DBP-MAE reduces wall-clock time to baseline performance by 21.1%, lowers carbon emissions by 21.4%, improves segmentation mIoU by 1.1 points on ImageNet-1K and ADE20K, with similar gains on proprietary industrial data.

Conclusion: DBP can effectively reduce training time and energy use while improving downstream performance for large-scale ViT pre-training in real-world industrial applications.

Abstract: Masked Autoencoder (MAE) pre-training of vision transformers (ViTs) yields strong performance in low-label data regimes but comes with substantial computational costs, making it impractical in time- and resource-constrained industrial settings. We address this by nitegrating Decorrelated Backpropagation (DBP) into MAE pre-training, an optimization method that iteratively reduces input correlations at each layer to accelerate convergence. Applied selectively to the encoder, DBP achieves faster pre-training without loss of stability. To mimic constrained-data scenarios, we evaluate our approach on ImageNet-1K pre-training and ADE20K fine-tuning using randomly sampled subsets of each dataset. Under this setting, DBP-MAE reduces wall-clock time to baseline performance by 21.1%, lowers carbon emissions by 21.4%, and improves segmentation mIoU by 1.1 points. We observe similar gains when pre-training and fine-tuning on proprietary industrial data, confirming the method’s applicability in real-world scenarios. These results demonstrate that DBP can reduce training time and energy use while improving downstream performance for large-scale ViT pre-training. Keywords: Deep learning, Vision transformers, Efficient AI, Decorrelation

[269] Saliency-R1: Incentivizing Unified Saliency Reasoning Capability in MLLM with Confidence-Guided Reinforcement Learning

Long Li, Shuichen Ji, Ziyang Luo, Zhihui Li, Dingwen Zhang, Junwei Han, Nian Liu

Main category: cs.CV

TL;DR: Saliency-R1 is a unified MLLM framework that integrates saliency reasoning into multimodal models, addressing three key saliency tasks (SOD, SIS, CoSOD) through structured textual interfaces and efficient training with CGPO algorithm.

Details

Motivation: Current multimodal large language models lack visual saliency awareness, making it difficult to identify key visual elements despite excelling at high-level vision-language reasoning.

Method: Proposes a unified framework with structured textual tags (, ) for region- and instance-level referring expressions, and introduces Confidence-Guided Policy Optimization (CGPO) - a single-sample RL algorithm that improves on GRPO by using reward-confidence discrepancy for better training efficiency.

Result: The model exceeds or matches performance of robust open/closed-source MLLMs and specialized state-of-the-art methods across all three saliency tasks (SOD, SIS, CoSOD).

Conclusion: The framework effectively enhances MLLMs’ capacity for saliency reasoning, demonstrating the efficacy of the proposed approach in bridging the gap between high-level reasoning and visual saliency awareness.

Abstract: Although multimodal large language models (MLLMs) excel in high-level vision-language reasoning, they lack inherent awareness of visual saliency, making it difficult to identify key visual elements. To bridge this gap, we propose Saliency-R1, the first unified MLLM framework that jointly tackles three representative and heterogeneous saliency tasks: Salient Object Detection (SOD), Salient Instance Segmentation (SIS), and Co-salient Object Detection (CoSOD), enhancing the model’s capacity for saliency reasoning. We introduce a textual interface with structured tags (, ) to encode region- and instance-level referring expressions, enabling a single referring segmenter to produce task-appropriate masks. To train the MLLM efficiently, we propose Confidence-Guided Policy Optimization (CGPO), a novel single-sample reinforcement learning algorithm. CGPO improves on GRPO by replacing group-normalized advantages with a per-sample signal based on reward-confidence discrepancy, thereby reducing computational waste, mitigating signal dilution, and lowering training overhead. Our model exceeds or matches the performance of robust open/closed-source MLLMs and specialized state-of-the-art methods across all three tasks, demonstrating the efficacy of our framework in saliency reasoning.

[270] Probabilistic Robustness for Free? Revisiting Training via a Benchmark

Yi Zhang, Zheng Wang, Zhen Chen, Wenjie Ruan, Qing Guo, Siddartha Khastgir, Carsten Maple, Xingyu Zhao

Main category: cs.CV

TL;DR: PRBench is the first benchmark for evaluating probabilistic robustness (PR) training methods, revealing that adversarial training (AT) methods are more versatile for improving both adversarial and probabilistic robustness, while PR-targeted methods offer better generalization and clean accuracy.

Details

Motivation: Current research focuses heavily on adversarial robustness (AR) while probabilistic robustness (PR) training methods are underexplored, with limitations in evaluation protocols, comparisons to strong baselines, and unified frameworks for generalization analysis.

Method: Developed PRBench benchmark to empirically compare AT and PR-targeted training methods using comprehensive metrics including clean accuracy, PR/AR performance, training efficiency, and generalization error, with theoretical analysis on generalization.

Result: AT methods are more versatile across diverse hyperparameter settings for improving both AR and PR, while PR-targeted methods consistently yield lower generalization error and higher clean accuracy. Created leaderboard with 222 trained models across 7 datasets and 10 architectures.

Conclusion: PRBench provides systematic evaluation showing complementary strengths of AT and PR-targeted methods, with AT offering broader robustness improvements and PR-targeted methods providing better generalization and clean accuracy.

Abstract: Deep learning models are notoriously vulnerable to imperceptible perturbations. Most existing research centers on adversarial robustness (AR), which evaluates models under worst-case scenarios by examining the existence of deterministic adversarial examples (AEs). In contrast, probabilistic robustness (PR) adopts a statistical perspective, measuring the probability that predictions remain correct under stochastic perturbations. While PR is widely regarded as a practical complement to AR, dedicated training methods for improving PR are still relatively underexplored, albeit with emerging progress. Among the few PR-targeted training methods, we identify three limitations: i non-comparable evaluation protocols; ii limited comparisons to strong AT baselines despite anecdotal PR gains from AT; and iii no unified framework to compare the generalization of these methods. Thus, we introduce PRBench, the first benchmark dedicated to evaluating improvements in PR achieved by different robustness training methods. PRBench empirically compares most common AT and PR-targeted training methods using a comprehensive set of metrics, including clean accuracy, PR and AR performance, training efficiency, and generalization error (GE). We also provide theoretical analysis on the GE of PR performance across different training methods. Main findings revealed by PRBench include: AT methods are more versatile than PR-targeted training methods in terms of improving both AR and PR performance across diverse hyperparameter settings, while PR-targeted training methods consistently yield lower GE and higher clean accuracy. A leaderboard comprising 222 trained models across 7 datasets and 10 model architectures is publicly available at https://tmpspace.github.io/PRBenchLeaderboard/.

[271] MoEGCL: Mixture of Ego-Graphs Contrastive Representation Learning for Multi-View Clustering

Jian Zhu, Xin Zou, Jun Sun, Cheng Luo, Lei Liu, Lingfang Zeng, Ning Zhang, Bian Wu, Chang Tang, Lirong Dai

Main category: cs.CV

TL;DR: MoEGCL is a novel multi-view clustering method that uses fine-grained ego-graph fusion at sample level instead of coarse view-level fusion, achieving state-of-the-art performance.

Details

Motivation: Existing GNN-based multi-view clustering methods suffer from coarse-grained graph fusion, where separate graph structures are fused at view level rather than at a more granular sample level.

Method: Proposes Mixture of Ego-Graphs Fusion (MoEGF) that constructs ego graphs and uses Mixture-of-Experts network for fine-grained sample-level fusion, plus Ego Graph Contrastive Learning (EGCL) to align fused and view-specific representations.

Result: Extensive experiments show MoEGCL achieves state-of-the-art results in deep multi-view clustering tasks.

Conclusion: MoEGCL successfully addresses the coarse-grained fusion problem in multi-view clustering through fine-grained ego-graph fusion and contrastive learning, demonstrating superior performance over existing methods.

Abstract: In recent years, the advancement of Graph Neural Networks (GNNs) has significantly propelled progress in Multi-View Clustering (MVC). However, existing methods face the problem of coarse-grained graph fusion. Specifically, current approaches typically generate a separate graph structure for each view and then perform weighted fusion of graph structures at the view level, which is a relatively rough strategy. To address this limitation, we present a novel Mixture of Ego-Graphs Contrastive Representation Learning (MoEGCL). It mainly consists of two modules. In particular, we propose an innovative Mixture of Ego-Graphs Fusion (MoEGF), which constructs ego graphs and utilizes a Mixture-of-Experts network to implement fine-grained fusion of ego graphs at the sample level, rather than the conventional view-level fusion. Additionally, we present the Ego Graph Contrastive Learning (EGCL) module to align the fused representation with the view-specific representation. The EGCL module enhances the representation similarity of samples from the same cluster, not merely from the same sample, further boosting fine-grained graph representation. Extensive experiments demonstrate that MoEGCL achieves state-of-the-art results in deep multi-view clustering tasks. The source code is publicly available at https://github.com/HackerHyper/MoEGCL.

[272] TinyChemVL: Advancing Chemical Vision-Language Models via Efficient Visual Token Reduction and Complex Reaction Tasks

Xuanle Zhao, Shuxin Zeng, Xinyuan Cai, Xiang Cheng, Duzhen Zhang, Xiuyi Chen, Bo Xu

Main category: cs.CV

TL;DR: TinyChemVL is an efficient 4B-parameter chemical VLM that uses visual token reduction and reaction-level tasks to improve efficiency and reasoning, outperforming larger models while using fewer visual tokens.

Details

Motivation: Current VLMs for chemical tasks are computationally inefficient due to processing entire chemical images with non-informative backgrounds, and have narrow scope focusing only on molecular-level tasks, limiting chemical reasoning capabilities.

Method: Proposes TinyChemVL with visual token reduction to process fewer tokens, and introduces reaction-level tasks for enhanced reasoning. Also creates ChemRxn-V benchmark for vision-based reaction recognition and prediction.

Result: TinyChemVL achieves superior performance on both molecular and reaction tasks with faster inference/training speeds, outperforming ChemVLM while using only 1/16th of visual tokens.

Conclusion: This work demonstrates that efficient yet powerful chemical VLMs can be built through co-design of model architecture and task complexity, enabling better chemical reasoning with reduced computational costs.

Abstract: While Vision Language Models (VLMs) have demonstrated remarkable capabilities in general visual understanding, their application in the chemical domain has been limited, with previous works predominantly focusing on text and thus overlooking critical visual information, such as molecular structures. Current approaches that directly adopt standard VLMs for chemical tasks suffer from two primary issues: (i) computational inefficiency of processing entire chemical images with non-informative backgrounds. (ii) a narrow scope on molecular-level tasks that restricts progress in chemical reasoning. In this work, we propose \textbf{TinyChemVL}, an efficient and powerful chemical VLM that leverages visual token reduction and reaction-level tasks to improve model efficiency and reasoning capacity. Also, we propose \textbf{ChemRxn-V}, a reaction-level benchmark for assessing vision-based reaction recognition and prediction tasks. Directly predicting reaction products from molecular images poses a non-trivial challenge, as it requires models to integrate both recognition and reasoning capacities. Our results demonstrate that with only 4B parameters, TinyChemVL achieves superior performance on both molecular and reaction tasks while demonstrating faster inference and training speeds compared to existing models. Notably, TinyChemVL outperforms ChemVLM while utilizing only 1/16th of the visual tokens. This work builds efficient yet powerful VLMs for chemical domains by co-designing model architecture and task complexity.

[273] DWFF-Net : A Multi-Scale Farmland System Habitat Identification Method with Adaptive Dynamic Weight

Kesong Zheng, Zhi Song, Peizhou Li, Shuyi Yao, Zhenxing Bian

Main category: cs.CV

TL;DR: Proposed DWFF-Net with dynamic-weighted feature fusion for cultivated land habitat segmentation, achieving 69.79% mIoU and outperforming baselines by 2.1% using frozen DINOv3 encoder and adaptive multi-layer fusion.

Details

Motivation: Lack of standardized habitat classification system for cultivated land ecosystems, incomplete habitat type coverage, and existing models' inability to effectively integrate semantic and texture features, leading to insufficient segmentation accuracy and blurred boundaries for multi-scale habitats.

Method: Developed comprehensive annotated ultra-high-resolution remote sensing dataset with 15 habitat categories. Proposed DWFF-Net with frozen-parameter DINOv3 encoder, data-level adaptive dynamic weighting for feature fusion, dynamic weight computation network in decoder for multi-layer feature integration, and hybrid loss function for training optimization.

Result: Achieved 69.79% mIoU and 80.49% F1-score on constructed dataset, outperforming baseline by 2.1% and 1.61% respectively. Ablation studies confirmed complementary nature of multi-layer feature fusion, improving IoU for micro-habitat categories like field ridges.

Conclusion: Established habitat identification framework for cultivated land systems based on adaptive multi-layer feature fusion, enabling sub-meter precision habitat mapping at low cost and providing robust technical support for fine-grained habitat monitoring in cultivated landscapes.

Abstract: Addressing the current lack of a standardized habitat classification system for cultivated land ecosystems, incomplete coverage of the habitat types, and the inability of existing models to effectively integrate semantic and texture features-resulting in insufficient segmentation accuracy and blurred boundaries for multi-scale habitats (e.g., large-scale field plots and micro-habitats)-this study developed a comprehensively annotated ultra-high-resolution remote sensing image dataset encompassing 15 categories of cultivated land system habitats. Furthermore, we propose a Dynamic-Weighted Feature Fusion Network (DWFF-Net). The encoder of this model utilizes a frozen-parameter DINOv3 to extract foundational features. By analyzing the relationships between different category images and feature maps, we introduce a data-level adaptive dynamic weighting strategy for feature fusion. The decoder incorporates a dynamic weight computation network to achieve thorough integration of multi-layer features, and a hybrid loss function is adopted to optimize model training. Experimental results on the constructed dataset demonstrate that the proposed model achieves a mean Intersection over Union (mIoU) of 69.79% and an F1-score of 80.49%, outperforming the baseline network by 2.1% and 1.61%, respectively. Ablation studies further confirm the complementary nature of multi-layer feature fusion, which effectively improves the IoU for micro-habitat categories such as field ridges. This study establishes a habitat identification framework for cultivated land systems based on adaptive multi-layer feature fusion, enabling sub-meter precision habitat mapping at a low cost and providing robust technical support for fine-grained habitat monitoring in cultivated landscapes. (The complete code repository can be accessed via GitHub at the following URL: https://github.com/sysau/DWFF-Net)

[274] Point-Supervised Facial Expression Spotting with Gaussian-Based Instance-Adaptive Intensity Modeling

Yicheng Deng, Hideaki Hayashi, Hajime Nagahara

Main category: cs.CV

TL;DR: Proposes a point-supervised facial expression spotting framework using Gaussian-based intensity modeling and two-branch architecture for macro/micro-expression detection with only single timestamp annotations.

Details

Motivation: Existing methods require costly temporal boundary annotations for facial expression spotting. This work aims to reduce annotation burden by using only point supervision (single timestamp per instance) while maintaining performance.

Method: Two-branch framework: 1) Class-agnostic expression intensity branch with Gaussian-based instance-adaptive intensity modeling (GIM) for soft pseudo-labeling, 2) Class-aware apex classification branch for macro/micro-expression distinction. Uses intensity-aware contrastive loss for feature learning.

Result: Extensive experiments on SAMM-LV, CAS(ME)^2, and CAS(ME)^3 datasets demonstrate the framework’s effectiveness in point-supervised facial expression spotting.

Conclusion: The proposed point-supervised framework successfully reduces annotation requirements while achieving competitive performance in facial expression spotting through innovative intensity modeling and two-branch architecture.

Abstract: Automatic facial expression spotting, which aims to identify facial expression instances in untrimmed videos, is crucial for facial expression analysis. Existing methods primarily focus on fully-supervised learning and rely on costly, time-consuming temporal boundary annotations. In this paper, we investigate point-supervised facial expression spotting (P-FES), where only a single timestamp annotation per instance is required for training. We propose a unique two-branch framework for P-FES. First, to mitigate the limitation of hard pseudo-labeling, which often confuses neutral and expression frames with various intensities, we propose a Gaussian-based instance-adaptive intensity modeling (GIM) module to model instance-level expression intensity distribution for soft pseudo-labeling. By detecting the pseudo-apex frame around each point label, estimating the duration, and constructing an instance-level Gaussian distribution, GIM assigns soft pseudo-labels to expression frames for more reliable intensity supervision. The GIM module is incorporated into our framework to optimize the class-agnostic expression intensity branch. Second, we design a class-aware apex classification branch that distinguishes macro- and micro-expressions solely based on their pseudo-apex frames. During inference, the two branches work independently: the class-agnostic expression intensity branch generates expression proposals, while the class-aware apex-classification branch is responsible for macro- and micro-expression classification. Furthermore, we introduce an intensity-aware contrastive loss to enhance discriminative feature learning and suppress neutral noise by contrasting neutral frames with expression frames with various intensities. Extensive experiments on the SAMM-LV, CAS(ME)$^2$, and CAS(ME)$^3$ datasets demonstrate the effectiveness of our proposed framework.

[275] Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning

Jiaqi Liu, Kaiwen Xiong, Peng Xia, Yiyang Zhou, Haonian Ji, Lu Feng, Siwei Han, Mingyu Ding, Huaxiu Yao

Main category: cs.CV

TL;DR: Agent0-VL is a self-evolving vision-language agent that uses tool-integrated reasoning for self-evaluation and self-repair, achieving continual improvement without human supervision.

Details

Motivation: To overcome limitations of human-annotated supervision and address text-based self-evaluation struggles with complex visual reasoning and evaluation hallucinations.

Method: Unifies Solver (multi-turn tool-integrated reasoning) and Verifier (tool-grounded critique with structured feedback) roles within a single LVLM through a Self-Evolving Reasoning Cycle.

Result: Achieves 12.5% improvement over base model on geometric problem solving and visual scientific analysis tasks.

Conclusion: Agent0-VL enables stable self-improvement through tool-based verification and reinforcement learning without external rewards or human annotation.

Abstract: Vision-language agents have achieved remarkable progress in a variety of multimodal reasoning tasks; however, their learning remains constrained by the limitations of human-annotated supervision. Recent self-rewarding approaches attempt to overcome this constraint by allowing models to act as their own critics or reward providers. Yet, purely text-based self-evaluation struggles to verify complex visual reasoning steps and often suffers from evaluation hallucinations. To address these challenges, inspired by recent advances in tool-integrated reasoning, we propose Agent0-VL, a self-evolving vision-language agent that achieves continual improvement with tool-integrated reasoning. Agent0-VL incorporates tool usage not only into reasoning but also into self-evaluation and self-repair, enabling the model to introspect, verify, and refine its reasoning through evidence-grounded analysis. It unifies two synergistic roles within a single LVLM: a Solver that performs multi-turn tool-integrated reasoning, and a Verifier that generates structured feedback and fine-grained self-rewards through tool-grounded critique. These roles interact through a Self-Evolving Reasoning Cycle, where tool-based verification and reinforcement learning jointly align the reasoning and evaluation distributions for stable self-improvement. Through this zero-external-reward evolution, Agent0-VL aligns its reasoning and verification behaviors without any human annotation or external reward models, achieving continual self-improvement. Experiments on geometric problem solving and visual scientific analysis show that Agent0-VL achieves an 12.5% improvement over the base model. Our code is available at https://github.com/aiming-lab/Agent0.

[276] One-Step Diffusion Transformer for Controllable Real-World Image Super-Resolution

Yushun Fang, Yuxiang Chen, Shibo Yin, Qiang Hu, Jiangchao Yao, Ya Zhang, Xiaoyun Zhang, Yanfeng Wang

Main category: cs.CV

TL;DR: ODTSR is a one-step diffusion transformer for real-world image super-resolution that balances fidelity and controllability using a noise-hybrid visual stream design and fidelity-aware adversarial training.

Details

Motivation: To address the trade-off between fidelity and controllability in diffusion-based real-world image super-resolution, where multi-step methods suffer from low fidelity due to generative diversity, while one-step methods lose control flexibility.

Method: Uses a noise-hybrid visual stream with adjustable noise for control and consistent prior noise, combined with fidelity-aware adversarial training to enable one-step inference while maintaining controllability.

Result: Achieves state-of-the-art performance on generic real-world image super-resolution and enables prompt controllability on challenging scenarios like Chinese character text super-resolution without specific dataset training.

Conclusion: ODTSR successfully balances fidelity and controllability in real-world image super-resolution through its novel noise-hybrid design and training approach, demonstrating strong performance across various applications.

Abstract: Recent advances in diffusion-based real-world image super-resolution (Real-ISR) have demonstrated remarkable perceptual quality, yet the balance between fidelity and controllability remains a problem: multi-step diffusion-based methods suffer from generative diversity and randomness, resulting in low fidelity, while one-step methods lose control flexibility due to fidelity-specific finetuning. In this paper, we present ODTSR, a one-step diffusion transformer based on Qwen-Image that performs Real-ISR considering fidelity and controllability simultaneously: a newly introduced visual stream receives low-quality images (LQ) with adjustable noise (Control Noise), and the original visual stream receives LQs with consistent noise (Prior Noise), forming the Noise-hybrid Visual Stream (NVS) design. ODTSR further employs Fidelity-aware Adversarial Training (FAA) to enhance controllability and achieve one-step inference. Extensive experiments demonstrate that ODTSR not only achieves state-of-the-art (SOTA) performance on generic Real-ISR, but also enables prompt controllability on challenging scenarios such as real-world scene text image super-resolution (STISR) of Chinese characters without training on specific datasets. Codes are available at $\href{https://github.com/RedMediaTech/ODTSR}{\text{this url}}$.

[277] EmoFeedback$^2$: Reinforcement of Continuous Emotional Image Generation via LVLM-based Reward and Textual Feedback

Jingyang Jia, Kai Shu, Gang Yang, Long Xing, Xun Chen, Aiping Liu

Main category: cs.CV

TL;DR: Proposes EmoFeedback², a reinforcement paradigm for continuous emotional image generation that uses fine-tuned LVLM to provide reward and textual feedback, improving emotional continuity and fidelity.

Details

Motivation: Existing C-EIG approaches lack emotional feedback from generated images and have simple emotion-text alignment, limiting emotional continuity control and fidelity.

Method: Uses generation-understanding-feedback reinforcement paradigm with emotion-aware reward feedback and self-promotion textual feedback framework using fine-tuned LVLM.

Result: Outperforms state-of-the-art methods on custom dataset, effectively generating high-quality images with desired emotions.

Conclusion: EmoFeedback² successfully addresses emotional continuity and fidelity issues in C-EIG through LVLM-based feedback mechanisms.

Abstract: Continuous emotional image generation (C-EICG) is emerging rapidly due to its ability to produce images aligned with both user descriptions and continuous emotional values. However, existing approaches lack emotional feedback from generated images, limiting the control of emotional continuity. Additionally, their simple alignment between emotions and naively generated texts fails to adaptively adjust emotional prompts according to image content, leading to insufficient emotional fidelity. To address these concerns, we propose a novel generation-understanding-feedback reinforcement paradigm (EmoFeedback$^2$) for C-EICG, which exploits the reasoning capability of the fine-tuned large vision-language model (LVLM) to provide reward and textual feedback for generating high-quality images with continuous emotions. Specifically, we introduce an emotion-aware reward feedback strategy, where the LVLM evaluates the emotional values of generated images and computes the reward against target emotions, guiding the reinforcement fine-tuning of the generative model and enhancing the emotional continuity of images. Furthermore, we design a self-promotion textual feedback framework, in which the LVLM iteratively analyzes the emotional content of generated images and adaptively produces refinement suggestions for the next-round prompt, improving the emotional fidelity with fine-grained content. Extensive experimental results demonstrate that our approach effectively generates high-quality images with the desired emotions, outperforming existing state-of-the-art methods in our custom dataset. The code and dataset will be released soon.

[278] Video-R4: Reinforcing Text-Rich Video Reasoning with Visual Rumination

Yolo Y. Tang, Daiki Shimada, Hang Hua, Chao Huang, Jing Bi, Rogerio Feris, Chenliang Xu

Main category: cs.CV

TL;DR: Video-R4 is a video reasoning LMM that performs visual rumination - iteratively selecting frames, zooming into regions, and updating reasoning state to better understand text-rich videos.

Details

Motivation: Current video QA models fail on fine-grained evidence because they use single-pass perception over fixed frames, leading to hallucinations. Humans pause, zoom, and re-read critical regions when understanding text-rich videos.

Method: Proposes visual rumination: iterative frame selection, region zooming, pixel re-encoding, and reasoning state updates. Uses multi-stage learning with supervised fine-tuning (SFT) and GRPO-based reinforcement learning on two datasets (Video-R4-CoT-17k and Video-R4-RL-30k).

Result: Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and generalizes well to multi-page document QA, slides QA, and generic video QA.

Conclusion: Iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning, enabling better understanding of text-rich videos through repeated inspection of critical regions.

Abstract: Understanding text-rich videos requires reading small, transient textual cues that often demand repeated inspection. Yet most video QA models rely on single-pass perception over fixed frames, leading to hallucinations and failures on fine-grained evidence. Inspired by how humans pause, zoom, and re-read critical regions, we introduce Video-R4 (Reinforcing Text-Rich Video Reasoning with Visual Rumination), a video reasoning LMM that performs visual rumination: iteratively selecting frames, zooming into informative regions, re-encoding retrieved pixels, and updating its reasoning state. We construct two datasets with executable rumination trajectories: Video-R4-CoT-17k for supervised practice and Video-R4-RL-30k for reinforcement learning. We propose a multi-stage rumination learning framework that progressively finetunes a 7B LMM to learn atomic and mixing visual operations via SFT and GRPO-based RL. Video-R4-7B achieves state-of-the-art results on M4-ViteVQA and further generalizes to multi-page document QA, slides QA, and generic video QA, demonstrating that iterative rumination is an effective paradigm for pixel-grounded multimodal reasoning. Project Page: https://yunlong10.github.io/Video-R4/

[279] Sequence-Adaptive Video Prediction in Continuous Streams using Diffusion Noise Optimization

Sina Mokhtarzadeh Azar, Emad Bahrami, Enrico Pallotta, Gianpiero Francesca, Radu Timofte, Juergen Gall

Main category: cs.CV

TL;DR: SAVi-DNO adapts diffusion-based video prediction models to continuous video streams by optimizing diffusion noise during inference, improving prediction quality without fine-tuning model parameters.

Details

Motivation: To improve video prediction for continuous streams by leveraging new training samples that become available over time, without the computational cost of fine-tuning large diffusion models.

Method: Refines diffusion noise during inference while keeping model parameters frozen, allowing adaptive determination of suitable sampling noise for continuous video adaptation.

Result: Improved performance on FVD, SSIM, and PSNR metrics across multiple datasets including Ego4D, OpenDV-YouTube, UCF-101, and SkyTimelapse, particularly effective for long continuous videos.

Conclusion: SAVi-DNO provides an efficient approach for continuous video stream adaptation in diffusion-based prediction models, achieving better performance through noise optimization without parameter updates.

Abstract: In this work, we investigate diffusion-based video prediction models, which forecast future video frames, for continuous video streams. In this context, the models observe continuously new training samples, and we aim to leverage this to improve their predictions. We thus propose an approach that continuously adapts a pre-trained diffusion model to a video stream. Since fine-tuning the parameters of a large diffusion model is too expensive, we refine the diffusion noise during inference while keeping the model parameters frozen, allowing the model to adaptively determine suitable sampling noise. We term the approach Sequence Adaptive Video Prediction with Diffusion Noise Optimization (SAVi-DNO). To validate our approach, we introduce a new evaluation setting on the Ego4D dataset, focusing on simultaneous adaptation and evaluation on long continuous videos. Empirical results demonstrate improved performance based on FVD, SSIM, and PSNR metrics on long videos of Ego4D and OpenDV-YouTube, as well as videos of UCF-101 and SkyTimelapse, showcasing SAVi-DNO’s effectiveness.

[280] DiffSeg30k: A Multi-Turn Diffusion Editing Benchmark for Localized AIGC Detection

Hai Ci, Ziheng Peng, Pei Yang, Yingxin Xuan, Mike Zheng Shou

Main category: cs.CV

TL;DR: DiffSeg30k is a 30k-image dataset with pixel-level annotations for detecting and localizing diffusion-based image edits, shifting AIGC detection from binary classification to semantic segmentation.

Details

Motivation: Existing AIGC detection benchmarks focus on classifying entire images but overlook localization of diffusion-based edits, which enables realistic modification of local image regions making AI-generated content harder to detect.

Method: Created DiffSeg30k dataset featuring: 1) In-the-wild images from COCO, 2) Diverse diffusion models (8 SOTA models), 3) Multi-turn editing (up to 3 sequential edits), 4) VLM-based pipeline for automatic region identification and context-aware prompts covering additions, removals, and attribute changes.

Result: Segmentation models trained on DiffSeg30k outperform established forgery classifiers in whole-image classification of diffusion edits and show strong cross-generator generalization. However, significant challenges remain in semantic segmentation tasks, particularly regarding robustness to image distortions.

Conclusion: DiffSeg30k advances fine-grained localization of AI-generated content by demonstrating the promise and limitations of segmentation-based methods, enabling simultaneous localization of edits and identification of editing models.

Abstract: Diffusion-based editing enables realistic modification of local image regions, making AI-generated content harder to detect. Existing AIGC detection benchmarks focus on classifying entire images, overlooking the localization of diffusion-based edits. We introduce DiffSeg30k, a publicly available dataset of 30k diffusion-edited images with pixel-level annotations, designed to support fine-grained detection. DiffSeg30k features: 1) In-the-wild images–we collect images or image prompts from COCO to reflect real-world content diversity; 2) Diverse diffusion models–local edits using eight SOTA diffusion models; 3) Multi-turn editing–each image undergoes up to three sequential edits to mimic real-world sequential editing; and 4) Realistic editing scenarios–a vision-language model (VLM)-based pipeline automatically identifies meaningful regions and generates context-aware prompts covering additions, removals, and attribute changes. DiffSeg30k shifts AIGC detection from binary classification to semantic segmentation, enabling simultaneous localization of edits and identification of the editing models. We benchmark three baseline segmentation approaches, revealing significant challenges in semantic segmentation tasks, particularly concerning robustness to image distortions. Experiments also reveal that segmentation models, despite being trained for pixel-level localization, emerge as highly reliable whole-image classifiers of diffusion edits, outperforming established forgery classifiers while showing great potential in cross-generator generalization. We believe DiffSeg30k will advance research in fine-grained localization of AI-generated content by demonstrating the promise and limitations of segmentation-based methods. DiffSeg30k is released at: https://huggingface.co/datasets/Chaos2629/Diffseg30k

[281] ReMatch: Boosting Representation through Matching for Multimodal Retrieval

Qianying Liu, Xiao Liang, Zhiqiang Zhang, Zhongfei Qing, Fengfan Zhou, Yibo Chen, Xu Tang, Yao Hu, Paul Henderson

Main category: cs.CV

TL;DR: ReMatch is a framework that uses MLLMs for multimodal retrieval by training them end-to-end with a generative matching stage, achieving state-of-the-art results on MMEB with strong zero-shot generalization.

Details

Motivation: Previous approaches underutilized MLLMs' generative nature, compositional reasoning, and world knowledge by treating them as simple encoders.

Method: End-to-end training of embedding MLLM with chat-style generative matching stage using multi-view inputs; multiple learnable tokens for richer embeddings; instance-wise discrimination supervision complementing contrastive loss.

Result: Achieved new state-of-the-art on Massive Multimodal Embedding Benchmark (MMEB) with particularly strong zero-shot generalization on five datasets.

Conclusion: ReMatch demonstrates robustness and transferability by effectively leveraging MLLMs’ generative capabilities for multimodal retrieval.

Abstract: We present ReMatch, a framework that leverages the generative strength of MLLMs for multimodal retrieval. Previous approaches treated an MLLM as a simple encoder, ignoring its generative nature, and under-utilising its compositional reasoning and world knowledge. We instead train the embedding MLLM end-to-end with a chat-style generative matching stage. The matching stage uses the same MLLM to autoregressively decide relevance from multi-view inputs, including both raw data and its own projected embeddings for each query and document. It provides instance-wise discrimination supervision that complements a standard contrastive loss, offering stronger gradients on hard negatives and preserving the compositional strengths of the original MLLM. To obtain semantically richer multimodal embeddings, we use multiple learnable tokens to augment each input, generating fine-grained contextual, mutually orthogonal embeddings with low inference cost. Leveraging our established high-performance baseline,we assemble the ideas mentioned above into a powerful training recipe and achieve a new state-of-the-art on the Massive Multimodal Embedding Benchmark (MMEB). Our experiments show particularly strong zero-shot generalization results on five datasets, highlighting the robustness and transferability of ReMatch.

[282] Connecting the Dots: Training-Free Visual Grounding via Agentic Reasoning

Liqin Luo, Guangyao Chen, Xiawu Zheng, Yongxing Dai, Yixiong Zou, Yonghong Tian

Main category: cs.CV

TL;DR: GroundingAgent is a zero-shot visual grounding framework that uses iterative reasoning with pretrained models to link text queries to image regions without task-specific fine-tuning, achieving 65.1% accuracy on benchmarks.

Details

Motivation: Existing visual grounding methods require extensive task-specific annotations and fine-tuning, limiting their generalization to novel scenarios. The authors aim to create a framework that can perform visual grounding without fine-tuning.

Method: Uses iterative reasoning mechanism combining pretrained open-vocabulary object detectors, multimodal LLMs, and LLMs to progressively refine candidate regions through joint semantic and spatial analyses.

Result: Achieves 65.1% zero-shot grounding accuracy on RefCOCO benchmarks without fine-tuning. With MLLM-generated captions replaced by original queries, selection accuracy reaches ~90%, matching supervised performance.

Conclusion: GroundingAgent demonstrates strong zero-shot visual grounding capabilities and interpretability, highlighting the importance of LLM reasoning in bridging the gap between zero-shot and supervised methods.

Abstract: Visual grounding, the task of linking textual queries to specific regions within images, plays a pivotal role in vision-language integration. Existing methods typically rely on extensive task-specific annotations and fine-tuning, limiting their ability to generalize effectively to novel or out-of-distribution scenarios. To address these limitations, we introduce GroundingAgent, a novel agentic visual grounding framework that operates without any task-specific fine-tuning. GroundingAgent employs a structured, iterative reasoning mechanism that integrates pretrained open-vocabulary object detectors, multimodal large language models (MLLMs), and large language models (LLMs) to progressively refine candidate regions through joint semantic and spatial analyses. Remarkably, GroundingAgent achieves an average zero-shot grounding accuracy of 65.1 % on widely-used benchmarks (RefCOCO, RefCOCO+, RefCOCOg), entirely without fine-tuning. Furthermore, by substituting MLLM-generated captions with the original query texts, the accuracy at the selection stage alone reaches approximately 90 %, closely matching supervised performance and underscoring the critical role of LLM reasoning capabilities. GroundingAgent also offers strong interpretability, transparently illustrating each reasoning step and providing clear insights into its decision-making process.

[283] Vision-Language Enhanced Foundation Model for Semi-supervised Medical Image Segmentation

Jiaqi Guo, Mingzhen Li, Hanyu Su, Santiago López, Lexiaozi Fan, Daniel Kim, Aggelos Katsaggelos

Main category: cs.CV

TL;DR: VESSA integrates vision-language models into semi-supervised medical image segmentation, using a two-stage approach with visual feature matching and dynamic interaction between VLM and student model to improve accuracy with limited annotations.

Details

Motivation: To reduce reliance on extensive expert annotations in medical image segmentation by combining the generalization capabilities of vision-language models with semi-supervised learning frameworks.

Method: Two-stage approach: Stage 1 trains VESSA as reference-guided segmentation assistant using template bank; Stage 2 integrates VESSA into SSL framework for dynamic interaction between VLM and student model, where refined student predictions feed back to VESSA to generate better pseudo-labels.

Result: Significantly enhances segmentation accuracy across multiple datasets and domains, outperforming state-of-the-art baselines under extremely limited annotation conditions.

Conclusion: VESSA successfully incorporates foundation-level visual-semantic understanding into SSL frameworks, demonstrating that VLM-enhanced semi-supervised learning can effectively improve medical image segmentation with minimal labeled data.

Abstract: Semi-supervised learning (SSL) has emerged as an effective paradigm for medical image segmentation, reducing the reliance on extensive expert annotations. Meanwhile, vision-language models (VLMs) have demonstrated strong generalization and few-shot capabilities across diverse visual domains. In this work, we integrate VLM-based segmentation into semi-supervised medical image segmentation by introducing a Vision-Language Enhanced Semi-supervised Segmentation Assistant (VESSA) that incorporates foundation-level visual-semantic understanding into SSL frameworks. Our approach consists of two stages. In Stage 1, the VLM-enhanced segmentation foundation model VESSA is trained as a reference-guided segmentation assistant using a template bank containing gold-standard exemplars, simulating learning from limited labeled data. Given an input-template pair, VESSA performs visual feature matching to extract representative semantic and spatial cues from exemplar segmentations, generating structured prompts for a SAM2-inspired mask decoder to produce segmentation masks. In Stage 2, VESSA is integrated into a state-of-the-art SSL framework, enabling dynamic interaction with the student model: as student predictions become more refined, they are fed back to VESSA as prompts, allowing it to generate higher-quality pseudo-labels and stronger guidance. Extensive experiments across multiple segmentation datasets and domains show that VESSA-augmented SSL significantly enhances segmentation accuracy, outperforming state-of-the-art baselines under extremely limited annotation conditions.

[284] Scale Where It Matters: Training-Free Localized Scaling for Diffusion Models

Qin Ren, Yufei Wang, Lanqing Guo, Wen Zhang, Zhiwen Fan, Chenyu You

Main category: cs.CV

TL;DR: LoTTS introduces localized test-time scaling for diffusion models, adaptively resampling only defective image regions to improve quality while reducing computation by 2-4x compared to full-image methods.

Details

Motivation: Existing test-time scaling methods operate at full-image level, wasting computation on satisfactory regions while insufficiently correcting localized defects, leading to inefficient resource usage.

Method: LoTTS uses cross- and self-attention contrast under quality-aware prompts to identify defective regions, then refines them into coherent masks and performs localized denoising only on defective areas while preserving good regions.

Result: LoTTS achieves state-of-the-art performance on SD2.1, SDXL, and FLUX models, consistently improving both local quality and global fidelity while reducing GPU cost by 2-4x compared to Best-of-N sampling.

Conclusion: Localized test-time scaling represents a promising new direction for efficient inference-time scaling of diffusion models, enabling targeted quality improvements with significantly reduced computational overhead.

Abstract: Diffusion models have become the dominant paradigm in text-to-image generation, and test-time scaling (TTS) further improves quality by allocating more computation during inference. However, existing TTS methods operate at the full-image level, overlooking the fact that image quality is often spatially heterogeneous. This leads to unnecessary computation on already satisfactory regions and insufficient correction of localized defects. In this paper, we explore a new direction - Localized TTS - that adaptively resamples defective regions while preserving high-quality regions, thereby substantially reducing the search space. This paradigm poses two central challenges: accurately localizing defects and maintaining global consistency. We propose LoTTS, the first fully training-free framework for localized TTS. For defect localization, LoTTS contrasts cross- and self-attention signals under quality-aware prompts (e.g., high-quality vs. low-quality) to identify defective regions, and then refines them into coherent masks. For consistency, LoTTS perturbs only defective regions and denoises them locally, ensuring that corrections remain confined while the rest of the image remains undisturbed. Extensive experiments on SD2.1, SDXL, and FLUX demonstrate that LoTTS achieves state-of-the-art performance: it consistently improves both local quality and global fidelity, while reducing GPU cost by 2-4x compared to Best-of-N sampling. These findings establish localized TTS as a promising new direction for scaling diffusion models at inference time.

[285] GFT-GCN: Privacy-Preserving 3D Face Mesh Recognition with Spectral Diffusion

Hichem Felouat, Hanrui Wang, Isao Echizen

Main category: cs.CV

TL;DR: GFT-GCN is a privacy-preserving 3D face recognition framework that uses spectral graph learning and diffusion-based template protection to secure biometric data while maintaining high recognition accuracy.

Details

Motivation: 3D face recognition provides robustness against illumination, pose changes, and spoofing attacks, but protecting stored biometric templates is crucial for high-security applications.

Method: Combines Graph Fourier Transform (GFT) and Graph Convolutional Networks (GCN) to extract spectral features from 3D face meshes, with spectral diffusion mechanism for template protection in a client-server architecture.

Result: Experiments on BU-3DFE and FaceScape datasets show high recognition accuracy and strong resistance to reconstruction attacks.

Conclusion: GFT-GCN effectively balances privacy and performance, offering a practical solution for secure 3D face authentication.

Abstract: 3D face recognition offers a robust biometric solution by capturing facial geometry, providing resilience to variations in illumination, pose changes, and presentation attacks. Its strong spoof resistance makes it suitable for high-security applications, but protecting stored biometric templates remains critical. We present GFT-GCN, a privacy-preserving 3D face recognition framework that combines spectral graph learning with diffusion-based template protection. Our approach integrates the Graph Fourier Transform (GFT) and Graph Convolutional Networks (GCN) to extract compact, discriminative spectral features from 3D face meshes. To secure these features, we introduce a spectral diffusion mechanism that produces irreversible, renewable, and unlinkable templates. A lightweight client-server architecture ensures that raw biometric data never leaves the client device. Experiments on the BU-3DFE and FaceScape datasets demonstrate high recognition accuracy and strong resistance to reconstruction attacks. Results show that GFT-GCN effectively balances privacy and performance, offering a practical solution for secure 3D face authentication.

[286] Restora-Flow: Mask-Guided Image Restoration with Flow Matching

Arnela Hadzic, Franz Thaler, Lea Bogensperger, Simon Johannes Joham, Martin Urschler

Main category: cs.CV

TL;DR: Restora-Flow is a training-free flow matching method for image restoration that uses degradation masks and trajectory correction to achieve faster, higher-quality results than diffusion models.

Details

Motivation: Flow matching offers faster sampling than diffusion models but current methods still have long processing times or produce over-smoothed results in restoration tasks.

Method: Uses degradation mask guidance and trajectory correction mechanism to enforce consistency with degraded inputs during flow matching sampling, without requiring additional training.

Result: Shows superior perceptual quality and faster processing time compared to diffusion and flow matching-based methods on natural and medical datasets for inpainting, super-resolution and denoising.

Conclusion: Restora-Flow effectively addresses speed and quality limitations in flow-based image restoration through mask guidance and trajectory correction, making it suitable for practical applications.

Abstract: Flow matching has emerged as a promising generative approach that addresses the lengthy sampling times associated with state-of-the-art diffusion models and enables a more flexible trajectory design, while maintaining high-quality image generation. This capability makes it suitable as a generative prior for image restoration tasks. Although current methods leveraging flow models have shown promising results in restoration, some still suffer from long processing times or produce over-smoothed results. To address these challenges, we introduce Restora-Flow, a training-free method that guides flow matching sampling by a degradation mask and incorporates a trajectory correction mechanism to enforce consistency with degraded inputs. We evaluate our approach on both natural and medical datasets across several image restoration tasks involving a mask-based degradation, i.e., inpainting, super-resolution and denoising. We show superior perceptual quality and processing time compared to diffusion and flow matching-based reference methods.

[287] SKEL-CF: Coarse-to-Fine Biomechanical Skeleton and Surface Mesh Recovery

Da Li, Jiping Jin, Xuanlong Yu, Wei Liu, Xiaodong Cun, Kai Chen, Rui Fan, Jiangang Kong, Xi Shen

Main category: cs.CV

TL;DR: SKEL-CF is a coarse-to-fine framework that improves SKEL parameter estimation for anatomically accurate 3D human modeling, addressing challenges like limited training data and perspective ambiguities through transformer-based refinement and explicit camera modeling.

Details

Motivation: Parametric 3D human models like SMPL lack biomechanical realism due to simplified kinematics, while the more anatomically accurate SKEL model faces estimation challenges from limited data, perspective ambiguities, and complex human articulation.

Method: Proposes SKEL-CF with transformer-based encoder-decoder architecture: encoder predicts coarse camera/SKEL parameters, decoder progressively refines them. Creates 4DHuman-SKEL dataset from SMPL data for training, and explicitly incorporates camera modeling to address depth/scale ambiguities.

Result: Achieves 85.0 MPJPE / 51.4 PA-MPJPE on MOYO dataset, significantly outperforming previous SKEL-based state-of-the-art HSMR (104.5 / 79.6). Demonstrates effectiveness across diverse viewpoints.

Conclusion: SKEL-CF establishes a scalable and anatomically faithful framework for human motion analysis, bridging computer vision and biomechanics, with implementation available for public use.

Abstract: Parametric 3D human models such as SMPL have driven significant advances in human pose and shape estimation, yet their simplified kinematics limit biomechanical realism. The recently proposed SKEL model addresses this limitation by re-rigging SMPL with an anatomically accurate skeleton. However, estimating SKEL parameters directly remains challenging due to limited training data, perspective ambiguities, and the inherent complexity of human articulation. We introduce SKEL-CF, a coarse-to-fine framework for SKEL parameter estimation. SKEL-CF employs a transformer-based encoder-decoder architecture, where the encoder predicts coarse camera and SKEL parameters, and the decoder progressively refines them in successive layers. To ensure anatomically consistent supervision, we convert the existing SMPL-based dataset 4DHuman into a SKEL-aligned version, 4DHuman-SKEL, providing high-quality training data for SKEL estimation. In addition, to mitigate depth and scale ambiguities, we explicitly incorporate camera modeling into the SKEL-CF pipeline and demonstrate its importance across diverse viewpoints. Extensive experiments validate the effectiveness of the proposed design. On the challenging MOYO dataset, SKEL-CF achieves 85.0 MPJPE / 51.4 PA-MPJPE, significantly outperforming the previous SKEL-based state-of-the-art HSMR (104.5 / 79.6). These results establish SKEL-CF as a scalable and anatomically faithful framework for human motion analysis, bridging the gap between computer vision and biomechanics. Our implementation is available on the project page: https://pokerman8.github.io/SKEL-CF/.

[288] CrossEarth-Gate: Fisher-Guided Adaptive Tuning Engine for Efficient Adaptation of Cross-Domain Remote Sensing Semantic Segmentation

Shilei Cao, Ziyang Gong, Hehai Lin, Yang Liu, Jiashun Cheng, Xiaoxing Hu, Haoyuan Liang, Guowen Li, Chengwei Qin, Hong Cheng, Xue Yang, Juepeng Zheng, Haohuan Fu

Main category: cs.CV

TL;DR: CrossEarth-Gate is a Parameter-Efficient Fine-Tuning method for remote sensing that uses a Fisher-guided adaptive selection mechanism to dynamically activate spatial, semantic, and frequency modules to handle multifaceted domain gaps in Earth observation data.

Details

Motivation: Existing PEFT methods fail on large-scale Earth observation tasks because they cannot fully handle the multifaceted and unpredictable domain gaps (spatial, semantic, and frequency shifts) inherent in remote sensing data.

Method: Establishes a comprehensive RS module toolbox with spatial, semantic, and frequency modules, and develops a Fisher-guided adaptive selection mechanism that quantifies each module’s importance using Fisher Information to dynamically activate only the most critical modules at appropriate layers.

Result: Achieves state-of-the-art performance across 16 cross-domain benchmarks for RS semantic segmentation, demonstrating efficacy and generalizability.

Conclusion: CrossEarth-Gate effectively addresses multifaceted domain gaps in remote sensing through adaptive module selection, providing superior adaptation effectiveness and efficiency for large-scale Earth observation tasks.

Abstract: In Remote Sensing (RS), Parameter-Efficient Fine-Tuning (PEFT) has emerged as a key approach to activate the generalizable representation ability of foundation models for downstream tasks. However, existing specialized PEFT methods often fail when applied to large-scale Earth observation tasks, as they are unable to fully handle the multifaceted and unpredictable domain gaps (\eg, spatial, semantic, and frequency shifts) inherent in RS data. To overcome this, we propose CrossEarth-Gate, which introduces two primary contributions. First, we establish a comprehensive RS module toolbox to address multifaceted domain gaps, comprising spatial, semantic, and frequency modules. Second, we develop a Fisher-guided adaptive selection mechanism that operates on this toolbox. This selection is guided by Fisher Information to quantify each module’s importance by measuring its contribution to the task-specific gradient flow. It dynamically activates only the most critical modules at the appropriate layers, guiding the gradient flow to maximize adaptation effectiveness and efficiency. Comprehensive experiments validate the efficacy and generalizability of our method, where CrossEarth-Gate achieves state-of-the-art performance across 16 cross-domain benchmarks for RS semantic segmentation. The code of the work will be released.

[289] Thinking in 360°: Humanoid Visual Search in the Wild

Heyang Yu, Yinan Han, Xiangyu Zhang, Baiqiao Yin, Bowen Chang, Xiangyu Han, Xinhao Liu, Jing Zhang, Marco Pavone, Chen Feng, Saining Xie, Yiming Li

Main category: cs.CV

TL;DR: Proposes humanoid visual search using 360° panoramic images to develop embodied agents that mimic human head-eye coordination, with a new benchmark H* Bench for challenging real-world scenarios.

Details

Motivation: Prior visual search approaches are limited to static images and neglect physical embodiment and 3D world interaction. Need to develop efficient embodied visual search agents that bypass real-world hardware constraints.

Method: Humanoid agents actively rotate their head to search for objects/paths in 360° panoramic images. Uses post-training techniques to enhance open-source Qwen2.5-VL model.

Result: Top proprietary models achieve only ~30% success. Enhanced Qwen2.5-VL increased success rate over threefold: object search from 14.83% to 47.38%, path search from 6.44% to 24.94%. Path search is inherently more difficult.

Conclusion: Shows promising path forward but quantifies immense challenge in building MLLM agents for seamless integration into everyday human life, with path search revealing need for sophisticated spatial commonsense.

Abstract: Humans rely on the synergistic control of head (cephalomotor) and eye (oculomotor) to efficiently search for visual information in 360°. However, prior approaches to visual search are limited to a static image, neglecting the physical embodiment and its interaction with the 3D world. How can we develop embodied visual search agents as efficient as humans while bypassing the constraints imposed by real-world hardware? To this end, we propose humanoid visual search where a humanoid agent actively rotates its head to search for objects or paths in an immersive world represented by a 360° panoramic image. To study visual search in visually-crowded real-world scenarios, we build H* Bench, a new benchmark that moves beyond household scenes to challenging in-the-wild scenes that necessitate advanced visual-spatial reasoning capabilities, such as transportation hubs, large-scale retail spaces, urban streets, and public institutions. Our experiments first reveal that even top-tier proprietary models falter, achieving only ~30% success in object and path search. We then use post-training techniques to enhance the open-source Qwen2.5-VL, increasing its success rate by over threefold for both object search (14.83% to 47.38%) and path search (6.44% to 24.94%). Notably, the lower ceiling of path search reveals its inherent difficulty, which we attribute to the demand for sophisticated spatial commonsense. Our results not only show a promising path forward but also quantify the immense challenge that remains in building MLLM agents that can be seamlessly integrated into everyday human life.

[290] VGGTFace: Topologically Consistent Facial Geometry Reconstruction in the Wild

Xin Ming, Yuxuan Han, Tianyu Huang, Feng Xu

Main category: cs.CV

TL;DR: VGGTFace is an automatic method that uses the 3D foundation model VGGT to reconstruct topologically consistent facial geometry from in-the-wild multi-view images, achieving high-quality results in 10 seconds.

Details

Motivation: Existing facial reconstruction methods require manual effort, lack generalization to real-world data, or are limited by 3D Morphable Models' expressiveness.

Method: Augments VGGT with Pixel3DMM to inject topology information via pixel-aligned UV values, then uses Topology-Aware Bundle Adjustment with Laplacian energy to fuse point clouds with known topology.

Result: Achieves state-of-the-art results on benchmarks with impressive generalization to in-the-wild data, processing 16 views in 10 seconds on a single RTX 4090.

Conclusion: VGGTFace successfully leverages 3D foundation models for automatic, high-quality facial reconstruction with strong generalization capabilities.

Abstract: Reconstructing topologically consistent facial geometry is crucial for the digital avatar creation pipelines. Existing methods either require tedious manual efforts, lack generalization to in-the-wild data, or are constrained by the limited expressiveness of 3D Morphable Models. To address these limitations, we propose VGGTFace, an automatic approach that innovatively applies the 3D foundation model, i.e. VGGT, for topologically consistent facial geometry reconstruction from in-the-wild multi-view images captured by everyday users. Our key insight is that, by leveraging VGGT, our method naturally inherits strong generalization ability and expressive power from its large-scale training and point map representation. However, it is unclear how to reconstruct a topologically consistent mesh from VGGT, as the topology information is missing in its prediction. To this end, we augment VGGT with Pixel3DMM for injecting topology information via pixel-aligned UV values. In this manner, we convert the pixel-aligned point map of VGGT to a point cloud with topology. Tailored to this point cloud with known topology, we propose a novel Topology-Aware Bundle Adjustment strategy to fuse them, where we construct a Laplacian energy for the Bundle Adjustment objective. Our method achieves high-quality reconstruction in 10 seconds for 16 views on a single NVIDIA RTX 4090. Experiments demonstrate state-of-the-art results on benchmarks and impressive generalization to in-the-wild data. Code is available at https://github.com/grignarder/vggtface.

[291] BRIC: Bridging Kinematic Plans and Physical Control at Test Time

Dohun Lim, Minji Kim, Jaewoon Lim, Sungchan Kim

Main category: cs.CV

TL;DR: BRIC is a test-time adaptation framework that resolves execution discrepancies between diffusion-based motion planners and physics controllers for long-term human motion generation, ensuring physically plausible executions.

Details

Motivation: Diffusion models can generate diverse motions but often produce physically implausible outputs, leading to execution drift during simulation when combined with physics controllers.

Method: BRIC dynamically adapts the physics controller to noisy motion plans at test time while preserving pre-trained skills, and introduces lightweight test-time guidance to steer the diffusion model without updating its parameters.

Result: BRIC achieves state-of-the-art performance on long-term tasks including motion composition, obstacle avoidance, and human-scene interaction across diverse environments.

Conclusion: BRIC effectively combines adaptation strategies to ensure consistent and physically plausible long-term human motion generation in an efficient manner.

Abstract: We propose BRIC, a novel test-time adaptation (TTA) framework that enables long-term human motion generation by resolving execution discrepancies between diffusion-based kinematic motion planners and reinforcement learning-based physics controllers. While diffusion models can generate diverse and expressive motions conditioned on text and scene context, they often produce physically implausible outputs, leading to execution drift during simulation. To address this, BRIC dynamically adapts the physics controller to noisy motion plans at test time, while preserving pre-trained skills via a loss function that mitigates catastrophic forgetting. In addition, BRIC introduces a lightweight test-time guidance mechanism that steers the diffusion model in the signal space without updating its parameters. By combining both adaptation strategies, BRIC ensures consistent and physically plausible long-term executions across diverse environments in an effective and efficient manner. We validate the effectiveness of BRIC on a variety of long-term tasks, including motion composition, obstacle avoidance, and human-scene interaction, achieving state-of-the-art performance across all tasks.

[292] STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flows

Jiatao Gu, Ying Shen, Tianrong Chen, Laurent Dinh, Yuyang Wang, Miguel Angel Bautista, David Berthelot, Josh Susskind, Shuangfei Zhai

Main category: cs.CV

TL;DR: STARFlow-V is a normalizing flow-based video generator that achieves high-quality autoregressive video generation with practical sampling efficiency, competing with diffusion-based models while offering benefits like end-to-end learning and native likelihood estimation.

Details

Motivation: To challenge the dominance of diffusion-based models in video generation by revisiting normalizing flows, which offer advantages like end-to-end learning, robust causal prediction, and native likelihood estimation but have been largely overlooked in the video domain due to spatiotemporal complexity.

Method: Builds upon STARFlow with a spatiotemporal latent space using global-local architecture that restricts causal dependencies to global latent space while preserving local within-frame interactions. Introduces flow-score matching for causal denoising and video-aware Jacobi iteration for efficient parallelizable sampling without breaking causality.

Result: Achieves strong visual fidelity and temporal consistency with practical sampling throughput comparable to diffusion-based baselines. Supports multiple generation tasks including text-to-video, image-to-video, and video-to-video generation natively through invertible structure.

Conclusion: Presents the first evidence that normalizing flows are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models and challenging the current dominance of diffusion-based approaches in video generation.

Abstract: Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems almost exclusively rely on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction, and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in the spatiotemporal latent space with a global-local architecture which restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive diffusion model generation. Additionally, we propose flow-score matching, which equips the model with a light-weight causal denoiser to improve the video generation consistency in an autoregressive fashion. To improve the sampling efficiency, STARFlow-V employs a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to the invertible structure, the same model can natively support text-to-video, image-to-video as well as video-to-video generation tasks. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models. Code and generated samples are available at https://github.com/apple/ml-starflow.

cs.AI

[293] Minimizing Hyperbolic Embedding Distortion with LLM-Guided Hierarchy Restructuring

Melika Ayoughi, Pascal Mettes, Paul Groth

Main category: cs.AI

TL;DR: LLMs can automatically restructure hierarchies to improve hyperbolic embedding quality by increasing branching factor and enforcing single inheritance, leading to better embedding performance across diverse datasets.

Details

Motivation: Hyperbolic embeddings work best with hierarchies having high branching factor and single inheritance, but real-world hierarchies often don't meet these criteria. The paper aims to use LLMs to automatically restructure existing hierarchies to optimize them for hyperbolic embeddings.

Method: Proposed a prompt-based approach using Large Language Models to transform existing hierarchies, guided by known desiderata for hyperbolic embeddings (high branching factor, single inheritance). Tested on 16 diverse hierarchies.

Result: LLM-restructured hierarchies consistently yielded higher-quality hyperbolic embeddings across several standard embedding quality metrics. The approach also enables explainable reorganizations with justifications.

Conclusion: LLMs can effectively restructure hierarchies to meet hyperbolic embedding desiderata, improving embedding quality while providing explainable transformations that assist knowledge engineers.

Abstract: Hyperbolic geometry is an effective geometry for embedding hierarchical data structures. Hyperbolic learning has therefore become increasingly prominent in machine learning applications where data is hierarchically organized or governed by hierarchical semantics, ranging from recommendation systems to computer vision. The quality of hyperbolic embeddings is tightly coupled to the structure of the input hierarchy, which is often derived from knowledge graphs or ontologies. Recent work has uncovered that for an optimal hyperbolic embedding, a high branching factor and single inheritance are key, while embedding algorithms are robust to imbalance and hierarchy size. To assist knowledge engineers in reorganizing hierarchical knowledge, this paper investigates whether Large Language Models (LLMs) have the ability to automatically restructure hierarchies to meet these criteria. We propose a prompt-based approach to transform existing hierarchies using LLMs, guided by known desiderata for hyperbolic embeddings. Experiments on 16 diverse hierarchies show that LLM-restructured hierarchies consistently yield higher-quality hyperbolic embeddings across several standard embedding quality metrics. Moreover, we show how LLM-guided hierarchy restructuring enables explainable reorganizations, providing justifications to knowledge engineers.

[294] AssurAI: Experience with Constructing Korean Socio-cultural Datasets to Discover Potential Risks of Generative AI

Chae-Gyun Lim, Seung-Ho Han, EunYoung Byun, Jeongyun Han, Soohyun Cho, Eojin Joo, Heehyeon Kim, Sieun Kim, Juhoon Lee, Hyunsoo Lee, Dongkun Lee, Jonghwan Hyeon, Yechan Hwang, Young-Jun Lee, Kyeongryul Lee, Minhyeong An, Hyunjun Ahn, Jeongwoo Son, Junho Park, Donggyu Yoon, Taehyung Kim, Jeemin Kim, Dasom Choi, Kwangyoung Lee, Hyunseung Lim, Yeohyun Jung, Jongok Hong, Sooyohn Nam, Joonyoung Park, Sungmin Na, Yubin Choi, Jeanne Choi, Yoojin Hong, Sueun Jang, Youngseok Seo, Somin Park, Seoungung Jo, Wonhye Chae, Yeeun Jo, Eunyoung Kim, Joyce Jiyoung Whang, HwaJung Hong, Joseph Seering, Uichin Lee, Juho Kim, Sunna Choi, Seokyeon Ko, Taeho Kim, Kyunghoon Kim, Myungsik Ha, So Jung Lee, Jemin Hwang, JoonHo Kwak, Ho-Jin Choi

Main category: cs.AI

TL;DR: AssurAI is a Korean multimodal safety evaluation dataset addressing gaps in non-English AI safety testing, featuring 35 risk factors and 11,480 instances across text, image, video, and audio with rigorous quality control.

Details

Motivation: Current safety datasets are English-centric and text-only, failing to capture Korean socio-cultural risks and multimodal safety concerns in generative AI.

Method: Defined 35 AI risk factors through expert adaptation, built multimodal Korean dataset using two-phase construction (expert seeding + crowdsourcing), triple annotation, and iterative red-teaming for quality control.

Result: Created AssurAI with 11,480 Korean multimodal instances validated through pilot study showing effectiveness in assessing LLM safety.

Conclusion: AssurAI enables safer Korean generative AI development and is publicly released to support the Korean community.

Abstract: The rapid evolution of generative AI necessitates robust safety evaluations. However, current safety datasets are predominantly English-centric, failing to capture specific risks in non-English, socio-cultural contexts such as Korean, and are often limited to the text modality. To address this gap, we introduce AssurAI, a new quality-controlled Korean multimodal dataset for evaluating the safety of generative AI. First, we define a taxonomy of 35 distinct AI risk factors, adapted from established frameworks by a multidisciplinary expert group to cover both universal harms and relevance to the Korean socio-cultural context. Second, leveraging this taxonomy, we construct and release AssurAI, a large-scale Korean multimodal dataset comprising 11,480 instances across text, image, video, and audio. Third, we apply the rigorous quality control process used to ensure data integrity, featuring a two-phase construction (i.e., expert-led seeding and crowdsourced scaling), triple independent annotation, and an iterative expert red-teaming loop. Our pilot study validates AssurAI’s effectiveness in assessing the safety of recent LLMs. We release AssurAI to the public to facilitate the development of safer and more reliable generative AI systems for the Korean community.

[295] $A^2Flow:$ Automating Agentic Workflow Generation via Self-Adaptive Abstraction Operators

Mingming Zhao, Xiaokang Wei, Yuanqi Shao, Kaiwen Zhou, Lin Yang, Siwei Rao, Junhui Zhan, Zhitang Chen

Main category: cs.AI

TL;DR: A²Flow is a fully automated framework for generating agentic workflows using self-adaptive abstraction operators, eliminating the need for manual operator predefinition and achieving significant performance improvements over state-of-the-art methods.

Details

Motivation: Existing LLM-based agentic workflow design methods heavily rely on manually predefined operators, which limits generalization and scalability. There is a need for fully automated approaches that can generate reusable operators without human intervention.

Method: Three-stage operator extraction: 1) Case-based initial operator generation using expert demonstrations and LLM reasoning, 2) Operator clustering and preliminary abstraction by grouping similar operators across tasks, 3) Deep extraction for abstract execution operators using long chain-of-thought prompting and multi-path reasoning. Enhanced with operator memory mechanism for workflow search.

Result: Achieves 2.4% and 19.3% average performance improvement on general and embodied benchmarks respectively, while reducing resource usage by 37% compared to state-of-the-art baselines.

Conclusion: A²Flow demonstrates that fully automated agentic workflow generation with self-adaptive abstraction operators is feasible and effective, providing reusable building blocks without manual predefinition while significantly improving performance and efficiency.

Abstract: Large language models (LLMs) have shown strong potential in automating the design of agentic workflows. However, existing methods still rely heavily on manually predefined operators, limiting generalization and scalability. To address this issue, we propose $A^2Flow$, a fully automated framework for agentic workflow generation based on self-adaptive abstraction operators. $A^2Flow$ employs a three-stage operator extraction process: 1) Case-based Initial Operator Generation: leveraging expert demonstrations and LLM reasoning to generate case-specific operators; 2) Operator Clustering and Preliminary Abstraction: grouping similar operators across tasks to form preliminary abstractions; and 3) Deep Extraction for Abstract Execution Operators: applying long chain-of-thought prompting and multi-path reasoning to derive compact and generalizable execution operators. These operators serve as reusable building blocks for workflow construction without manual predefinition. Furthermore, we enhance node-level workflow search with an operator memory mechanism, which retains historical outputs to enrich context and improve decision-making. Experiments on general and embodied benchmarks show that $A^2Flow$ achieves a 2.4% and 19.3% average performance improvement and reduces resource usage by 37% over state-of-the-art baselines. Homepage:https://github.com/pandawei-ele/A2FLOW

[296] Reasoning With a Star: A Heliophysics Dataset and Benchmark for Agentic Scientific Reasoning

Kevin Lee, Russell Spiewak, James Walsh

Main category: cs.AI

TL;DR: The paper introduces Reasoning With a Star, a heliophysics reasoning dataset, and benchmarks various approaches including multi-agent patterns, finding that decomposed workflows outperform direct prompting for deductive reasoning tasks.

Details

Motivation: To address the challenges in scientific reasoning through LLMs in heliophysics, which requires incorporating physical assumptions, maintaining consistent units, and providing clear scientific formats.

Method: Created a heliophysics dataset from NASA/UCAR Living With a Star summer school problem sets with question-answer structure, then benchmarked single-shot baseline and four multi-agent patterns using programmatic grading with unit-aware tolerance and symbolic equivalence.

Result: Decomposing workflows through systems engineering principles outperforms direct prompting on problems requiring deductive reasoning rather than pure inductive recall.

Conclusion: Multi-agent patterns and structured workflows are more effective for complex scientific reasoning tasks in heliophysics compared to simple prompting approaches.

Abstract: Scientific reasoning through Large Language Models in heliophysics involves more than just recalling facts: it requires incorporating physical assumptions, maintaining consistent units, and providing clear scientific formats through coordinated approaches. To address these challenges, we present Reasoning With a Star, a newly contributed heliophysics dataset applicable to reasoning; we also provide an initial benchmarking approach. Our data are constructed from National Aeronautics and Space Administration & University Corporation for Atmospheric Research Living With a Star summer school problem sets and compiled into a readily consumable question-and-answer structure with question contexts, reasoning steps, expected answer type, ground-truth targets, format hints, and metadata. A programmatic grader checks the predictions using unit-aware numerical tolerance, symbolic equivalence, and schema validation. We benchmark a single-shot baseline and four multi-agent patterns, finding that decomposing workflows through systems engineering principles outperforms direct prompting on problems requiring deductive reasoning rather than pure inductive recall.

[297] A Brief History of Digital Twin Technology

Yunqi Zhang, Kuangyu Shi, Biao Li

Main category: cs.AI

TL;DR: Digital twin technology, originating from NASA simulations, creates dynamic virtual counterparts of physical systems using real-time data. In healthcare, it integrates imaging, biosensors, and computational models for patient-specific simulations in diagnosis, treatment planning, and drug development.

Details

Motivation: To transform healthcare from reactive treatment to predictive, preventive, and personalized medicine by leveraging digital twin technology for improved patient care and medical outcomes.

Method: Digital twins integrate imaging, biosensors, and computational models to create patient-specific simulations. Applications include cardiac digital twins for arrhythmia treatment prediction, oncology digital twins for tumor tracking and radiotherapy optimization, and pharmacological digital twins for drug discovery.

Result: Digital twins enable improved diagnosis, treatment planning, and drug development through patient-specific simulations. They help predict arrhythmia treatment outcomes, track tumor progression, optimize radiotherapy, and accelerate drug discovery.

Conclusion: While digital twin technology shows great promise in healthcare transformation, challenges like interoperability, data privacy, and model fidelity need addressing through solutions like explainable AI, federated learning, and harmonized regulatory frameworks. Future advances in multi-organ digital twins, genomics integration, and ethical governance will be essential for widespread clinical adoption.

Abstract: Emerging from NASA’s spacecraft simulations in the 1960s, digital twin technology has advanced through industrial adoption to spark a healthcare transformation. A digital twin is a dynamic, data-driven virtual counterpart of a physical system, continuously updated through real-time data streams and capable of bidirectional interaction. In medicine, digital twin integrates imaging, biosensors, and computational models to generate patient-specific simulations that support diagnosis, treatment planning, and drug development. Representative applications include cardiac digital twin for predicting arrhythmia treatment outcomes, oncology digital twin for tracking tumor progression and optimizing radiotherapy, and pharmacological digital twin for accelerating drug discovery. Despite rapid progress, major challenges, including interoperability, data privacy, and model fidelity, continue to limit widespread clinical integration. Emerging solutions such as explainable AI, federated learning, and harmonized regulatory frameworks offer promising pathways forward. Looking ahead, advances in multi-organ digital twin, genomics integration, and ethical governance will be essential to ensure that digital twin shifts healthcare from reactive treatment to predictive, preventive, and truly personalized medicine.

[298] Paraconsistent-Lib: an intuitive PAL2v algorithm Python Library

Arnaldo de Carvalho Junior, Diego Oliveira da Cruz, Bruno da Silva Alves, Fernando da Silva Paulo Junior, João Inacio da Silva Filho

Main category: cs.AI

TL;DR: Paraconsistent-Lib is an open-source Python library for building PAL2v algorithms in reasoning and decision-making systems, providing three types of analysis outputs and enabling implementation of various PAL2v algorithms with reduced complexity.

Details

Motivation: To create an easy-to-use, general-purpose library for PAL2v standard calculations that simplifies the implementation of paraconsistent algorithms and reduces code complexity and bugs.

Method: Developed as an open-source Python library that provides standard PAL2v calculations, supporting three result types: paraconsistent analysis in 12 classical lattice regions, paraconsistent analysis node outputs, and decision outputs.

Result: Successfully created Paraconsistent-Lib that enables implementation of well-known PAL2v algorithms like Para-analyzer, ParaExtrCTX, PAL2v Filter, PANnet, and PNN in stand-alone or network form, with reduced complexity and code size.

Conclusion: Paraconsistent-Lib is a stable, actively developed library that responds to user requirements and enhancements, providing a practical tool for building PAL2v-based reasoning and decision-making systems.

Abstract: This paper introduces Paraconsistent-Lib, an open-source, easy-to-use Python library for building PAL2v algorithms in reasoning and decision-making systems. Paraconsistent-Lib is designed as a general-purpose library of PAL2v standard calculations, presenting three types of results: paraconsistent analysis in one of the 12 classical lattice PAL2v regions, paraconsistent analysis node (PAN) outputs, and a decision output. With Paraconsistent-Lib, well-known PAL2v algorithms such as Para-analyzer, ParaExtrCTX, PAL2v Filter, paraconsistent analysis network (PANnet), and paraconsistent neural network (PNN) can be written in stand-alone or network form, reducing complexity, code size, and bugs, as two examples presented in this paper. Given its stable state, Paraconsistent-Lib is an active development to respond to user-required features and enhancements received on GitHub.

[299] Prune4Web: DOM Tree Pruning Programming for Web Agent

Jiayuan Zhang, Kaiquan Chen, Zhihao Lu, Enshen Zhou, Qian Yu, Jing Zhang

Main category: cs.AI

TL;DR: Prune4Web introduces programmatic DOM pruning to handle large web page structures efficiently, replacing LLM reading with executable scoring scripts for 25-50x reduction in candidate elements and achieving 88.28% accuracy in web automation.

Details

Motivation: Existing web automation approaches struggle with large DOM structures (10k-100k tokens) that require crude truncation or inefficient heuristics, risking critical information loss and poor precision-scalability balance.

Method: DOM Tree Pruning Programming where LLM generates Python scoring scripts to dynamically filter DOM elements based on semantic cues from sub-tasks, eliminating need for LLMs to ingest raw DOMs and enabling lightweight programmatic traversal.

Result: Achieves 25x to 50x reduction in candidate elements for grounding, dramatically improves accuracy from 46.8% to 88.28% on low-level grounding task, demonstrating state-of-the-art performance in web automation.

Conclusion: Prune4Web effectively addresses DOM size challenges in web automation by shifting from resource-intensive LLM reading to efficient programmatic pruning, enabling precise action localization while mitigating attention dilution through unified framework optimization.

Abstract: Web automation employs intelligent agents to execute high-level tasks by mimicking human interactions with web interfaces. Despite the capabilities of recent Large Language Model (LLM)-based web agents, navigating complex, real-world webpages efficiently remains a significant hurdle due to the prohibitively large size of Document Object Model (DOM) structures, often ranging from 10,000 to 100,000 tokens. Existing strategies typically rely on crude DOM truncation – risking the loss of critical information – or employ inefficient heuristics and separate ranking models, failing to achieve an optimal balance between precision and scalability. To address these challenges, we introduce Prune4Web, a novel paradigm that shifts DOM processing from resource-intensive LLM reading to efficient programmatic pruning. Central to our approach is DOM Tree Pruning Programming, where an LLM generates executable Python scoring scripts to dynamically filter DOM elements based on semantic cues from decomposed sub-tasks. This mechanism eliminates the need for LLMs to ingest raw, massive DOMs, instead delegating traversal and scoring to lightweight, interpretable programs. This methodology achieves a 25x to 50x reduction in candidate elements for grounding, thereby facilitating precise action localization while mitigating attention dilution. Furthermore, we propose a specialized data annotation pipeline and a two-turn dialogue training strategy that jointly optimizes the Planner, Programmatic Filter, and Grounder within a unified framework. Extensive experiments demonstrate state-of-the-art performance. Notably, on our low-level grounding task, Prune4Web dramatically improves accuracy from 46.8% to 88.28%, underscoring its efficacy in real-world web automation.

[300] Cross Domain Evaluation of Multimodal Chain-of-Thought Reasoning of different datasets into the Amazon CoT Framework

Nitya Tiwari, Parv Maheshwari, Vidisha Agarwal

Main category: cs.AI

TL;DR: Analysis of Multimodal Chain-of-Thought reasoning across diverse domains (A-OKVQA, OKVQA, ChartQA) reveals that while vision integration reduces hallucination, effectiveness varies significantly by question type, with commonsense reasoning being particularly challenging.

Details

Motivation: To evaluate the generalizability of Multimodal-CoT reasoning beyond scientific domains and assess its effectiveness on tasks requiring broad commonsense and world knowledge.

Method: Implemented two-stage framework separating rationale generation from answer inference, using gated fusion mechanism with T5-based language models to integrate vision features, with systematic ablation studies.

Result: Vision integration significantly reduces hallucination in rationale generation, but CoT reasoning effectiveness varies substantially across question types, with commonsense reasoning presenting particular challenges.

Conclusion: Provides practical insights for multimodal reasoning systems and identifies key areas for future improvement in cross-domain generalization.

Abstract: While recent work has extended CoT to multimodal settings, achieving state-of-the-art results on science question answering benchmarks like ScienceQA, the generalizability of these approaches across diverse domains remains underexplored. This work presents a comprehensive analysis of Multimodal Chain-of-Thought (Multimodal-CoT) reasoning, evaluating its effectiveness on the A-OKVQA, OKVQA and ChartQA datasets, which requires broad commonsense and world knowledge beyond scientific reasoning. We implement the two-stage framework proposed by Zhang et al. [3], which separates rationale generation from answer inference and integrates vision features through a gated fusion mechanism with T5-based language models. Through systematic ablation studies, we analyze the contributions of vision features, rationale quality, and architectural choices. Our findings reveal that while vision integration significantly reduces hallucination in rationale generation, the effectiveness of CoT reasoning varies substantially across question types, with commonsense reasoning presenting particular challenges. This work provides practical insights for researchers implementing multimodal reasoning systems and identifies key areas for future improvement in cross-domain generalization.

[301] Conversational no-code and multi-agentic disease module identification and drug repurposing prediction with ChatDRex

Simon Süwer, Kester Bagemihl, Sylvie Baier, Lucia Dicunta, Markus List, Jan Baumbach, Andreas Maier, Fernando M. Delgado-Chaves

Main category: cs.AI

TL;DR: ChatDRex is a conversation-based multi-agent system that enables natural language access to biomedical knowledge graphs for network-based drug repurposing prediction, allowing non-experts to conduct complex bioinformatic analyses.

Details

Motivation: Traditional drug repurposing requires collaboration across multiple specialized fields with fragmented tools and heterogeneous data, creating workflow integration challenges that limit accessibility for clinical experts.

Method: Builds on NeDRex knowledge graph with specialized agents for query routing, data retrieval, network analysis, functional coherence evaluation, literature mining, and result visualization, featuring a reasoning module for hallucination detection.

Result: Provides natural language interface that democratizes access to bioinformatics for physicians and researchers without computer science expertise, enabling hypothesis generation and drug repurposing exploration.

Conclusion: ChatDRex accelerates drug discovery by making complex bioinformatic analyses accessible to clinical experts, advancing personalized medicine and translational research through democratized drug repurposing capabilities.

Abstract: Repurposing approved drugs offers a time-efficient and cost-effective alternative to traditional drug development. However, in silico prediction of repurposing candidates is challenging and requires the effective collaboration of specialists in various fields, including pharmacology, medicine, biology, and bioinformatics. Fragmented, specialized algorithms and tools often address only narrow aspects of the overall problem, and heterogeneous, unstructured data landscapes require specialized users to be involved. Hence, these data services do not integrate smoothly across workflows. With ChatDRex, we present a conversation-based, multi-agent system that facilitates the execution of complex bioinformatic analyses aiming for network-based drug repurposing prediction. It builds on the integrated systems medicine knowledge graph NeDRex. ChatDRex provides natural language access to its extensive biomedical KG and integrates bioinformatics agents for network analysis and drug repurposing, complemented by agents for functional coherence evaluation for in silico validation, as well as agents for literature mining and for discussing the obtained results in a scientific context. Its flexible multi-agent design assigns specific tasks to specialized agents, including query routing, data retrieval, algorithm execution, and result visualization. A dedicated reasoning module keeps the user in the loop and allows for hallucination detection. By enabling physicians and researchers without computer science expertise to control complex analyses in natural language, ChatDRex democratizes access to bioinformatics as an important resource for drug repurposing. It enables clinical experts to generate hypotheses and explore drug repurposing opportunities, ultimately accelerating the discovery of novel therapies and advancing personalized medicine and translational research.

[302] Learning Multi-Access Point Coordination in Agentic AI Wi-Fi with Large Language Models

Yifan Fan, Le Liang, Peng Liu, Xiao Li, Ziyang Guo, Qiao Lan, Shi Jin, Wen Tong

Main category: cs.AI

TL;DR: Proposes an Agentic AI Wi-Fi framework where access points act as autonomous LLM agents that collaboratively reason and negotiate adaptive coordination strategies in real-time, outperforming static MAPC protocols.

Details

Motivation: Existing MAPC protocols use static, predefined rules that cannot adapt to dynamic network conditions like varying interference and topologies, limiting their effectiveness in dense Wi-Fi environments.

Method: Models each access point as an autonomous LLM agent with cognitive workflow capabilities including natural language dialogue, memory, reflection, and tool use to enable collaborative reasoning and real-time strategy negotiation.

Result: Comprehensive simulations show the agentic framework successfully adapts to diverse dynamic network environments, significantly outperforming state-of-the-art spatial reuse baselines.

Conclusion: The framework demonstrates potential as a robust and intelligent solution for future wireless networks by enabling dynamic adaptation to changing network conditions through agent collaboration.

Abstract: Multi-access point coordination (MAPC) is a key technology for enhancing throughput in next-generation Wi-Fi within dense overlapping basic service sets. However, existing MAPC protocols rely on static, protocol-defined rules, which limits their ability to adapt to dynamic network conditions such as varying interference levels and topologies. To address this limitation, we propose a novel Agentic AI Wi-Fi framework where each access point, modeled as an autonomous large language model agent, collaboratively reasons about the network state and negotiates adaptive coordination strategies in real time. This dynamic collaboration is achieved through a cognitive workflow that enables the agents to engage in natural language dialogue, leveraging integrated memory, reflection, and tool use to ground their decisions in past experience and environmental feedback. Comprehensive simulation results demonstrate that our agentic framework successfully learns to adapt to diverse and dynamic network environments, significantly outperforming the state-of-the-art spatial reuse baseline and validating its potential as a robust and intelligent solution for future wireless networks.

[303] OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability

Karen Ullrich, Jingtong Su, Claudia Shi, Arjun Subramonian, Amir Bar, Ivan Evtimov, Nikolaos Tsilivis, Randall Balestriero, Julia Kempe, Mark Ibrahim

Main category: cs.AI

TL;DR: OpenApps is a lightweight ecosystem for evaluating UI-Agents across app variations, revealing that reliability fluctuates significantly across different app versions.

Details

Motivation: Current evaluations of autonomous UI-Agents use fixed environments, limiting insights into reliability across app design and content variations encountered in real deployments.

Method: Developed OpenApps with six configurable apps that can generate thousands of versions, running over 10,000 evaluations across seven multimodal agents.

Result: Task success rates fluctuate by over 50% across app variations, with some agents dropping from 63% to 4% success. Agent behaviors like looping and hallucinating also vary significantly.

Conclusion: Measuring reliability across app variations is crucial, as standard evaluations in fixed environments fail to capture real-world performance variability.

Abstract: Reliability is key to realizing the promise of autonomous UI-Agents, multimodal agents that directly interact with apps in the same manner as humans, as users must be able to trust an agent to complete a given task. Current evaluations rely on fixed environments, often clones of existing apps, which are limited in that they can only shed light on whether or how often an agent can complete a task within a specific environment. When deployed however, agents are likely to encounter variations in app design and content that can affect an agent’s ability to complete a task. To address this blind spot of measuring agent reliability across app variations, we develop OpenApps, a light-weight open-source ecosystem with six apps (messenger, calendar, maps, etc.) that are configurable in appearance and content. OpenApps requires just a single CPU to run, enabling easy generation and deployment of thousands of versions of each app. Specifically, we run more than 10,000 independent evaluations to study reliability across seven leading multimodal agents. We find that while standard reliability within a fixed app is relatively stable, reliability can vary drastically when measured across app variations. Task success rates for many agents can fluctuate by more than $50%$ across app variations. For example, Kimi-VL-3B’s average success across all tasks fluctuates from $63%$ to just $4%$ across app versions. We also find agent behaviors such as looping or hallucinating actions can differ drastically depending on the environment configuration. These initial findings highlight the importance of measuring reliability along this new dimension of app variations. OpenApps is available at https://facebookresearch.github.io/OpenApps/

[304] Representation Interventions Enable Lifelong Unstructured Knowledge Control

Xuyuan Liu, Zhengzhang Chen, Xinshuai Dong, Yanchi Liu, Xujiang Zhao, Shengyu Chen, Haoyu Wang, Yujun Yan, Haifeng Chen

Main category: cs.AI

TL;DR: RILKE is a scalable method for lifelong knowledge control in LLMs that uses representation-space interventions to efficiently update knowledge without retraining, maintaining model utility while handling complex unstructured knowledge.

Details

Motivation: LLMs often produce incorrect or outdated content, and updating their knowledge efficiently without costly retraining is challenging, especially for complex unstructured knowledge in lifelong settings where many edits must coexist without interference.

Method: RILKE treats knowledge control as interventions in the model’s representation space, learning paraphrase-robust and edit-localized modules that limit updates to low-dimensional subspaces to minimize interference, with a query-adaptive router for inference.

Result: Evaluation on knowledge editing benchmarks with LLaMA and Qwen models shows RILKE is scalable to large datasets, demonstrating high edit success, strong paraphrase generalization, and preserved general utility with modest memory overhead.

Conclusion: RILKE is an effective and scalable solution for lifelong knowledge control in LLMs, enabling fine-grained control over complex knowledge while maintaining model utility with frozen base weights.

Abstract: Large language models (LLMs) often produce incorrect or outdated content. Updating their knowledge efficiently and accurately without costly retraining is a major challenge. This problem is especially hard for complex, unstructured knowledge in a lifelong setting, where many edits must coexist without interference. We introduce RILKE (Representation Intervention for Lifelong KnowledgE Control), a robust and scalable method that treats knowledge control as interventions within the model’s representation space. Leveraging representation-space expressiveness, we identify two properties enabling RILKE to deliver fine-grained control over complex, unstructured knowledge while maintaining general utility with frozen base weights. During training, RILKE learns paraphrase-robust and edit-localized modules that limit each update to a low-dimensional subspace to minimize cross-edit interference. In inference, a query-adaptive router selects the appropriate module to guide the model’s generation. In evaluation on knowledge editing benchmarks with LLaMA and Qwen models, RILKE is scalable to large-scale datasets, demonstrating high edit success, strong paraphrase generalization, and preserving general utility with modest memory overhead. These results show RILKE is an effective and scalable solution for lifelong knowledge control in LLMs.

[305] Guaranteed Optimal Compositional Explanations for Neurons

Biagio La Rosa, Leilani H. Gilpin

Main category: cs.AI

TL;DR: First framework for computing guaranteed optimal compositional explanations of neuron activations in neural networks, addressing limitations of beam search methods.

Details

Motivation: Current compositional explanations use beam search which lacks theoretical guarantees of optimality, making it unclear how close explanations are to the true optimum.

Method: Proposed decomposition of spatial alignment factors, heuristic for alignment estimation, and first algorithm for computing optimal compositional explanations within feasible time.

Result: Analysis shows 10-40% of beam search explanations are suboptimal when overlapping concepts are involved. Proposed beam-search variant improves runtime and flexibility.

Conclusion: Framework enables guaranteed optimal explanations, revealing significant suboptimality in current methods and providing more efficient alternatives.

Abstract: While neurons are the basic units of deep neural networks, it is still unclear what they learn and if their knowledge is aligned with that of humans. Compositional explanations aim to answer this question by describing the spatial alignment between neuron activations and concepts through logical rules. These logical descriptions are typically computed via a search over all possible concept combinations. Since computing the spatial alignment over the entire state space is computationally infeasible, the literature commonly adopts beam search to restrict the space. However, beam search cannot provide any theoretical guarantees of optimality, and it remains unclear how close current explanations are to the true optimum. In this theoretical paper, we address this gap by introducing the first framework for computing guaranteed optimal compositional explanations. Specifically, we propose: (i) a decomposition that identifies the factors influencing the spatial alignment, (ii) a heuristic to estimate the alignment at any stage of the search, and (iii) the first algorithm that can compute optimal compositional explanations within a feasible time. Using this framework, we analyze the differences between optimal and non-optimal explanations in the most popular settings for compositional explanations, the computer vision domain and Convolutional Neural Networks. In these settings, we demonstrate that 10-40 percent of explanations obtained with beam search are suboptimal when overlapping concepts are involved. Finally, we evaluate a beam-search variant guided by our proposed decomposition and heuristic, showing that it matches or improves runtime over prior methods while offering greater flexibility in hyperparameters and computational resources.

[306] Step-Audio-R1 Technical Report

Fei Tian, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Yuxin Li, Daijiao Liu, Yayue Deng, Donghang Wu, Jun Chen, Liang Zhao, Chengyuan Yao, Hexin Liu, Eng Siong Chng, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu

Main category: cs.AI

TL;DR: Step-Audio-R1 is the first audio reasoning model that successfully enables reasoning in audio domains through Modality-Grounded Reasoning Distillation, outperforming Gemini 2.5 Pro and matching Gemini 3 Pro on audio understanding tasks.

Details

Motivation: Audio language models paradoxically perform better without reasoning, raising questions about whether audio intelligence can benefit from deliberate thinking like text and vision models do.

Method: Proposed Modality-Grounded Reasoning Distillation (MGRD) framework that teaches the model to generate audio-relevant reasoning chains grounded in acoustic features rather than hallucinating disconnected deliberations.

Result: Step-Audio-R1 exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to state-of-the-art Gemini 3 Pro across comprehensive audio understanding benchmarks spanning speech, environmental sounds, and music.

Conclusion: Reasoning is a transferable capability across modalities when properly anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence, opening pathways toward truly multimodal reasoning systems.

Abstract: Recent advances in reasoning models have demonstrated remarkable success in text and vision domains through extended chain-of-thought deliberation. However, a perplexing phenomenon persists in audio language models: they consistently perform better with minimal or no reasoning, raising a fundamental question - can audio intelligence truly benefit from deliberate thinking? We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Through our proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that genuinely ground themselves in acoustic features rather than hallucinating disconnected deliberations. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro across comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. These results demonstrate that reasoning is a transferable capability across modalities when appropriately anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence. By establishing the first successful audio reasoning model, Step-Audio-R1 opens new pathways toward building truly multimodal reasoning systems that think deeply across all sensory modalities.

[307] Learning Individual Behavior in Agent-Based Models with Graph Diffusion Networks

Francesco Cozzi, Marco Pangallo, Alan Perotti, André Panisson, Corrado Monti

Main category: cs.AI

TL;DR: A novel framework that learns differentiable surrogates of Agent-Based Models using diffusion models and graph neural networks, enabling gradient-based optimization while preserving decentralized agent dynamics.

Details

Motivation: Agent-Based Models are non-differentiable, limiting their integration with real-world data and gradient-based optimization methods, which hinders practical applications.

Method: Combines diffusion models to capture behavioral stochasticity and graph neural networks to model agent interactions, directly modeling individual agent behavior rather than system-level outputs.

Result: Validated on Schelling’s segregation model and Predator-Prey ecosystem, showing accurate replication of individual-level patterns and emergent dynamics forecasting beyond training data.

Conclusion: Demonstrates the potential of combining diffusion models and graph learning for creating data-driven, differentiable ABM simulations that preserve the fundamental decentralized nature of agent-based modeling.

Abstract: Agent-Based Models (ABMs) are powerful tools for studying emergent properties in complex systems. In ABMs, agent behaviors are governed by local interactions and stochastic rules. However, these rules are, in general, non-differentiable, limiting the use of gradient-based methods for optimization, and thus integration with real-world data. We propose a novel framework to learn a differentiable surrogate of any ABM by observing its generated data. Our method combines diffusion models to capture behavioral stochasticity and graph neural networks to model agent interactions. Distinct from prior surrogate approaches, our method introduces a fundamental shift: rather than approximating system-level outputs, it models individual agent behavior directly, preserving the decentralized, bottom-up dynamics that define ABMs. We validate our approach on two ABMs (Schelling’s segregation model and a Predator-Prey ecosystem) showing that it replicates individual-level patterns and accurately forecasts emergent dynamics beyond training. Our results demonstrate the potential of combining diffusion models and graph learning for data-driven ABM simulation.

[308] ENACT: Evaluating Embodied Cognition with World Modeling of Egocentric Interaction

Qineng Wang, Wenlong Huang, Yu Zhou, Hang Yin, Tianwei Bao, Jianwen Lyu, Weiyu Liu, Ruohan Zhang, Jiajun Wu, Li Fei-Fei, Manling Li

Main category: cs.AI

TL;DR: ENACT is a benchmark that evaluates embodied cognition in vision-language models through world modeling tasks using egocentric interaction in VQA format, revealing performance gaps between models and humans.

Details

Motivation: To investigate whether modern vision-language models, trained in disembodied ways, exhibit signs of embodied cognition by testing their ability to model the world from egocentric interactions.

Method: Created ENACT benchmark with two sequence reordering tasks: forward world modeling (reorder observations given actions) and inverse world modeling (reorder actions given observations), using robotics simulation data from BEHAVIOR with 8,972 QA pairs.

Result: Performance gap between frontier VLMs and humans that widens with interaction horizon; models perform better on inverse task than forward task; exhibit anthropocentric biases (right-handed preference, degraded performance with non-human camera parameters).

Conclusion: Current VLMs show limitations in embodied cognition capabilities compared to humans, particularly in long-horizon interactions, suggesting need for more embodied training approaches.

Abstract: Embodied cognition argues that intelligence arises from sensorimotor interaction rather than passive observation. It raises an intriguing question: do modern vision-language models (VLMs), trained largely in a disembodied manner, exhibit signs of embodied cognition? We introduce ENACT, a benchmark that casts evaluation of embodied cognition as world modeling from egocentric interaction in a visual question answering (VQA) format. Framed as a partially observable Markov decision process (POMDP) whose actions are scene graph changes, ENACT comprises two complementary sequence reordering tasks: forward world modeling (reorder shuffled observations given actions) and inverse world modeling (reorder shuffled actions given observations). While conceptually simple, solving these tasks implicitly demands capabilities central to embodied cognition-affordance recognition, action-effect reasoning, embodied awareness, and interactive, long-horizon memory from partially observable egocentric input, while avoiding low-level image synthesis that could confound the evaluation. We provide a scalable pipeline that synthesizes QA pairs from robotics simulation (BEHAVIOR) and evaluates models on 8,972 QA pairs spanning long-horizon home-scale activities. Experiments reveal a performance gap between frontier VLMs and humans that widens with interaction horizon. Models consistently perform better on the inverse task than the forward one and exhibit anthropocentric biases, including a preference for right-handed actions and degradation when camera intrinsics or viewpoints deviate from human vision. Website at https://enact-embodied-cognition.github.io/.

[309] Improving Procedural Skill Explanations via Constrained Generation: A Symbolic-LLM Hybrid Architecture

Rahul Dass, Thomas Bowlin, Zebing Li, Xiao Jin, Ashok Goel

Main category: cs.AI

TL;DR: Ivy is an AI coaching system that combines symbolic TMK models with LLMs to generate structured, multi-step explanations for procedural skills, improving explanation quality over pure LLM approaches.

Details

Motivation: LLMs often produce fluent but shallow explanations that miss the causal, goal-directed, and compositional logic needed for effective procedural skill learning in educational contexts.

Method: Combines symbolic Task-Method-Knowledge (TMK) models with a generative LLM layer, where TMK encodes causal transitions, goal hierarchies, and problem decompositions to structurally constrain the LLM’s explanation generation.

Result: Ivy outperforms GPT and retrieval-augmented GPT baselines across three inferential dimensions, with symbolic constraints consistently improving structural quality of explanations for “how” and “why” questions.

Conclusion: This demonstrates a scalable AI for education approach that enhances the pedagogical value of AI-generated explanations in intelligent coaching systems through symbolic structural guidance.

Abstract: In procedural skill learning, instructional explanations must convey not just steps, but the causal, goal-directed, and compositional logic behind them. Large language models (LLMs) often produce fluent yet shallow responses that miss this structure. We present Ivy, an AI coaching system that delivers structured, multi-step explanations by combining symbolic Task-Method-Knowledge (TMK) models with a generative interpretation layer-an LLM that constructs explanations while being constrained by TMK structure. TMK encodes causal transitions, goal hierarchies, and problem decompositions, and guides the LLM within explicit structural bounds. We evaluate Ivy against responses against GPT and retrieval-augmented GPT baselines using expert and independent annotations across three inferential dimensions. Results show that symbolic constraints consistently improve the structural quality of explanations for “how” and “why” questions. This study demonstrates a scalable AI for education approach that strengthens the pedagogical value of AI-generated explanations in intelligent coaching systems.

[310] ICPO: Intrinsic Confidence-Driven Group Relative Preference Optimization for Efficient Reinforcement Learning

Jinpeng Wang, Chao Li, Ting Ye, Mengyuan Zhang, Wei Liu, Jian Luan

Main category: cs.AI

TL;DR: ICPO method enhances LLM reasoning by using intrinsic confidence and relative preference optimization to address RLVR limitations like coarse rewards and inefficient exploration.

Details

Motivation: Existing RLVR methods suffer from coarse-grained rewards, reward noise, and inefficient exploration, leading to unstable training and entropy collapse in LLM reasoning.

Method: ICPO calculates preference advantage scores by comparing generation probabilities of multiple responses under the same prompt, integrating these with verifiable rewards to guide exploration.

Result: ICPO alleviates coarse-grained rewards and reward noise, curbs overconfident errors, enhances undervalued high-quality responses, and prevents overfitting to specific strategies.

Conclusion: Comprehensive experiments across multiple benchmarks show ICPO steadily boosts reasoning performance compared to GRPO.

Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) demonstrates significant potential in enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing RLVR methods are often constrained by issues such as coarse-grained rewards, reward noise, and inefficient exploration, which lead to unstable training and entropy collapse. To address this challenge, we propose the Intrinsic Confidence-Driven Group Relative Preference Optimization method (ICPO). The intuition behind it lies in the fact that the probabilities of an LLM generating different responses can inherently and directly reflect its self-assessment of the reasoning process. Inspired by the idea of preference modeling, ICPO calculates a preference advantage score for each response by comparing the relative generation probabilities of multiple responses under the same input prompt, and integrates this score with verifiable rewards to guide the exploration process. We have discovered that the preference advantage score not only alleviates the issues of coarse-grained rewards and reward noise but also effectively curbs overconfident errors, enhances the relative superiority of undervalued high-quality responses, and prevents the model from overfitting to specific strategies, thereby facilitating more thorough exploration. Comprehensive experiments across four general-domain benchmarks and three mathematical benchmarks demonstrate that ICPO steadily boosts reasoning compared to GRPO.

[311] Towards Trustworthy Legal AI through LLM Agents and Formal Reasoning

Linze Chen, Yufan Cai, Zhe Hou, Jinsong Dong

Main category: cs.AI

TL;DR: L4M is a framework that combines LLM agents with SMT-solver proofs to bridge natural language interpretation with symbolic verification for legal reasoning, outperforming advanced LLMs and legal AI systems.

Details

Motivation: Existing LLM-based systems lack the guarantees required for principled jurisprudence, excelling only at surface-level text analysis without formal rationality guarantees.

Method: Three-phase pipeline: (1) Statute Formalization converts legal provisions to logical formulae, (2) Dual Fact and Statute Extraction uses prosecutor- and defense-aligned LLMs for role-isolated argument mapping, (3) Solver-Centric Adjudication compiles arguments into logic constraints with iterative self-critique until satisfiable formula.

Result: Experimental results show L4M surpasses GPT-4-mini, DeepSeek-V3, Claude 4, and state-of-the-art Legal AI baselines while providing rigorous symbolic justifications.

Conclusion: L4M successfully unites interpretive flexibility of natural language with rigor of symbolic verification, enabling transparent and explainable legal decision-making with formal guarantees.

Abstract: The rationality of law manifests in two forms: substantive rationality, which concerns the fairness or moral desirability of outcomes, and formal rationality, which requires legal decisions to follow explicitly stated, general, and logically coherent rules. Existing LLM-based systems excel at surface-level text analysis but lack the guarantees required for principled jurisprudence. We introduce L4M, a novel framework that combines adversarial LLM agents with SMT-solver-backed proofs to unite the interpretive flexibility of natural language with the rigor of symbolic verification. The pipeline consists of three phases: (1) Statute Formalization, where domain-specific prompts convert legal provisions into logical formulae; (2) Dual Fact and Statute Extraction, in which prosecutor- and defense-aligned LLMs independently map case narratives to fact tuples and statutes, ensuring role isolation; and (3) Solver-Centric Adjudication, where an autoformalizer compiles both parties’ arguments into logic constraints, and unsat cores trigger iterative self-critique until a satisfiable formula is achieved, which is then verbalized by a Judge-LLM into a transparent verdict and optimized sentence. Experimental results on public benchmarks show that our system surpasses advanced LLMs including GPT-o4-mini, DeepSeek-V3, and Claude 4 as well as state-of-the-art Legal AI baselines, while providing rigorous and explainable symbolic justifications.

[312] OVOD-Agent: A Markov-Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection

Chujie Wang, Jianyu Lu, Zhiyuan Luo, Xi Chen, Chu He

Main category: cs.AI

TL;DR: OVOD-Agent transforms passive category matching into proactive visual reasoning and self-evolving detection using a Visual Chain-of-Thought approach and Weakly Markovian Decision Process modeling.

Details

Motivation: Existing Open-Vocabulary Object Detection methods have a gap between multimodal training and unimodal inference, with textual space being underexplored despite its potential to significantly improve performance.

Method: Proposes OVOD-Agent with Visual-CoT for interpretable reasoning, models visual context as w-MDP over eight state spaces, uses Bandit module for exploration under limited supervision, and integrates Markov transitions with Bandit trajectories for self-supervised Reward Model optimization.

Result: Experiments on COCO and LVIS show consistent improvements across OVOD backbones, particularly on rare categories.

Conclusion: The proposed framework effectively bridges the training-inference gap in OVOD through proactive visual reasoning and self-evolving detection mechanisms.

Abstract: Open-Vocabulary Object Detection (OVOD) aims to enable detectors to generalize across categories by leveraging semantic information. Although existing methods are pretrained on large vision-language datasets, their inference is still limited to fixed category names, creating a gap between multimodal training and unimodal inference. Previous work has shown that improving textual representation can significantly enhance OVOD performance, indicating that the textual space is still underexplored. To this end, we propose OVOD-Agent, which transforms passive category matching into proactive visual reasoning and self-evolving detection. Inspired by the Chain-of-Thought (CoT) paradigm, OVOD-Agent extends the textual optimization process into an interpretable Visual-CoT with explicit actions. OVOD’s lightweight nature makes LLM-based management unsuitable; instead, we model visual context transitions as a Weakly Markovian Decision Process (w-MDP) over eight state spaces, which naturally represents the agent’s state, memory, and interaction dynamics. A Bandit module generates exploration signals under limited supervision, helping the agent focus on uncertain regions and adapt its detection policy. We further integrate Markov transition matrices with Bandit trajectories for self-supervised Reward Model (RM) optimization, forming a closed loop from Bandit exploration to RM learning. Experiments on COCO and LVIS show that OVOD-Agent provides consistent improvements across OVOD backbones, particularly on rare categories, confirming the effectiveness of the proposed framework.

[313] Causality Without Causal Models

Joseph Y. Halpern, Rafael Pass

Main category: cs.AI

TL;DR: The paper abstracts Halpern and Pearl’s causality definition to work with any model where counterfactuals are defined, enabling broader applications including handling complex formulas and extending to explanation.

Details

Motivation: To extend Halpern and Pearl's causality definition beyond causal models to any model with counterfactuals, overcoming limitations in handling complex logical formulas and enabling broader applications.

Method: Abstracting the key features of Halpern and Pearl’s causality definition to create a generalized version that can be applied to any model where counterfactuals are defined.

Result: The abstracted definition can handle formulas with disjunctions, negations, beliefs, and nested counterfactuals, and can be applied to models allowing backtracking, while also enabling extension to abstract explanation definitions.

Conclusion: Abstracting causality definitions provides broader applicability, deeper understanding of the original definition, and enables extension to explanation concepts beyond causal models.

Abstract: Perhaps the most prominent current definition of (actual) causality is due to Halpern and Pearl. It is defined using causal models (also known as structural equations models). We abstract the definition, extracting its key features, so that it can be applied to any other model where counterfactuals are defined. By abstracting the definition, we gain a number of benefits. Not only can we apply the definition in a wider range of models, including ones that allow, for example, backtracking, but we can apply the definition to determine if A is a cause of B even if A and B are formulas involving disjunctions, negations, beliefs, and nested counterfactuals (none of which can be handled by the Halpern-Pearl definition). Moreover, we can extend the ideas to getting an abstract definition of explanation that can be applied beyond causal models. Finally, we gain a deeper understanding of features of the definition even in causal models.

[314] New Hybrid Heuristics for Pseudo-Boolean Propagation

Mia Müßig, Jan Johannsen

Main category: cs.AI

TL;DR: New heuristics for hybrid unit propagation in pseudo-boolean solving outperform current methods in RoundingSAT.

Details

Motivation: Current hybrid unit propagation strategies combining watched literal scheme with counting method are successful but can be improved.

Method: Introduces new heuristics for making hybrid decisions in unit propagation for pseudo-boolean solving.

Result: The new heuristics drastically outperform the current method in the RoundingSAT solver.

Conclusion: The proposed heuristics significantly improve performance of hybrid unit propagation in pseudo-boolean solving.

Abstract: In pseudo-boolean solving the currently most successful unit propagation strategy is a hybrid mode combining the watched literal scheme with the counting method. This short paper introduces new heuristics for this hybrid decision, which are able to drastically outperform the current method in the RoundingSAT solver.

[315] EWE: An Agentic Framework for Extreme Weather Analysis

Zhe Jiang, Jiong Wang, Xiaoyu Yue, Zijie Guo, Wenlong Zhang, Fenghua Ling, Wanli Ouyang, Lei Bai

Main category: cs.AI

TL;DR: EWE is the first AI framework for automated extreme weather diagnosis, using knowledge-guided planning and meteorological tools to analyze raw data and generate visualizations, with a new benchmark for evaluation.

Details

Motivation: Extreme weather events are increasing globally, but current manual diagnostic approaches create analytical bottlenecks that hinder scientific progress in understanding their mechanisms.

Method: EWE emulates expert workflows through knowledge-guided planning, closed-loop reasoning, and a domain-tailored meteorological toolkit to autonomously produce and interpret multimodal visualizations from raw meteorological data.

Result: The framework successfully performs comprehensive diagnostic analyses and the authors introduce the first benchmark with 103 high-impact events and a step-wise evaluation metric for this emerging field.

Conclusion: EWE represents a step toward automated scientific discovery and has potential to democratize weather expertise, especially benefiting developing nations vulnerable to extreme weather.

Abstract: Extreme weather events pose escalating risks to global society, underscoring the urgent need to unravel their underlying physical mechanisms. Yet the prevailing expert-driven, labor-intensive diagnostic paradigm has created a critical analytical bottleneck, stalling scientific progress. While AI for Earth Science has achieved notable advances in prediction, the equally essential challenge of automated diagnostic reasoning remains largely unexplored. We present the Extreme Weather Expert (EWE), the first intelligent agent framework dedicated to this task. EWE emulates expert workflows through knowledge-guided planning, closed-loop reasoning, and a domain-tailored meteorological toolkit. It autonomously produces and interprets multimodal visualizations from raw meteorological data, enabling comprehensive diagnostic analyses. To catalyze progress, we introduce the first benchmark for this emerging field, comprising a curated dataset of 103 high-impact events and a novel step-wise evaluation metric. EWE marks a step toward automated scientific discovery and offers the potential to democratize expertise and intellectual resources, particularly for developing nations vulnerable to extreme weather.

[316] MADRA: Multi-Agent Debate for Risk-Aware Embodied Planning

Junjian Wang, Lidan Zhao, Xi Sheryl Zhang

Main category: cs.AI

TL;DR: MADRA is a training-free multi-agent debate framework that enhances safety assessment for embodied AI agents through collective reasoning, reducing false rejections while maintaining high safety sensitivity.

Details

Motivation: Existing methods for ensuring safety of embodied AI agents suffer from high computational costs or over-rejection of safe tasks, creating a need for more efficient and accurate safety assessment.

Method: Proposes MADRA framework using multiple LLM agents to debate instruction safety with iterative deliberation and consensus voting, plus hierarchical cognitive planning with safety, memory, and self-evolution mechanisms.

Result: Achieves over 90% rejection of unsafe tasks with low safe-task rejection, outperforming existing methods in safety and execution efficiency on AI2-THOR and VirtualHome benchmarks.

Conclusion: Provides a scalable, model-agnostic solution for trustworthy embodied agents that balances safety and task performance without requiring training.

Abstract: Ensuring the safety of embodied AI agents during task planning is critical for real-world deployment, especially in household environments where dangerous instructions pose significant risks. Existing methods often suffer from either high computational costs due to preference alignment training or over-rejection when using single-agent safety prompts. To address these limitations, we propose MADRA, a training-free Multi-Agent Debate Risk Assessment framework that leverages collective reasoning to enhance safety awareness without sacrificing task performance. MADRA employs multiple LLM-based agents to debate the safety of a given instruction, guided by a critical evaluator that scores responses based on logical soundness, risk identification, evidence quality, and clarity. Through iterative deliberation and consensus voting, MADRA significantly reduces false rejections while maintaining high sensitivity to dangerous tasks. Additionally, we introduce a hierarchical cognitive collaborative planning framework that integrates safety, memory, planning, and self-evolution mechanisms to improve task success rates through continuous learning. We also contribute SafeAware-VH, a benchmark dataset for safety-aware task planning in VirtualHome, containing 800 annotated instructions. Extensive experiments on AI2-THOR and VirtualHome demonstrate that our approach achieves over 90% rejection of unsafe tasks while ensuring that safe-task rejection is low, outperforming existing methods in both safety and execution efficiency. Our work provides a scalable, model-agnostic solution for building trustworthy embodied agents.

[317] SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

Peiran Xu, Sudong Wang, Yao Zhu, Jianing Li, Yunjian Zhang

Main category: cs.AI

TL;DR: Proposes a hierarchical spatial cognition framework and SpatialBench benchmark to systematically evaluate multimodal LLMs’ spatial reasoning across 5 cognitive levels, revealing models’ limitations in higher-level reasoning despite strong perceptual abilities.

Details

Motivation: Existing benchmarks oversimplify spatial cognition as single-dimensional metrics, failing to capture the hierarchical structure and interdependence of spatial abilities in multimodal intelligence.

Method: Developed a hierarchical framework decomposing spatial intelligence into 5 progressive levels, constructed SpatialBench with 15 tasks across these levels, and introduced a unified capability-oriented evaluation metric.

Result: Models show strong perceptual grounding but limited symbolic reasoning, causal inference, and planning; humans perform goal-directed abstraction while MLLMs over-attend to surface details without coherent spatial intent.

Conclusion: Establishes the first systematic framework for measuring hierarchical spatial cognition in MLLMs, laying foundation for future spatially intelligent systems.

Abstract: Spatial cognition is fundamental to real-world multimodal intelligence, allowing models to effectively interact with the physical environment. While multimodal large language models (MLLMs) have made significant strides, existing benchmarks often oversimplify spatial cognition, reducing it to a single-dimensional metric, which fails to capture the hierarchical structure and interdependence of spatial abilities. To address this gap, we propose a hierarchical spatial cognition framework that decomposes spatial intelligence into five progressively complex levels from basic observation to high-level planning. Building upon this taxonomy, we construct SpatialBench, a large-scale, fine-grained benchmark covering 15 tasks aligned with these cognitive levels. To provide a unified evaluation across heterogeneous tasks, we further introduce a high-level capability-oriented metric that reliably assesses a model’s overall spatial reasoning ability. Extensive experiments over massive MLLMs reveal distinct performance stratification across cognitive levels: models exhibit strong perceptual grounding yet remain limited in symbolic reasoning, causal inference, and planning. Additional human tests demonstrate that humans perform selective, goal-directed abstraction, while MLLMs tend to over-attend to surface details without coherent spatial intent. Our work establishes the first systematic framework for measuring hierarchical spatial cognition in MLLMs, laying the foundation for future spatially intelligent systems.

[318] Pessimistic Verification for Open Ended Math Questions

Yanxing Huang, Zihan Tang, Zejin Lin, Peng Li, Yang Liu

Main category: cs.AI

TL;DR: Pessimistic verification improves math proof verification by running multiple parallel checks and deeming proofs incorrect if any check reports an error, significantly boosting performance across benchmarks without high computational cost.

Details

Motivation: The key limitation in verification performance is error detection capability, so the authors sought to improve verification of open-ended math questions through simple but effective workflows.

Method: Designed pessimistic verification variants that construct multiple parallel verifications for the same proof, where the proof is considered incorrect if any verification reports an error.

Result: Significantly improved performance across many math verification benchmarks without substantial computational resources, with token efficiency surpassing extended long-CoT in test-time scaling. Case studies showed many false negatives in stronger models were actually dataset annotation errors.

Conclusion: Pessimistic verification effectively improves reliability and performance of language model outputs for mathematical problems and enables long-horizon mathematical tasks, with potential to enhance mathematical capabilities across a wide range of tasks.

Abstract: The key limitation of the verification performance lies in the ability of error detection. With this intuition we designed several variants of pessimistic verification, which are simple workflows that could significantly improve the verification of open-ended math questions. In pessimistic verification we construct multiple parallel verifications for the same proof, and the proof is deemed incorrect if any one of them reports an error. This simple technique significantly improves the performance across many math verification benchmarks without incurring substantial computational resources. Its token efficiency even surpassed extended long-CoT in test-time scaling. Our case studies further indicate that the majority of false negatives in stronger models are actually caused by annotation errors in the original dataset, so our method’s performance is in fact underestimated. Self-verification for mathematical problems can effectively improve the reliability and performance of language model outputs, and it also plays a critical role in enabling long-horizon mathematical tasks. We believe that research on pessimistic verification will help enhance the mathematical capabilities of language models across a wide range of tasks.

[319] Self-Transparency Failures in Expert-Persona LLMs: A Large-Scale Behavioral Audit

Alex Diep

Main category: cs.AI

TL;DR: Language models show inconsistent AI identity disclosure across professional personas, with disclosure rates ranging from 2.8% to 73.6% depending on domain, creating risks of misplaced user trust.

Details

Motivation: To examine whether language models can reliably disclose their AI identity when assigned professional personas in high-stakes domains, where failure to do so could lead to user harm through false expertise.

Method: Used a common-garden design to audit 16 open-weight models (4B-671B parameters) across 19,200 trials, testing disclosure rates across different professional personas and analyzing the relationship with model parameters and identity.

Result: Models showed sharp domain-specific inconsistency in disclosure (30.8% for Financial Advisor vs 3.5% for Neurosurgeon), with reasoning optimization actively suppressing self-transparency (up to 48.4% lower disclosure). Model identity predicted behavior better than parameter count.

Conclusion: Transparency reflects training factors rather than scale, and organizations cannot assume safety properties transfer to deployment contexts, requiring deliberate behavior design and empirical verification.

Abstract: If a language model cannot reliably disclose its AI identity in expert contexts, users cannot trust its competence boundaries. This study examines self-transparency in models assigned professional personas within high-stakes domains where false expertise risks user harm. Using a common-garden design, sixteen open-weight models (4B–671B parameters) were audited across 19,200 trials. Models exhibited sharp domain-specific inconsistency: a Financial Advisor persona elicited 30.8% disclosure initially, while a Neurosurgeon persona elicited only 3.5%. This creates preconditions for a “Reverse Gell-Mann Amnesia” effect, where transparency in some domains leads users to overgeneralize trust to contexts where disclosure fails. Disclosure ranged from 2.8% to 73.6%, with a 14B model reaching 61.4% while a 70B produced just 4.1%. Model identity predicted behavior better than parameter count ($ΔR_{adj}^{2} = 0.359$ vs 0.018). Reasoning optimization actively suppressed self-transparency in some models, with reasoning variants showing up to 48.4% lower disclosure than base counterparts. Bayesian validation with Rogan–Gladen correction confirmed robustness to measurement error ($κ= 0.908$). These findings demonstrate transparency reflects training factors rather than scale. Organizations cannot assume safety properties transfer to deployment contexts, requiring deliberate behavior design and empirical verification.

[320] From Prediction to Foresight: The Role of AI in Designing Responsible Futures

Maria Perez-Ortiz

Main category: cs.AI

TL;DR: This paper introduces ‘responsible computational foresight’ as a framework combining human-centric AI and computational modeling to help policymakers navigate future uncertainties ethically and proactively.

Details

Motivation: To address rapid technological changes and complex global challenges by developing ethical frameworks for future planning that leverage AI while maintaining human-centered decision-making.

Method: Establishes foundational principles for responsible computational foresight and presents AI-driven foresight tools, including simulations and scenario analysis, that complement policymaker judgment.

Result: AI enhances policymakers’ ability to address uncertainty, evaluate risks, and devise sustainable strategies, but serves as a supportive tool rather than a replacement for human intelligence and ethical decision-making.

Conclusion: Advocates for thoughtful integration of AI into foresight practices to empower policymakers in confronting 21st century challenges while maintaining ethical, human-centered approaches to future design.

Abstract: In an era marked by rapid technological advancements and complex global challenges, responsible foresight has emerged as an essential framework for policymakers aiming to navigate future uncertainties and shape the future. Responsible foresight entails the ethical anticipation of emerging opportunities and risks, with a focus on fostering proactive, sustainable, and accountable future design. This paper coins the term “responsible computational foresight”, examining the role of human-centric artificial intelligence and computational modeling in advancing responsible foresight, establishing a set of foundational principles for this new field and presenting a suite of AI-driven foresight tools currently shaping it. AI, particularly in conjunction with simulations and scenario analysis, enhances policymakers’ ability to address uncertainty, evaluate risks, and devise strategies geared toward sustainable, resilient futures. However, responsible foresight extends beyond mere technical forecasting; it demands a nuanced understanding of the interdependencies within social, environmental, economic and political systems, alongside a commitment to ethical, long-term decision-making that supports human intelligence. We argue that AI will play a role as a supportive tool in responsible, human-centered foresight, complementing rather than substituting policymaker judgment to enable the proactive shaping of resilient and ethically sound futures. This paper advocates for the thoughtful integration of AI into foresight practices to empower policymakers and communities as they confront the grand challenges of the 21st century.

[321] On the Limits of Innate Planning in Large Language Models

Charles Schepanowski, Charles Ling

Main category: cs.AI

TL;DR: LLMs struggle with planning and stateful reasoning in 8-puzzle tasks, showing limitations in maintaining internal state and heuristic planning even with external assistance.

Details

Motivation: To evaluate LLMs' capacity for planning and state tracking without external tools, using the 8-puzzle as a precise testbed.

Method: Tested four LLMs on 8-puzzle under various prompting conditions (Zero-Shot, Chain-of-Thought, Algorithm-of-Thought) with tiered corrective feedback and external move validation.

Result: Feedback improved some success rates but runs were inefficient. With external move validation, no models solved any puzzles. Models showed brittle state representations and weak heuristic planning.

Conclusion: Current LLMs have substantial limitations in planning without external tools, requiring mechanisms for explicit state maintenance and structured search.

Abstract: Large language models (LLMs) achieve impressive results on many benchmarks, yet their capacity for planning and stateful reasoning remains unclear. We study these abilities directly, without code execution or other tools, using the 8-puzzle: a classic task that requires state tracking and goal-directed planning while allowing precise, step-by-step evaluation. Four models are tested under common prompting conditions (Zero-Shot, Chain-of-Thought, Algorithm-of-Thought) and with tiered corrective feedback. Feedback improves success rates for some model-prompt combinations, but many successful runs are long, computationally expensive, and indirect. We then examine the models with an external move validator that provides only valid moves. Despite this level of assistance, none of the models solve any puzzles in this setting. Qualitative analysis reveals two dominant deficits across all models: (1) brittle internal state representations, leading to frequent invalid moves, and (2) weak heuristic planning, with models entering loops or selecting actions that do not reduce the distance to the goal state. These findings indicate that, in the absence of external tools such as code interpreters, current LLMs have substantial limitations in planning and that further progress may require mechanisms for maintaining explicit state and performing structured search.

[322] Bridging the Unavoidable A Priori: A Framework for Comparative Causal Modeling

Peter S. Hovmand, Kari O’Donnell, Callie Ogland-Hand, Brian Biroscak, Douglas D. Gunzler

Main category: cs.AI

TL;DR: This paper integrates system dynamics and structural equation modeling into a unified mathematical framework to address biases in AI/ML systems and advance responsible AI development.

Details

Motivation: To overcome the challenge of combining methods with different underlying assumptions (Dana Meadow's 'the unavoidable a priori') and enable richer causal modeling for responsible AI/ML development.

Method: Develops a common mathematical framework that brings together system dynamics and structural equation modeling to generate systems from distributions, develop methods, and compare results.

Result: Creates a unified approach that can inform the underlying epistemology of system dynamics for data science and AI/ML applications.

Conclusion: The integration of system dynamics and structural equation modeling provides a foundation for advancing responsible AI/ML by enabling better causal understanding and addressing unintended consequences like human bias amplification.

Abstract: AI/ML models have rapidly gained prominence as innovations for solving previously unsolved problems and their unintended consequences from amplifying human biases. Advocates for responsible AI/ML have sought ways to draw on the richer causal models of system dynamics to better inform the development of responsible AI/ML. However, a major barrier to advancing this work is the difficulty of bringing together methods rooted in different underlying assumptions (i.e., Dana Meadow’s “the unavoidable a priori”). This paper brings system dynamics and structural equation modeling together into a common mathematical framework that can be used to generate systems from distributions, develop methods, and compare results to inform the underlying epistemology of system dynamics for data science and AI/ML applications.

[323] Agentic Learner with Grow-and-Refine Multimodal Semantic Memory

Weihao Bo, Shan Zhang, Yanpeng Sun, Jingjing Wu, Qunyi Xie, Xiao Tan, Kunbin Chen, Wei He, Xiaofan Li, Na Zhao, Jingdong Wang, Zechao Li

Main category: cs.AI

TL;DR: ViLoMem is a dual-stream memory framework for MLLMs that separately encodes visual distraction patterns and logical reasoning errors to enable learning from both successful and failed experiences, improving performance across multimodal benchmarks.

Details

Motivation: Existing memory-augmented agents suffer from brevity bias and single-modality memory traces, failing to preserve how visual attention and logical reasoning jointly contributed to solutions, which is misaligned with human multimodal semantic memory.

Method: ViLoMem constructs compact, schema-based dual-stream memory that separately encodes visual distraction patterns and logical reasoning errors, following a grow-and-refine principle to incrementally accumulate and update multimodal semantic knowledge.

Result: Across six multimodal benchmarks, ViLoMem consistently improves pass@1 accuracy and substantially reduces repeated visual and logical errors, with ablations confirming the necessity of dual-stream memory with explicit distraction-hallucination separation.

Conclusion: The framework demonstrates the value of error-aware multimodal memory for lifelong and cross-domain agentic learning, preserving stable, generalizable strategies while avoiding catastrophic forgetting.

Abstract: MLLMs exhibit strong reasoning on isolated queries, yet they operate de novo – solving each problem independently and often repeating the same mistakes. Existing memory-augmented agents mainly store past trajectories for reuse. However, trajectory-based memory suffers from brevity bias, gradually losing essential domain knowledge. More critically, even in truly multimodal problem-solving settings, it records only a single-modality trace of past behavior, failing to preserve how visual attention and logical reasoning jointly contributed to the solution. This is fundamentally misaligned with human cognition: semantic memory is both multimodal and integrated, preserving visual and abstract knowledge through coordinated but distinct representational streams. We thus introduce ViLoMem, a dual-stream memory framework that constructs compact, schema-based memory. It separately encodes visual distraction patterns and logical reasoning errors, enabling MLLMs to learn from their successful and failed experiences. Following a grow-and-refine principle, the system incrementally accumulates and updates multimodal semantic knowledge – preserving stable, generalizable strategies while avoiding catastrophic forgetting. Across six multimodal benchmarks, ViLoMem consistently improves pass@1 accuracy and substantially reduces repeated visual and logical errors. Ablations confirm the necessity of dual-stream memory with explicit distraction–hallucination separation, demonstrating the value of error-aware multimodal memory for lifelong and cross-domain agentic learning. Our project page will be available at https://weihao-bo.github.io/ViLoMeo-page.

[324] Earth Observation Satellite Scheduling with Graph Neural Networks and Monte Carlo Tree Search

Antoine Jacquet, Guillaume Infantes, Emmanuel Benazera, Vincent Baudoui, Jonathan Guerra, Stéphanie Roussel

Main category: cs.AI

TL;DR: This paper presents a Graph Neural Network and Deep Reinforcement Learning approach for Earth Observation Satellite Planning, which learns on small instances and generalizes to larger real-world problems with competitive performance.

Details

Motivation: Earth Observation Satellite Planning is a difficult oversubscribed optimization problem where traditional heuristic approaches have limitations, creating a need for more advanced learning-based methods.

Method: Uses Graph Neural Networks to extract information from problem graphs and Deep Reinforcement Learning to search for optimal schedules, with a post-learning Monte Carlo Tree Search step for further improvement.

Result: The approach successfully learns on small problem instances and generalizes to larger real-world instances, achieving very competitive performance compared to traditional methods.

Conclusion: The GNN+DRL framework with MCTS post-processing provides an effective solution for EOSP that can scale from training on small instances to solving large practical problems.

Abstract: Earth Observation Satellite Planning (EOSP) is a difficult optimization problem with considerable practical interest. A set of requested observations must be scheduled on an agile Earth observation satellite while respecting constraints on their visibility window, as well as maneuver constraints that impose varying delays between successive observations. In addition, the problem is largely oversubscribed: there are much more candidate observations than can possibly be achieved. Therefore, one must select the set of observations that will be performed while maximizing their cumulative benefit and propose a feasible schedule for these observations. As previous work mostly focused on heuristic and iterative search algorithms, this paper presents a new technique for selecting and scheduling observations based on Graph Neural Networks (GNNs) and Deep Reinforcement Learning (DRL). GNNs are used to extract relevant information from the graphs representing instances of the EOSP, and DRL drives the search for optimal schedules. A post-learning search step based on Monte Carlo Tree Search (MCTS) is added that is able to find even better solutions. Experiments show that it is able to learn on small problem instances and generalize to larger real-world instances, with very competitive performance compared to traditional approaches.

[325] Co-PatcheR: Collaborative Software Patching with Component(s)-specific Small Reasoning Models

Yuheng Tang, Hongwei Li, Kaijie Zhu, Michael Yang, Yangruibo Ding, Wenbo Guo

Main category: cs.AI

TL;DR: Co-PatcheR is a collaborative software patching system that uses specialized small models for different patching tasks, achieving 46% resolved rate on SWE-bench-Verified with only 3×14B models.

Details

Motivation: Current end-to-end patching models struggle as different sub-tasks require different expertise, with SOTA methods only reaching 41% resolved rate using a 70B model. The collaborative nature of software development inspires a specialized approach.

Method: Uses three specialized 14B models: (1) localization model with two-step suspicious line pinpointing, (2) generation model combining patch generation and critique, and (3) hybrid validation with two models for test case creation and correctness judgment, followed by majority vote-based patch selection.

Result: Achieves 46% resolved rate on SWE-bench-Verified, outperforming SOTA methods while using smaller models (3×14B vs 70B) and requiring less training resources.

Conclusion: Collaborative specialized models are more effective than monolithic end-to-end models for software patching, demonstrating better performance with smaller models and reduced resource requirements.

Abstract: Motivated by the success of general-purpose large language models (LLMs) in software patching, recent works started to train specialized patching models. Most works trained one model to handle the end-to-end patching pipeline (including issue localization, patch generation, and patch validation). However, it is hard for a small model to handle all tasks, as different sub-tasks have different workflows and require different expertise. As such, by using a 70 billion model, SOTA methods can only reach up to 41% resolved rate on SWE-bench-Verified. Motivated by the collaborative nature, we propose Co-PatcheR, the first collaborative patching system with small and specialized reasoning models for individual components. Our key technique novelties are the specific task designs and training recipes. First, we train a model for localization and patch generation. Our localization pinpoints the suspicious lines through a two-step procedure, and our generation combines patch generation and critique. We then propose a hybrid patch validation that includes two models for crafting issue-reproducing test cases with and without assertions and judging patch correctness, followed by a majority vote-based patch selection. Through extensive evaluation, we show that Co-PatcheR achieves 46% resolved rate on SWE-bench-Verified with only 3 x 14B models. This makes Co-PatcheR the best patcher with specialized models, requiring the least training resources and the smallest models. We conduct a comprehensive ablation study to validate our recipes, as well as our choice of training data number, model size, and testing-phase scaling strategy.

[326] Safe and Economical UAV Trajectory Planning in Low-Altitude Airspace: A Hybrid DRL-LLM Approach with Compliance Awareness

Yanwei Gong, Junchao Fan, Ruichen Zhang, Dusit Niyato, Yingying Yao, Xiaolin Chang

Main category: cs.AI

TL;DR: Proposes a UAV trajectory planning framework combining DRL with LLM reasoning to address safety, compliance, and economic efficiency in low-altitude urban environments.

Details

Motivation: The rapid growth of low-altitude economy and UAV adoption creates challenges in complex urban environments, with existing studies overlooking key factors like airspace constraints and economic efficiency.

Method: Combines deep reinforcement learning (DRL) with large language model (LLM) reasoning for UAV trajectory planning.

Result: Significantly outperforms existing baselines across multiple metrics: data collection rate, collision avoidance, successful landing, regulatory compliance, and energy efficiency.

Conclusion: Validates the effectiveness of the DRL+LLM approach in addressing UAV trajectory planning challenges under low-altitude economy networking constraints.

Abstract: The rapid growth of the low-altitude economy has driven the widespread adoption of unmanned aerial vehicles (UAVs). This growing deployment presents new challenges for UAV trajectory planning in complex urban environments. However, existing studies often overlook key factors, such as urban airspace constraints and economic efficiency, which are essential in low-altitude economy contexts. Deep reinforcement learning (DRL) is regarded as a promising solution to these issues, while its practical adoption remains limited by low learning efficiency. To overcome this limitation, we propose a novel UAV trajectory planning framework that combines DRL with large language model (LLM) reasoning to enable safe, compliant, and economically viable path planning. Experimental results demonstrate that our method significantly outperforms existing baselines across multiple metrics, including data collection rate, collision avoidance, successful landing, regulatory compliance, and energy efficiency. These results validate the effectiveness of our approach in addressing UAV trajectory planning key challenges under constraints of the low-altitude economy networking.

[327] CoMind: Towards Community-Driven Agents for Machine Learning Engineering

Sijie Li, Weiwei Sun, Shanda Li, Ameet Talwalkar, Yiming Yang

Main category: cs.AI

TL;DR: CoMind is a multi-agent system that integrates external knowledge from simulated research communities to automate ML engineering, achieving top performance in both past and live Kaggle competitions.

Details

Motivation: To bridge the gap where current LLM agents operate in isolation without engaging with broader research communities, unlike human researchers who gain insights through knowledge sharing.

Method: MLE-Live evaluation framework assesses agent communication with simulated Kaggle communities; CoMind uses iterative parallel exploration to develop multiple solutions simultaneously, balancing breadth and depth.

Result: CoMind achieved 36% medal rate on 75 past Kaggle competitions and outperformed 92.6% of human competitors in live competitions, placing top 5% on three leaderboards and top 1% on one.

Conclusion: The framework enables effective knowledge integration from research communities, and CoMind demonstrates state-of-the-art performance in automated ML engineering through collaborative multi-agent approaches.

Abstract: Large language model (LLM) agents show promise in automating machine learning (ML) engineering. However, existing agents typically operate in isolation on a given research problem, without engaging with the broader research community, where human researchers often gain insights and contribute by sharing knowledge. To bridge this gap, we introduce MLE-Live, a live evaluation framework designed to assess an agent’s ability to communicate with and leverage collective knowledge from a simulated Kaggle research community. Building on this framework, we propose CoMind, an multi-agent system designed to actively integrate external knowledge. CoMind employs an iterative parallel exploration mechanism, developing multiple solutions simultaneously to balance exploratory breadth with implementation depth. On 75 past Kaggle competitions within our MLE-Live framework, CoMind achieves a 36% medal rate, establishing a new state of the art. Critically, when deployed in eight live, ongoing competitions, CoMind outperforms 92.6% of human competitors on average, placing in the top 5% on three official leaderboards and the top 1% on one.

[328] Augur: Modeling Covariate Causal Associations in Time Series via Large Language Models

Zhiqing Cui, Binwu Wang, Qingxiang Liu, Yeqiang Wang, Zhengyang Zhou, Yuxuan Liang, Yang Wang

Main category: cs.AI

TL;DR: Augur is an LLM-driven time series forecasting framework that uses causal reasoning to discover directed causal associations among covariates through a two-stage teacher-student architecture.

Details

Motivation: Existing LLM-based time series forecasting approaches have limitations including marginalized roles in model architectures, reliance on coarse statistical text prompts, and lack of interpretability.

Method: Two-stage teacher-student architecture: teacher LLM infers directed causal graph using heuristic search and pairwise causality testing; student agent refines graph and fine-tunes on high-confidence causal associations encoded as rich textual prompts.

Result: Extensive experiments on real-world datasets with 26 baselines show Augur achieves competitive performance and robust zero-shot generalization.

Conclusion: Augur improves predictive accuracy while providing transparent, traceable reasoning about variable interactions through LLM-driven causal discovery.

Abstract: Large language models (LLM) have emerged as a promising avenue for time series forecasting, offering the potential to integrate multimodal data. However, existing LLM-based approaches face notable limitations-such as marginalized role in model architectures, reliance on coarse statistical text prompts, and lack of interpretability. In this work, we introduce Augur, a fully LLM driven time series forecasting framework that exploits LLM causal reasoning to discover and use directed causal associations among covariates. Augur uses a two stage teacher student architecture where a powerful teacher LLM infers a directed causal graph from time series using heuristic search together with pairwise causality testing. A lightweight student agent then refines the graph and fine tune on high confidence causal associations that are encoded as rich textual prompts to perform forecasting. This design improves predictive accuracy while yielding transparent, traceable reasoning about variable interactions. Extensive experiments on real-world datasets with 26 baselines demonstrate that Augur achieves competitive performance and robust zero-shot generalization.

[329] Operationalizing Pluralistic Values in Large Language Model Alignment Reveals Trade-offs in Safety, Inclusivity, and Model Behavior

Dalia Ali, Dora Zhao, Allison Koenecke, Orestis Papakyriakopoulos

Main category: cs.AI

TL;DR: LLM alignment overlooks human social diversity. This study shows incorporating pluralistic values from diverse demographics and optimizing technical parameters (rating scales, disagreement handling, optimization methods) significantly affects model behavior and safety outcomes.

Details

Motivation: Current LLM alignment processes often ignore human social diversity, potentially leading to models that don't fairly represent diverse values and perspectives across different demographic groups.

Method: Collected alignment data from 1,095 US and German participants (27,375 ratings) across five dimensions. Fine-tuned LLMs using group-specific preferences while varying rating scales, disagreement handling methods, and optimization techniques.

Result: Systematic demographic effects: male participants rated responses 18% less toxic than females; conservative and Black participants rated emotional awareness 27.9% and 44% higher than liberal and White participants. Technical choices mattered: preserving disagreement achieved 53% greater toxicity reduction than majority voting; 5-point scales yielded 22% more reduction than binary formats; DPO outperformed GRPO.

Conclusion: Alignment should balance expert-driven and user-driven signals to ensure both safety and fair representation, with technical design choices significantly impacting outcomes.

Abstract: Although large language models (LLMs) are increasingly trained using human feedback for safety and alignment with human values, alignment decisions often overlook human social diversity. This study examines how incorporating pluralistic values affects LLM behavior by systematically evaluating demographic variation and design parameters in the alignment pipeline. We collect alignment data from US and German participants (N = 1,095 participants, 27,375 ratings) who rated LLM responses across five dimensions: Toxicity, Emotional Awareness (EA), Sensitivity, Stereotypical Bias, and Helpfulness. We fine-tuned multiple Large Language Models and Large Reasoning Models using preferences from different social groups while varying rating scales, disagreement handling methods, and optimization techniques. The results revealed systematic demographic effects: male participants rated responses 18% less toxic than female participants; conservative and Black participants rated responses 27.9% and 44% higher on EA than liberal and White participants, respectively. Models fine-tuned on group-specific preferences exhibited distinct behaviors. Technical design choices showed strong effects: the preservation of rater disagreement achieved roughly 53% greater toxicity reduction than majority voting, and 5-point scales yielded about 22% more reduction than binary formats; and Direct Preference Optimization (DPO) consistently outperformed Group Relative Policy Optimization (GRPO) in multi-value optimization. These findings represent a preliminary step in answering a critical question: How should alignment balance expert-driven and user-driven signals to ensure both safety and fair representation?

[330] Heterogeneous Multi-Agent Proximal Policy Optimization for Power Distribution System Restoration

Parya Dolatyabi, Ali Farajzadeh Bavil, Mahdi Khodayar

Main category: cs.AI

TL;DR: Heterogeneous-Agent Reinforcement Learning (HARL) with HAPPO enables coordinated power distribution system restoration across interconnected microgrids, outperforming traditional methods in convergence speed and restored power.

Details

Motivation: Conventional optimization and value-based RL approaches are computationally inefficient and difficult to scale for power distribution system restoration due to sequential switching operations, nonlinear constraints, and coordination of distributed energy resources.

Method: Uses Heterogeneous-Agent Proximal Policy Optimization (HAPPO) with decentralized actor policies trained with a centralized critic. Each agent controls a distinct microgrid with different loads, DER capacities, and switch counts. Physics-informed OpenDSS environment provides power flow feedback with differentiable penalty signals.

Result: HAPPO achieves faster convergence, higher restored power, and smoother multi-seed training than DQN, PPO, MAES, MAGDPG, MADQN, Mean-Field RL, and QMIX on IEEE 123-bus and IEEE 8500-node systems.

Conclusion: Incorporating microgrid-level heterogeneity within the HARL framework yields a scalable, stable, and constraint-aware solution for complex power distribution system restoration.

Abstract: Restoring power distribution systems (PDS) after large-scale outages requires sequential switching operations that reconfigure feeder topology and coordinate distributed energy resources (DERs) under nonlinear constraints such as power balance, voltage limits, and thermal ratings. These challenges make conventional optimization and value-based RL approaches computationally inefficient and difficult to scale. This paper applies a Heterogeneous-Agent Reinforcement Learning (HARL) framework, instantiated through Heterogeneous-Agent Proximal Policy Optimization (HAPPO), to enable coordinated restoration across interconnected microgrids. Each agent controls a distinct microgrid with different loads, DER capacities, and switch counts, introducing practical structural heterogeneity. Decentralized actor policies are trained with a centralized critic to compute advantage values for stable on-policy updates. A physics-informed OpenDSS environment provides full power flow feedback and enforces operational limits via differentiable penalty signals rather than invalid action masking. The total DER generation is capped at 2400 kW, and each microgrid must satisfy local supply-demand feasibility. Experiments on the IEEE 123-bus and IEEE 8500-node systems show that HAPPO achieves faster convergence, higher restored power, and smoother multi-seed training than DQN, PPO, MAES, MAGDPG, MADQN, Mean-Field RL, and QMIX. Results demonstrate that incorporating microgrid-level heterogeneity within the HARL framework yields a scalable, stable, and constraint-aware solution for complex PDS restoration.

[331] KRAL: Knowledge and Reasoning Augmented Learning for LLM-assisted Clinical Antimicrobial Therapy

Zhe Li, Yehan Qiu, Yujie Chen, Xiang Zhou

Main category: cs.AI

TL;DR: KRAL is a novel paradigm that enhances clinical LLMs by automatically distilling knowledge and reasoning from teacher models, using semi-supervised data augmentation and agentic reinforcement learning to improve medical knowledge and reasoning capabilities while reducing costs and preserving privacy.

Details

Motivation: Current LLMs face limitations in clinical decision-making due to knowledge gaps, privacy concerns, high deployment costs, and limited reasoning capabilities, making them unsuitable for high-stakes medical applications.

Method: KRAL uses teacher-model reasoning via answer-to-question reverse generation, heuristic learning for semi-supervised data augmentation (reducing manual annotation by ~80%), and agentic reinforcement learning to jointly enhance knowledge and reasoning while optimizing computational efficiency.

Result: KRAL significantly outperforms RAG and SFT methods: improves knowledge QA (Accuracy@1 on MEDQA by 1.8% vs. SFT, 3.6% vs. RAG) and reasoning (Pass@1 on PUMCH Antimicrobial by 27% vs. SFT, 27.2% vs. RAG) at ~20% of SFT’s training costs.

Conclusion: KRAL establishes an effective solution for enhancing local LLMs’ clinical diagnostic capabilities, enabling low-cost, high-safety deployment in complex medical decision support systems.

Abstract: Clinical antimicrobial therapy requires the dynamic integration of pathogen profiles,host factors, pharmacological properties of antimicrobials,and the severity of infection. This complexity imposes fundamental limitations on the applicability of Large Language Models (LLMs) in high-stakes clinical decision-making including knowledge gaps, data privacy concerns, high deployment costs, and limited reasoning capabilities. To address these challenges, we propose KRAL (Knowledge and Reasoning Augmented Learning), a low-cost, scalable, privacy-preserving paradigm that leverages teacher-model reasoning to automatically distill knowledge and reasoning trajectories via answer-to-question reverse generation, employs heuristic learning for semi-supervised data augmentation (reducing manual annotation requirements by approximately 80%), and utilizes agentic reinforcement learning to jointly enhance medical knowledge and reasoning while optimizing computational and memory efficiency. A hierarchical evaluation employing diverse teacher-model proxies reduces assessment costs, while modular interface design facilitates seamless system updates. Experimental results demonstrate that KRAL significantly outperforms traditional Retrieval-Augmented Generation (RAG) and Supervised Fine-Tuning (SFT) methods. It improves knowledge question-answering capability (Accuracy@1 on the external open-source benchmark MEDQA increased by 1.8% vs. SFT and 3.6% vs. RAG) and reasoning capability (Pass@1 on the external benchmark PUMCH Antimicrobial increased by 27% vs. SFT and 27.2% vs. RAG), achieved at about 20% of SFT’s long-term training costs. This establishes KRAL as an effective solution for enhancing local LLMs’ clinical diagnostic capabilities, enabling low-cost, high-safety deployment in complex medical decision support.

[332] Simulated Self-Assessment in Large Language Models: A Psychometric Approach to AI Self-Efficacy

Daniel I Jackson, Emma L Jensen, Syed-Amad Hussain, Emre Sezgin

Main category: cs.AI

TL;DR: LLMs show stable but inaccurate self-assessments using adapted psychological scales, with self-efficacy scores not reliably reflecting actual task performance across computational, social, and summarization tasks.

Details

Motivation: To evaluate LLMs' self-assessment capabilities using established psychological measures, moving beyond traditional accuracy-focused evaluations to understand their simulated self-efficacy.

Method: Adapted the 10-item General Self-Efficacy Scale (GSES) to test ten LLMs across four conditions: no task, computational reasoning, social reasoning, and summarization, with follow-up confidence prompts.

Result: Models showed stable GSES responses but significantly different self-efficacy levels across conditions, with aggregate scores lower than human norms. Self-assessment did not reliably reflect actual ability - high self-efficacy models sometimes performed poorly, and vice versa.

Conclusion: Psychometric prompting provides structured insight into LLM communication behavior but not calibrated performance estimates, revealing that higher self-efficacy corresponds to more assertive, anthropomorphic reasoning styles.

Abstract: Self-assessment is a key aspect of reliable intelligence, yet evaluations of large language models (LLMs) focus mainly on task accuracy. We adapted the 10-item General Self-Efficacy Scale (GSES) to elicit simulated self-assessments from ten LLMs across four conditions: no task, computational reasoning, social reasoning, and summarization. GSES responses were highly stable across repeated administrations and randomized item orders. However, models showed significantly different self-efficacy levels across conditions, with aggregate scores lower than human norms. All models achieved perfect accuracy on computational and social questions, whereas summarization performance varied widely. Self-assessment did not reliably reflect ability: several low-scoring models performed accurately, while some high-scoring models produced weaker summaries. Follow-up confidence prompts yielded modest, mostly downward revisions, suggesting mild overestimation in first-pass assessments. Qualitative analysis showed that higher self-efficacy corresponded to more assertive, anthropomorphic reasoning styles, whereas lower scores reflected cautious, de-anthropomorphized explanations. Psychometric prompting provides structured insight into LLM communication behavior but not calibrated performance estimates.

[333] Failure Modes in LLM Systems: A System-Level Taxonomy for Reliable AI Applications

Vaishali Vinay

Main category: cs.AI

TL;DR: A taxonomy of 15 hidden failure modes in real-world LLM applications, highlighting the gap between current evaluation practices and production reliability needs, with design principles for robust LLM systems.

Details

Motivation: LLMs are being rapidly deployed in production systems but their failure patterns differ fundamentally from traditional ML models, and current evaluation methods don't address system-level reliability issues.

Method: Developed a system-level taxonomy of 15 hidden failure modes through analysis of real-world LLM applications, examined evaluation gaps, and outlined design principles for reliable LLM systems.

Result: Identified critical failure modes including reasoning drift, latent inconsistency, context degradation, tool invocation errors, version drift, and cost-driven performance collapse that aren’t captured by current benchmarks.

Conclusion: LLM reliability should be treated as a system-engineering problem rather than purely model-centric, requiring new evaluation methodologies and design principles for robust, maintainable, and cost-aware LLM deployment.

Abstract: Large language models (LLMs) are being rapidly integrated into decision-support tools, automation workflows, and AI-enabled software systems. However, their behavior in production environments remains poorly understood, and their failure patterns differ fundamentally from those of traditional machine learning models. This paper presents a system-level taxonomy of fifteen hidden failure modes that arise in real-world LLM applications, including multi-step reasoning drift, latent inconsistency, context-boundary degradation, incorrect tool invocation, version drift, and cost-driven performance collapse. Using this taxonomy, we analyze the growing gap in evaluation and monitoring practices: existing benchmarks measure knowledge or reasoning but provide little insight into stability, reproducibility, drift, or workflow integration. We further examine the production challenges associated with deploying LLMs - including observability limitations, cost constraints, and update-induced regressions - and outline high-level design principles for building reliable, maintainable, and cost-aware LLM systems. Finally, we outline high-level design principles for building reliable, maintainable, and cost-aware LLM-based systems. By framing LLM reliability as a system-engineering problem rather than a purely model-centric one, this work provides an analytical foundation for future research on evaluation methodology, AI system robustness, and dependable LLM deployment.

[334] Universe of Thoughts: Enabling Creative Reasoning with Large Language Models

Yuto Suzuki, Farnoush Banaei-Kashani

Main category: cs.AI

TL;DR: The paper introduces a computational framework for creative reasoning with LLMs, proposing three paradigms (combinational, exploratory, transformative) and implementing them through the Universe of Thoughts (UoT) method, showing superior performance in creative problem-solving tasks.

Details

Motivation: Existing LLM reasoning methods focus on conventional problem-solving but lack creative reasoning capabilities needed for domains with expansive solution spaces like drug discovery and business strategy, where innovative solutions are crucial.

Method: Proposes a computational framework with three creative reasoning paradigms and implements them through Universe of Thoughts (UoT) - a novel method that systematically explores the universe of thoughts to generate creative solutions using LLMs.

Result: UoT demonstrates superior performance in creative reasoning compared to state-of-the-art reasoning techniques and commercial models, evaluated through three novel tasks assessing creativity from feasibility, utility, and novelty perspectives.

Conclusion: The Universe of Thoughts framework effectively addresses the gap in creative reasoning for LLMs, enabling systematic exploration of solution spaces to generate innovative solutions in domains requiring creative problem-solving.

Abstract: Reasoning based on Large Language Models (LLMs) has garnered increasing attention due to outstanding performance of these models in mathematical and complex logical tasks. Beginning with the Chain-of-Thought (CoT) prompting technique, numerous reasoning methods have emerged that decompose problems into smaller, sequential steps (or thoughts). However, existing reasoning models focus on conventional problem-solving and do not necessarily generate creative solutions by ``creative reasoning’’. In domains where the solution space is expansive and conventional solutions are suboptimal, such as drug discovery or business strategization, creative reasoning to discover innovative solutions is crucial. To address this gap, first we introduce a computational framework for creative reasoning inspired by established cognitive science principles. With this framework, we propose three core creative reasoning paradigms, namely, \textit{combinational}, \textit{exploratory}, and \textit{transformative} reasoning, where each offers specific directions for systematic exploration of the universe of thoughts to generate creative solutions. Next, to materialize this framework using LLMs, we introduce the \textit{Universe of Thoughts} (or \textit{UoT}, for short), a novel set of methods to implement the aforementioned three creative processes. Finally, we introduce three novel tasks that necessitate creative problem-solving, along with an evaluation benchmark to assess creativity from three orthogonal perspectives: feasibility as constraint, and utility and novelty as metrics. With a comparative analysis against the state-of-the-art (SOTA) reasoning techniques as well as representative commercial models with reasoning capability, we show that UoT demonstrates superior performance in creative reasoning.

[335] FRAGMENTA: End-to-end Fragmentation-based Generative Model with Agentic Tuning for Drug Lead Optimization

Yuto Suzuki, Paul Awolade, Daniel V. LaBarbera, Farnoush Banaei-Kashani

Main category: cs.AI

TL;DR: FRAGMENTA is an end-to-end framework for drug lead optimization that combines a novel generative model using dynamic Q-learning for fragmentation and generation with an agentic AI system for conversational feedback from domain experts, enabling autonomous tuning and outperforming traditional approaches in cancer drug discovery.

Details

Motivation: Molecule generation for drug discovery faces challenges with small class-specific datasets (often <100 examples), limited diversity from heuristic fragmentation methods, and slow collaboration between medicinal chemists and AI engineers during model tuning.

Method: 1) A generative model that reframes fragmentation as vocabulary selection using dynamic Q-learning to jointly optimize fragmentation and generation; 2) An agentic AI system that refines objectives via conversational feedback from domain experts, enabling autonomous tuning.

Result: In real-world cancer drug discovery experiments, FRAGMENTA’s Human-Agent configuration identified nearly twice as many high-scoring molecules as baselines. The fully autonomous Agent-Agent system outperformed traditional Human-Human tuning.

Conclusion: FRAGMENTA demonstrates the efficacy of agentic tuning in capturing expert intent and enables automated drug lead optimization, removing the need for AI engineers in the tuning loop while progressively learning domain knowledge.

Abstract: Molecule generation using generative AI is vital for drug discovery, yet class-specific datasets often contain fewer than 100 training examples. While fragment-based models handle limited data better than atom-based approaches, existing heuristic fragmentation limits diversity and misses key fragments. Additionally, model tuning typically requires slow, indirect collaboration between medicinal chemists and AI engineers. We introduce FRAGMENTA, an end-to-end framework for drug lead optimization comprising: 1) a novel generative model that reframes fragmentation as a “vocabulary selection” problem, using dynamic Q-learning to jointly optimize fragmentation and generation; and 2) an agentic AI system that refines objectives via conversational feedback from domain experts. This system removes the AI engineer from the loop and progressively learns domain knowledge to eventually automate tuning. In real-world cancer drug discovery experiments, FRAGMENTA’s Human-Agent configuration identified nearly twice as many high-scoring molecules as baselines. Furthermore, the fully autonomous Agent-Agent system outperformed traditional Human-Human tuning, demonstrating the efficacy of agentic tuning in capturing expert intent.

[336] PaTAS: A Parallel System for Trust Propagation in Neural Networks Using Subjective Logic

Koffi Ismael Ouattara, Ioannis Krontiris, Theo Dimitrakos, Dennis Eisermann, Frank Kargl

Main category: cs.AI

TL;DR: PaTAS is a framework for modeling trust in neural networks using Subjective Logic, providing interpretable trust estimates that complement accuracy and expose reliability gaps in adversarial or uncertain scenarios.

Details

Motivation: Conventional metrics like accuracy fail to capture uncertainty and reliability of model predictions, especially under adversarial or degraded conditions, creating a need for trustworthy AI systems in safety-critical applications.

Method: Uses Subjective Logic with Trust Nodes and Trust Functions that propagate input, parameter, and activation trust across networks in parallel with standard computation. Includes Parameter Trust Update mechanism and Inference-Path Trust Assessment method.

Result: Produces interpretable, symmetric, and convergent trust estimates that effectively distinguish between benign and adversarial inputs, and identify cases where model confidence diverges from actual reliability.

Conclusion: PaTAS provides a principled foundation for transparent and quantifiable trust reasoning within neural architectures, enabling reliable model evaluation across the AI lifecycle.

Abstract: Trustworthiness has become a key requirement for the deployment of artificial intelligence systems in safety-critical applications. Conventional evaluation metrics such as accuracy and precision fail to capture uncertainty or the reliability of model predictions, particularly under adversarial or degraded conditions. This paper introduces the Parallel Trust Assessment System (PaTAS), a framework for modeling and propagating trust in neural networks using Subjective Logic (SL). PaTAS operates in parallel with standard neural computation through Trust Nodes and Trust Functions that propagate input, parameter, and activation trust across the network. The framework defines a Parameter Trust Update mechanism to refine parameter reliability during training and an Inference-Path Trust Assessment (IPTA) method to compute instance-specific trust at inference. Experiments on real-world and adversarial datasets demonstrate that PaTAS produces interpretable, symmetric, and convergent trust estimates that complement accuracy and expose reliability gaps in poisoned, biased, or uncertain data scenarios. The results show that PaTAS effectively distinguishes between benign and adversarial inputs and identifies cases where model confidence diverges from actual reliability. By enabling transparent and quantifiable trust reasoning within neural architectures, PaTAS provides a principled foundation for evaluating model reliability across the AI lifecycle.

cs.SD

[337] Seeing Beyond Sound: Visualization and Abstraction in Audio Data Representation

Ashlae Blum’e

Main category: cs.SD

TL;DR: Adding dimensionality and interactivity to visualization tools improves audio signal processing workflows by aligning with human perception and modern needs.

Details

Motivation: Traditional software tools with hidden historical assumptions risk misalignment with modern workflows, while tools that match emergent needs enhance analytical and creative outputs.

Method: Explores adding dimensionality and interactivity to visualization tools using the Jellyfish Dynamite software for audio information research.

Result: Enhanced pattern recognition through visual representations that align with human perceptual systems.

Conclusion: Creating tools that align with emergent needs improves analytical and creative outputs due to increased user affinity.

Abstract: In audio signal processing, the interpretation of complex information using visual representation enhances pattern recognition through its alignment with human perceptual systems. Software tools that carry hidden assumptions inherited from their historical contexts risk misalignment with modern workflows as design origins become obscured. We argue that creating tools that align with emergent needs improves analytical and creative outputs due to an increased affinity for using them. This paper explores the potentials associated with adding dimensionality and interactivity into visualization tools to facilitate complex workflows in audio information research using the Jellyfish Dynamite software.

[338] Musical Score Understanding Benchmark: Evaluating Large Language Models’ Comprehension of Complete Musical Scores

Congren Dai, Yue Yang, Krinos Li, Huichi Zhou, Shijie Liang, Zhang Bo, Enyang Liu, Ge Jin, Hongran An, Haosen Zhang, Peiyuan Jing, KinHei Lee, Zhenxuan Zhang, Xiaobing Li, Maosong Sun

Main category: cs.SD

TL;DR: MSU-Bench is the first large-scale benchmark for evaluating musical score understanding across text (ABC notation) and visual (PDF) modalities, featuring 1,800 QA pairs from classical composers with four progressive comprehension levels.

Details

Motivation: Despite LLMs and VLMs' progress in natural language and multimodal tasks, their ability to comprehend musical notation remains underexplored, creating a need for standardized evaluation.

Method: Created MSU-Bench with 1,800 human-curated QA pairs from classical composers’ works, organized into four comprehension levels: Onset Information, Notation & Note, Chord & Harmony, and Texture & Form. Evaluated 15+ SOTA models in zero-shot and fine-tuned settings.

Result: Revealed sharp modality gaps, fragile level-wise success rates, and difficulty sustaining multilevel correctness. Fine-tuning significantly improved performance in both modalities while preserving general knowledge.

Conclusion: MSU-Bench establishes a rigorous foundation for future research at the intersection of AI, musicology, and multimodal reasoning, addressing the gap in musical notation understanding.

Abstract: Understanding complete musical scores requires reasoning over symbolic structures such as pitch, rhythm, harmony, and form. Despite the rapid progress of Large Language Models (LLMs) and Vision-Language Models (VLMs) in natural language and multimodal tasks, their ability to comprehend musical notation remains underexplored. We introduce Musical Score Understanding Benchmark (MSU-Bench), the first large-scale, human-curated benchmark for evaluating score-level musical understanding across both textual (ABC notation) and visual (PDF) modalities. MSU-Bench comprises 1,800 generative question-answer (QA) pairs drawn from works spanning Bach, Beethoven, Chopin, Debussy, and others, organised into four progressive levels of comprehension: Onset Information, Notation & Note, Chord & Harmony, and Texture & Form. Through extensive zero-shot and fine-tuned evaluations of over 15+ state-of-the-art (SOTA) models, we reveal sharp modality gaps, fragile level-wise success rates, and the difficulty of sustaining multilevel correctness. Fine-tuning markedly improves performance in both modalities while preserving general knowledge, establishing MSU-Bench as a rigorous foundation for future research at the intersection of Artificial Intelligence (AI), musicological, and multimodal reasoning.

[339] SingingSDS: A Singing-Capable Spoken Dialogue System for Conversational Roleplay Applications

Jionghao Han, Jiatong Shi, Masao Someki, Yuxun Tang, Lan Liu, Yiwen Zhao, Wenhao Feng, Shinji Watanabe

Main category: cs.SD

TL;DR: SingingSDS is a spoken dialogue system that responds through singing instead of speaking, creating more affective and memorable interactions for character-based roleplay and entertainment.

Details

Motivation: Most existing spoken dialogue systems are limited to conventional spoken responses, lacking the emotional and memorable qualities that singing can provide for interactive entertainment scenarios.

Method: Uses a modular ASR-LLM-SVS pipeline with configurable components including character personas, ASR/LLM backends, singing voice synthesis models, melody sources, and voice profiles to balance latency, quality, and musical style.

Result: Developed a plug-and-play web demo with modular open-source code that supports customization and extension, available as a public demo and GitHub repository.

Conclusion: SingingSDS successfully demonstrates the feasibility of singing-based dialogue systems that can enhance affective interaction in character-based roleplay and entertainment applications through its flexible and customizable architecture.

Abstract: With recent advances in automatic speech recognition (ASR), large language models (LLMs), and text-to-speech (TTS) technologies, spoken dialogue systems (SDS) have become widely accessible. However, most existing SDS are limited to conventional spoken responses. We present SingingSDS, a cascaded SDS that responds through singing rather than speaking, fostering more affective, memorable, and pleasurable interactions in character-based roleplay and interactive entertainment scenarios. SingingSDS employs a modular ASR-LLM-SVS pipeline and supports a wide range of configurations across character personas, ASR and LLM backends, SVS models, melody sources, and voice profiles, tailored to different needs in terms of latency, quality, and musical style. SingingSDS is available as a plug-and-play web demo, featuring modular, open-source code that supports customization and extension. Demo: https://huggingface.co/spaces/espnet/SingingSDS. Code: https://github.com/SingingSDS/SingingSDS.

[340] CartoonSing: Unifying Human and Nonhuman Timbres in Singing Generation

Jionghao Han, Jiatong Shi, Zhuoyan Tao, Yuxun Tang, Yiwen Zhao, Gus Xia, Shinji Watanabe

Main category: cs.SD

TL;DR: CartoonSing is a unified framework for non-human singing generation that bridges human and non-human singing synthesis/conversion using a two-stage pipeline with score representation encoding and timbre-aware vocoding.

Details

Motivation: Existing singing voice systems are limited to human timbres and cannot generate voices outside human range, which are increasingly needed in creative applications like video games, movies, and virtual characters.

Method: Two-stage pipeline: (1) score representation encoder trained with annotated human singing, (2) timbre-aware vocoder that reconstructs waveforms for both human and non-human audio, integrating SVS and SVC capabilities.

Result: CartoonSing successfully generates non-human singing voices, generalizes to novel timbres, and extends conventional SVS/SVC toward creative non-human singing generation.

Conclusion: The proposed framework effectively addresses the challenges of NHSG including data scarcity, lack of symbolic alignment, and timbral gap, enabling musically coherent non-human singing generation for creative applications.

Abstract: Singing voice synthesis (SVS) and singing voice conversion (SVC) have achieved remarkable progress in generating natural-sounding human singing. However, existing systems are restricted to human timbres and have limited ability to synthesize voices outside the human range, which are increasingly demanded in creative applications such as video games, movies, and virtual characters. We introduce Non-Human Singing Generation (NHSG), covering non-human singing voice synthesis (NHSVS) and non-human singing voice conversion (NHSVC), as a novel machine learning task for generating musically coherent singing with non-human timbral characteristics. NHSG is particularly challenging due to the scarcity of non-human singing data, the lack of symbolic alignment, and the wide timbral gap between human and non-human voices. To address these challenges, we propose CartoonSing, a unified framework that integrates singing voice synthesis and conversion while bridging human and non-human singing generation. CartoonSing employs a two-stage pipeline: a score representation encoder trained with annotated human singing and a timbre-aware vocoder that reconstructs waveforms for both human and non-human audio. Experiments demonstrate that CartoonSing successfully generates non-human singing voices, generalizes to novel timbres, and extends conventional SVS and SVC toward creative, non-human singing generation.

[341] Acoustic neural networks: Identifying design principles and exploring physical feasibility

Ivan Kalthoff, Marcel Rey, Raphael Wittkowski

Main category: cs.SD

TL;DR: A framework for designing acoustic neural networks that use sound waves for computation, enabling low-power analog computing through physically realizable acoustic components.

Details

Motivation: To develop energy-efficient analog computing systems using acoustic waveguides, particularly for environments where electronics are inefficient or limited, and to provide a systematic design approach for acoustic neural networks.

Method: Using a digital-twin approach to train neural networks under physical constraints (non-negative signals/weights, no bias terms, intensity-based nonlinearities), connecting learnable components to measurable acoustic properties, and proposing the SincHSRNN model combining acoustic bandpass filters with hierarchical temporal processing.

Result: Achieved up to 95% accuracy on AudioMNIST dataset for speech classification, with learned parameters corresponding to measurable material properties like attenuation and transmission, while maintaining compatibility with passive acoustic components.

Conclusion: Established general design principles for physically realizable acoustic neural networks, providing a pathway toward low-power, wave-based neural computing systems.

Abstract: Wave-guide-based physical systems provide a promising route toward energy-efficient analog computing beyond traditional electronics. Within this landscape, acoustic neural networks represent a promising approach for achieving low-power computation in environments where electronics are inefficient or limited, yet their systematic design has remained largely unexplored. Here we introduce a framework for designing and simulating acoustic neural networks, which perform computation through the propagation of sound waves. Using a digital-twin approach, we train conventional neural network architectures under physically motivated constraints including non-negative signals and weights, the absence of bias terms, and nonlinearities compatible with intensity-based, non-negative acoustic signals. Our work provides a general framework for acoustic neural networks that connects learnable network components directly to physically measurable acoustic properties, enabling the systematic design of realizable acoustic computing systems. We demonstrate that constrained recurrent and hierarchical architectures can perform accurate speech classification, and we propose the SincHSRNN, a hybrid model that combines learnable acoustic bandpass filters with hierarchical temporal processing. The SincHSRNN achieves up to 95% accuracy on the AudioMNIST dataset while remaining compatible with passive acoustic components. Beyond computational performance, the learned parameters correspond to measurable material and geometric properties such as attenuation and transmission. Our results establish general design principles for physically realizable acoustic neural networks and outline a pathway toward low-power, wave-based neural computing.

[342] Multi-Reward GRPO for Stable and Prosodic Single-Codebook TTS LLMs at Scale

Yicheng Zhong, Peiji Yang, Zhisheng Wang

Main category: cs.SD

TL;DR: Proposes multi-reward GRPO framework to optimize single-codebook TTS LLMs, addressing prosody instability and speaker drift through reinforcement learning with intelligibility, speaker similarity, and three rule-based rewards including LLM-annotated prosody alignment.

Details

Motivation: Single-codebook TTS LLMs are efficient but suffer from unstable prosody, speaker drift, and degraded naturalness despite their compact and streamable architecture.

Method: Multi-reward Group Relative Policy Optimization (GRPO) framework with intelligibility, speaker similarity, length penalty, entropy regularization, and LLM-annotated prosody alignment rewards. Uses external reasoning LLM for pause structure prediction via in-context learning.

Result: Consistently enhances prosodic stability, speaker similarity, and overall speech naturalness across different data sizes and model scales. Additional gains observed when attaching flow-matching decoder to GRPO-optimized backbone.

Conclusion: The proposed GRPO framework effectively optimizes single-codebook TTS LLMs, improving their performance and demonstrating scalability across various configurations.

Abstract: Recent advances in Large Language Models (LLMs) have transformed text-to-speech (TTS) synthesis, inspiring autoregressive frameworks that represent speech as sequences of discrete codec tokens. Among them, single-codebook TTS LLMs have emerged as compact and streamable architectures that jointly model semantic and acoustic integration. However, despite their efficiency, these models often exhibit unstable prosody, speaker drift, and degraded naturalness. To address these issues, we propose a multi-reward Group Relative Policy Optimization (GRPO) framework that directly optimizes the token generation policy of single-codebook TTS LLMs. Beyond standard intelligibility and speaker similarity objectives, our design integrates three rule-based rewards: a length penalty for duration consistency, an entropy regularization reward for decoding stability, and an LLM-annotated prosody alignment reward that explicitly supervises rhythm. In this prosody reward, an external reasoning LLM predicts multiple plausible pause structures via in-context learning, providing a human-preference-aligned supervisory signal for GRPO training. To assess universality, we further attach a flow-matching (FM) decoder on top of the GRPO-optimized AR backbone and observe consistent additional gains, indicating that our reinforcement optimization enhances the intrinsic AR policy. We further conduct a scalability analysis across data sizes and model scales, revealing that the proposed method consistently enhances prosodic stability, speaker similarity, and overall speech naturalness in single-codebook TTS LLMs.

[343] SONAR: Spectral-Contrastive Audio Residuals for Generalizable Deepfake Detection

Ido Nitzan HIdekel, Gal lifshitz, Khen Cohen, Dan Raviv

Main category: cs.SD

TL;DR: SONAR is a frequency-guided framework that addresses spectral bias in deepfake audio detection by explicitly disentangling audio signals into low-frequency and high-frequency residual representations, using contrastive learning to separate real and fake audio in the latent space.

Details

Motivation: Deepfake audio detectors struggle with generalization due to spectral bias, where neural networks prioritize low-frequency learning, causing them to miss high-frequency artifacts left by DF generators that could be crucial for detection.

Method: Uses an XLSR encoder for low-frequency content and a cloned path with learnable SRM and high-pass filters for high-frequency residuals. Employs frequency cross-attention and frequency-aware Jensen-Shannon contrastive loss to reunite representations and sharpen decision boundaries.

Result: Achieves state-of-the-art performance on ASVspoof 2021 and in-the-wild benchmarks, converges four times faster than strong baselines, and creates distinct manifolds for genuine and synthetic audio in the latent space.

Conclusion: SONAR provides an architecture-agnostic, frequency-guided contrastive framework that elevates high-frequency residuals as key learning signals, enabling more effective deepfake audio detection by exploiting subtle high-frequency cues.

Abstract: Deepfake (DF) audio detectors still struggle to generalize to out of distribution inputs. A central reason is spectral bias, the tendency of neural networks to learn low-frequency structure before high-frequency (HF) details, which both causes DF generators to leave HF artifacts and leaves those same artifacts under-exploited by common detectors. To address this gap, we propose Spectral-cONtrastive Audio Residuals (SONAR), a frequency-guided framework that explicitly disentangles an audio signal into complementary representations. An XLSR encoder captures the dominant low-frequency content, while the same cloned path, preceded by learnable SRM, value-constrained high-pass filters, distills faint HF residuals. Frequency cross-attention reunites the two views for long- and short-range frequency dependencies, and a frequency-aware Jensen-Shannon contrastive loss pulls real content-noise pairs together while pushing fake embeddings apart, accelerating optimization and sharpening decision boundaries. Evaluated on the ASVspoof 2021 and in-the-wild benchmarks, SONAR attains state-of-the-art performance and converges four times faster than strong baselines. By elevating faint high-frequency residuals to first-class learning signals, SONAR unveils a fully data-driven, frequency-guided contrastive framework that splits the latent space into two disjoint manifolds: natural-HF for genuine audio and distorted-HF for synthetic audio, thereby sharpening decision boundaries. Because the scheme operates purely at the representation level, it is architecture-agnostic and, in future work, can be seamlessly integrated into any model or modality where subtle high-frequency cues are decisive.

[344] Spike Encoding for Environmental Sound: A Comparative Benchmark

Andres Larroza, Javier Naranjo-Alcazar, Vicent Ortiz, Maximo Cobos, Pedro Zuccarello

Main category: cs.SD

TL;DR: Analysis of three spike encoding methods (TAE, SF, MW) for environmental sound processing in SNNs, showing TAE consistently outperforms others in reconstruction quality, energy efficiency, and classification performance.

Details

Motivation: SNNs offer energy-efficient edge processing, but conventional sensor data needs spike conversion. Environmental sound poses challenges with variable frequencies, noise, and overlapping events, while most spike encoding research focuses on speech.

Method: Analyzed three spike encoding methods (TAE, SF, MW) across three datasets (ESC10, UrbanSound8K, TAU Urban Acoustic Scenes) using multiband analysis and downstream classification with standard SNN.

Result: TAE consistently outperforms SF and MW in reconstruction quality per frequency band and per class, yields lowest spike firing rates (superior energy efficiency), and achieves best performance in environmental sound classification.

Conclusion: Provides foundational insights and comparative benchmark to guide spike encoder selection for neuromorphic environmental sound processing, with TAE emerging as the superior method.

Abstract: Spiking Neural Networks (SNNs) offer energy efficient processing suitable for edge applications, but conventional sensor data must first be converted into spike trains for neuromorphic processing. Environmental sound, including urban soundscapes, poses challenges due to variable frequencies, background noise, and overlapping acoustic events, while most spike based audio encoding research has focused on speech. This paper analyzes three spike encoding methods, Threshold Adaptive Encoding (TAE), Step Forward (SF), and Moving Window (MW) across three datasets: ESC10, UrbanSound8K, and TAU Urban Acoustic Scenes. Our multiband analysis shows that TAE consistently outperforms SF and MW in reconstruction quality, both per frequency band and per class across datasets. Moreover, TAE yields the lowest spike firing rates, indicating superior energy efficiency. For downstream environmental sound classification with a standard SNN, TAE also achieves the best performance among the compared encoders. Overall, this work provides foundational insights and a comparative benchmark to guide the selection of spike encoders for neuromorphic environmental sound processing.

[345] Generating Separated Singing Vocals Using a Diffusion Model Conditioned on Music Mixtures

Genís Plaja-Roglans, Yun-Ning Hung, Xavier Serra, Igor Pereira

Main category: cs.SD

TL;DR: This paper explores singing voice separation using diffusion models trained to generate solo vocals from music mixtures, achieving competitive performance with controllable quality-efficiency trade-offs.

Details

Motivation: To address the essential task of separating individual elements in musical mixtures for music analysis and practice, leveraging the flexibility and generalization capabilities of generative diffusion models for this complex task.

Method: Using a diffusion model trained to generate solo vocals conditioned on the corresponding music mixture, with iterative sampling that allows user control over quality-efficiency trade-offs.

Result: The approach improves upon prior generative systems and achieves competitive objective scores against non-generative baselines when trained with supplementary data.

Conclusion: Diffusion models provide an effective framework for singing voice separation with controllable sampling parameters and refinement capabilities, as demonstrated through an ablation study of the sampling algorithm.

Abstract: Separating the individual elements in a musical mixture is an essential process for music analysis and practice. While this is generally addressed using neural networks optimized to mask or transform the time-frequency representation of a mixture to extract the target sources, the flexibility and generalization capabilities of generative diffusion models are giving rise to a novel class of solutions for this complicated task. In this work, we explore singing voice separation from real music recordings using a diffusion model which is trained to generate the solo vocals conditioned on the corresponding mixture. Our approach improves upon prior generative systems and achieves competitive objective scores against non-generative baselines when trained with supplementary data. The iterative nature of diffusion sampling enables the user to control the quality-efficiency trade-off, and also refine the output when needed. We present an ablation study of the sampling algorithm, highlighting the effects of the user-configurable parameters.

[346] HarmonicAttack: An Adaptive Cross-Domain Audio Watermark Removal

Kexin Li, Xiao Hu, Ilya Grishchenko, David Lie

Main category: cs.SD

TL;DR: HarmonicAttack is an efficient audio watermark removal method that uses a dual-path convolutional autoencoder with GAN training to remove watermarks from AI-generated audio, outperforming previous methods while requiring only the ability to generate watermarks from targeted schemes.

Details

Motivation: To address security challenges from AI-generated audio misuse (misinformation, voice-cloning fraud) by studying effective watermark removal techniques to objectively evaluate watermark robustness, as previous methods either require impractical knowledge or are computationally expensive.

Method: Uses a dual-path convolutional autoencoder operating in temporal and frequency domains with GAN-style training to separate watermarks from original audio. Only requires the basic ability to generate watermarks from targeted schemes.

Result: Demonstrates greater watermark removal ability than previous methods against state-of-the-art schemes (AudioSeal, WavMark, Silentcipher) with near real-time performance. Successfully transfers to out-of-distribution samples with minimal performance degradation.

Conclusion: HarmonicAttack provides an efficient and practical watermark removal method that enables objective evaluation of audio watermark robustness, addressing limitations of previous approaches while maintaining strong performance across different scenarios.

Abstract: The availability of high-quality, AI-generated audio raises security challenges such as misinformation campaigns and voice-cloning fraud. A key defense against the misuse of AI-generated audio is by watermarking it, so that it can be easily distinguished from genuine audio. As those seeking to misuse AI-generated audio may thus seek to remove audio watermarks, studying effective watermark removal techniques is critical to being able to objectively evaluate the robustness of audio watermarks against removal. Previous watermark removal schemes either assume impractical knowledge of the watermarks they are designed to remove or are computationally expensive, potentially generating a false sense of confidence in current watermark schemes. We introduce HarmonicAttack, an efficient audio watermark removal method that only requires the basic ability to generate the watermarks from the targeted scheme and nothing else. With this, we are able to train a general watermark removal model that is able to remove the watermarks generated by the targeted scheme from any watermarked audio sample. HarmonicAttack employs a dual-path convolutional autoencoder that operates in both temporal and frequency domains, along with GAN-style training, to separate the watermark from the original audio. When evaluated against state-of-the-art watermark schemes AudioSeal, WavMark, and Silentcipher, HarmonicAttack demonstrates greater watermark removal ability than previous watermark removal methods with near real-time performance. Moreover, while HarmonicAttack requires training, we find that it is able to transfer to out-of-distribution samples with minimal degradation in performance.

[347] Harmonic-Percussive Disentangled Neural Audio Codec for Bandwidth Extension

Benoît Giniès, Xiaoyu Bie, Olivier Fercoq, Gaël Richard

Main category: cs.SD

TL;DR: Bandwidth extension framed as audio token prediction using transformer-based language model on discrete representations from a novel disentangled neural audio codec guided by Harmonic-Percussive decomposition.

Details

Motivation: To leverage recent advances in neural architectures for bandwidth extension by framing it as an audio token prediction problem and improve performance through better alignment between codec design and generative modeling.

Method: Train transformer-based language model on discrete representations from a novel disentangled neural audio codec that uses Harmonic-Percussive decomposition to guide disentanglement, with explicit design for downstream token prediction task.

Result: High-quality reconstructions of original signals as measured by both objective metrics and subjective evaluations.

Conclusion: Importance of aligning codec disentanglement and representation learning with generative modeling stage, demonstrating potential of global, representation-aware design for advancing bandwidth extension.

Abstract: Bandwidth extension, the task of reconstructing the high-frequency components of an audio signal from its low-pass counterpart, is a long-standing problem in audio processing. While traditional approaches have evolved alongside the broader trends in signal processing, recent advances in neural architectures have significantly improved performance across a wide range of audio tasks, In this work, we extend these advances by framing bandwidth extension as an audio token prediction problem. Specifically, we train a transformer-based language model on the discrete representations produced by a disentangled neural audio codec, where the disentanglement is guided by a Harmonic-Percussive decomposition of the input signals, highlighting spectral structures particularly relevant for bandwidth extension. Our approach introduces a novel codec design that explicitly accounts for the downstream token prediction task, enabling a more effective coupling between codec structure and transformer modeling. This joint design yields high-quality reconstructions of the original signal, as measured by both objective metrics and subjective evaluations. These results highlight the importance of aligning codec disentanglement and representation learning with the generative modeling stage, and demonstrate the potential of global, representation-aware design for advancing bandwidth extension.

cs.LG

[348] Prototype-Guided Non-Exemplar Continual Learning for Cross-subject EEG Decoding

Dan Li, Hye-Bin Shin, Yeon-Woo Choi

Main category: cs.LG

TL;DR: ProNECL is a continual learning framework for EEG decoding that uses class-level prototypes instead of storing historical data to prevent forgetting while maintaining privacy and memory efficiency.

Details

Motivation: EEG signals vary significantly across individuals, causing knowledge from previous subjects to be overwritten in continual learning. Existing methods require storing historical data which raises privacy concerns and memory constraints.

Method: Constructs class-level prototypes to summarize discriminative representations from each subject, then incrementally aligns new feature spaces with global prototype memory using cross-subject feature alignment and knowledge distillation.

Result: Validated on BCI Competition IV 2a and 2b datasets, effectively balances knowledge retention and adaptability, achieving superior performance in cross-subject continual EEG decoding tasks.

Conclusion: ProNECL framework successfully preserves prior knowledge without accessing historical EEG samples, addressing privacy and memory constraints while maintaining performance in continual EEG decoding.

Abstract: Due to the significant variability in electroencephalogram (EEG) signals across individuals, knowledge acquired from previous subjects is often overwritten as new subjects are introduced in continual EEG decoding task. Current works mainly rely on storing the historical data of seen subjects as a replay buffer to prevent forgetting. However, privacy concerns or memory constraints make keeping such data impractical. Instead, we propose a Prototype-guided Non-Exemplar Continual Learning (ProNECL)framework that preserves prior knowledge without accessing any historical EEG samples. ProNECL constructs class-level prototypes to summarize discriminative representations from each subject and incrementally aligns new feature spaces with the global prototype memory through cross-subject feature alignment and knowledge distillation. Validated on the BCI Competition IV 2a and 2b datasets, our framework effectively balances knowledge retention and adaptability, achieving superior performance in cross-subject continual EEG decoding tasks.

[349] On the Role of Hidden States of Modern Hopfield Network in Transformer

Tsubasa Masumura, Masato Taki

Main category: cs.LG

TL;DR: The paper establishes a generalized correspondence between modern Hopfield networks and Transformers by introducing hidden states from MHN into self-attention, creating modern Hopfield attention (MHA) that improves attention weight quality and addresses rank collapse/token uniformity issues in deep Transformers.

Details

Motivation: To go beyond the adiabatic approximation and investigate the deeper relationship between modern Hopfield networks and self-attention in Transformers, aiming to improve Transformer architecture using Hopfield network insights.

Method: Proposed modern Hopfield attention (MHA) by adding hidden states derived from modern Hopfield networks to self-attention, allowing inheritance of attention scores across Transformer layers.

Result: MHA significantly improves rank collapse and token uniformity problems in deep Transformers, and systematically enhances accuracy without adding training parameters in Vision Transformer and GPT models.

Conclusion: Hopfield networks provide a useful perspective for improving Transformer architecture, with MHA offering a generalized correspondence that enhances attention mechanisms and addresses key limitations of deep Transformers.

Abstract: Associative memory models based on Hopfield networks and self-attention based on key-value mechanisms have been popular approaches in the study of memory mechanisms in deep learning. It has been pointed out that the state update rule of the modern Hopfield network (MHN) in the adiabatic approximation is in agreement with the self-attention layer of Transformer. In this paper, we go beyond this approximation and investigate the relationship between MHN and self-attention. Our results show that the correspondence between Hopfield networks and Transformers can be established in a more generalized form by adding a new variable, the hidden state derived from the MHN, to self-attention. This new attention mechanism, modern Hopfield attention (MHA), allows the inheritance of attention scores from the input layer of the Transformer to the output layer, which greatly improves the nature of attention weights. In particular, we show both theoretically and empirically that MHA hidden states significantly improve serious problem of deep Transformers known as rank collapse and token uniformity. We also confirm that MHA can systematically improve accuracy without adding training parameters to the Vision Transformer or GPT. Our results provide a new case in which Hopfield networks can be a useful perspective for improving the Transformer architecture.

[350] Post-Pruning Accuracy Recovery via Data-Free Knowledge Distillation

Chinmay Tripurwar, Utkarsh Maurya, Dishant

Main category: cs.LG

TL;DR: Data-free knowledge distillation using DeepInversion to recover pruned model accuracy without accessing original training data.

Details

Motivation: Privacy regulations restrict access to original training data post-deployment, making traditional fine-tuning impossible for pruned models.

Method: Use DeepInversion to synthesize images from pre-trained teacher model via BN statistics inversion, then distill knowledge to pruned student network.

Result: Significantly recovers accuracy lost during pruning on CIFAR-10 across ResNet, MobileNet, VGG architectures without real data.

Conclusion: Proposed framework bridges model compression and data privacy, enabling effective pruning in privacy-sensitive domains.

Abstract: Model pruning is a widely adopted technique to reduce the computational complexity and memory footprint of Deep Neural Networks (DNNs). However, global unstructured pruning often leads to significant degradation in accuracy, typically necessitating fine-tuning on the original training dataset to recover performance. In privacy-sensitive domains such as healthcare or finance, access to the original training data is often restricted post-deployment due to regulations (e.g., GDPR, HIPAA). This paper proposes a Data-Free Knowledge Distillation framework to bridge the gap between model compression and data privacy. We utilize DeepInversion to synthesize privacy-preserving ``dream’’ images from the pre-trained teacher model by inverting Batch Normalization (BN) statistics. These synthetic images serve as a transfer set to distill knowledge from the original teacher to the pruned student network. Experimental results on CIFAR-10 across various architectures (ResNet, MobileNet, VGG) demonstrate that our method significantly recovers accuracy lost during pruning without accessing a single real data point.

[351] Pretraining Transformer-Based Models on Diffusion-Generated Synthetic Graphs for Alzheimer’s Disease Prediction

Abolfazl Moslemi, Hossein Peyvandi

Main category: cs.LG

TL;DR: Transformer-based framework combining diffusion synthetic data generation with graph representation learning for Alzheimer’s disease diagnosis, addressing data limitations and class imbalance.

Details

Motivation: Early and accurate AD detection is crucial but challenging due to limited labeled data, multi-site heterogeneity, and class imbalance in clinical datasets.

Method: Class-conditional DDPM generates balanced synthetic data; modality-specific Graph Transformers pretrained on synthetic data then frozen; neural classifier trained on original data embeddings.

Result: Outperforms baselines including MaGNet, achieving higher AUC, accuracy, sensitivity, and specificity under subject-wise cross-validation on NACC dataset.

Conclusion: Diffusion-based synthetic pretraining with Graph Transformers improves generalization in low-sample, imbalanced clinical prediction settings.

Abstract: Early and accurate detection of Alzheimer’s disease (AD) is crucial for enabling timely intervention and improving outcomes. However, developing reliable machine learning (ML) models for AD diagnosis is challenging due to limited labeled data, multi-site heterogeneity, and class imbalance. We propose a Transformer-based diagnostic framework that combines diffusion-based synthetic data generation with graph representation learning and transfer learning. A class-conditional denoising diffusion probabilistic model (DDPM) is trained on the real-world NACC dataset to generate a large synthetic cohort that mirrors multimodal clinical and neuroimaging feature distributions while balancing diagnostic classes. Modality-specific Graph Transformer encoders are first pretrained on this synthetic data to learn robust, class-discriminative representations and are then frozen while a neural classifier is trained on embeddings from the original NACC data. We quantify distributional alignment between real and synthetic cohorts using metrics such as Maximum Mean Discrepancy (MMD), Frechet distance, and energy distance, and complement discrimination metrics with calibration and fixed-specificity sensitivity analyses. Empirically, our framework outperforms standard baselines, including early and late fusion deep neural networks and the multimodal graph-based model MaGNet, yielding higher AUC, accuracy, sensitivity, and specificity under subject-wise cross-validation on NACC. These results show that diffusion-based synthetic pretraining with Graph Transformers can improve generalization in low-sample, imbalanced clinical prediction settings.

[352] Solving Diffusion Inverse Problems with Restart Posterior Sampling

Bilal Ahmed, Joseph G. Makin

Main category: cs.LG

TL;DR: RePS is an efficient framework for solving inverse problems using pre-trained diffusion models with restart-based sampling, avoiding expensive gradient backpropagation and working for both linear and non-linear measurement models.

Details

Motivation: Existing diffusion-based methods for inverse problems rely on strong posterior approximations, require expensive gradient backpropagation, or are limited to linear models, creating a need for more general and efficient approaches.

Method: Extends restart-based sampling to posterior inference using a conditioned ODE for any differentiable measurement model, with a simplified restart strategy that contracts approximation errors during sampling without backpropagation through the score network.

Result: RePS achieves faster convergence and superior reconstruction quality compared to existing diffusion-based baselines across various linear and non-linear inverse problems.

Conclusion: RePS provides a general and computationally efficient framework for solving inverse problems using diffusion models, overcoming limitations of previous methods while delivering improved performance.

Abstract: Inverse problems are fundamental to science and engineering, where the goal is to infer an underlying signal or state from incomplete or noisy measurements. Recent approaches employ diffusion models as powerful implicit priors for such problems, owing to their ability to capture complex data distributions. However, existing diffusion-based methods for inverse problems often rely on strong approximations of the posterior distribution, require computationally expensive gradient backpropagation through the score network, or are restricted to linear measurement models. In this work, we propose Restart for Posterior Sampling (RePS), a general and efficient framework for solving both linear and non-linear inverse problems using pre-trained diffusion models. RePS builds on the idea of restart-based sampling, previously shown to improve sample quality in unconditional diffusion, and extends it to posterior inference. Our method employs a conditioned ODE applicable to any differentiable measurement model and introduces a simplified restart strategy that contracts accumulated approximation errors during sampling. Unlike some of the prior approaches, RePS avoids backpropagation through the score network, substantially reducing computational cost. We demonstrate that RePS achieves faster convergence and superior reconstruction quality compared to existing diffusion-based baselines across a range of inverse problems, including both linear and non-linear settings.

[353] Active Slice Discovery in Large Language Models

Minhui Zhang, Prahar Ijner, Yoav Wald, Elliot Creager

Main category: cs.LG

TL;DR: Active Slice Discovery formalizes grouping model errors into systematic error slices using limited manual annotation, showing uncertainty-based active learning can achieve competitive accuracy with only 2-10% of slice membership information.

Details

Motivation: LLMs exhibit systematic errors on specific data subsets (error slices), but manual identification is challenging and annotation-intensive. The goal is to reduce annotation burden by actively grouping likely related errors.

Method: Formalized Active Slice Discovery approach using feature representations and active learning algorithms to group errors that likely belong to the same slice, with limited annotator verification of shared error patterns.

Result: Uncertainty-based active learning algorithms were most effective, achieving competitive accuracy using only 2-10% of available slice membership information, significantly outperforming baselines on toxicity classification slices.

Conclusion: Active Slice Discovery with uncertainty-based active learning provides an efficient approach to identify systematic error patterns in LLMs while minimizing annotation costs.

Abstract: Large Language Models (LLMs) often exhibit systematic errors on specific subsets of data, known as error slices. For instance, a slice can correspond to a certain demographic, where a model does poorly in identifying toxic comments regarding that demographic. Identifying error slices is crucial to understanding and improving models, but it is also challenging. An appealing approach to reduce the amount of manual annotation required is to actively group errors that are likely to belong to the same slice, while using limited access to an annotator to verify whether the chosen samples share the same pattern of model mistake. In this paper, we formalize this approach as Active Slice Discovery and explore it empirically on a problem of discovering human-defined slices in toxicity classification. We examine the efficacy of active slice discovery under different choices of feature representations and active learning algorithms. On several slices, we find that uncertainty-based active learning algorithms are most effective, achieving competitive accuracy using 2-10% of the available slice membership information, while significantly outperforming baselines.

[354] ST-PPO: Stabilized Off-Policy Proximal Policy Optimization for Multi-Turn Agents Training

Chenliang Li, Adel Elmahdy, Alex Boyd, Zhongruo Wang, Alfredo Garcia, Parminder Bhatia, Taha Kass-Hout, Cao Xiao, Mingyi Hong

Main category: cs.LG

TL;DR: PPO for multi-turn LLM training suffers from instability due to token-level importance sampling misalignment and inaccurate advantage estimates. The paper introduces turn-level importance sampling and clipping-bias correction to stabilize training.

Details

Motivation: PPO is widely used for training LLMs in multi-turn dialogue and reasoning tasks but suffers from performance instability and collapse, which limits its effectiveness.

Method: Proposes two stabilization techniques: (1) turn-level importance sampling to align with multi-turn environment structure, and (2) clipping-bias correction to normalize gradients by downweighting unreliable off-policy samples. Introduces three variants: Turn-PPO, S-PPO, and ST-PPO.

Result: Experiments on multi-turn search tasks across QA benchmarks show ST-PPO and S-PPO prevent performance collapses, maintain lower clipping ratios, and achieve higher task performance than standard token-level PPO.

Conclusion: Combining turn-level importance sampling with clipping-bias correction provides a practical and scalable solution for stabilizing multi-turn LLM agent training.

Abstract: PPO has been widely adopted for training large language models (LLMs) at the token level in multi-turn dialogue and reasoning tasks. However, its performance is often unstable and prone to collapse. Through empirical analysis, we identify two main sources of instability in this setting: (1)~token-level importance sampling, which is misaligned with the natural granularity of multi-turn environments that have distinct turn-level stages, and (2) inaccurate advantage estimates from off-policy samples, where the critic has not learned to evaluate certain state-action pairs, resulting in high-variance gradients and unstable updates. To address these challenges, we introduce two complementary stabilization techniques: (1) turn-level importance sampling, which aligns optimization with the natural structure of multi-turn reasoning, and (2) clipping-bias correction, which normalizes gradients by downweighting unreliable, highly off-policy samples. Depending on how these components are combined, we obtain three variants: Turn-PPO (turn-level sampling only), S-PPO (clipping-bias correction applied to token-level PPO), and ST-PPO (turn-level sampling combined with clipping-bias correction). In our experiments, we primarily study ST-PPO and S-PPO, which together demonstrate how the two stabilization mechanisms address complementary sources of instability. Experiments on multi-turn search tasks across general QA, multi-hop QA, and medical multiple-choice QA benchmarks show that ST-PPO and S-PPO consistently prevent the performance collapses observed in large-model training, maintain lower clipping ratios throughout optimization, and achieve higher task performance than standard token-level PPO. These results demonstrate that combining turn-level importance sampling with clipping-bias correction provides a practical and scalable solution for stabilizing multi-turn LLM agent training.

[355] Gradient Descent Algorithm Survey

Deng Fucheng, Wang Wanjie, Gong Ao, Wang Xiaoqi, Wang Fan

Main category: cs.LG

TL;DR: Systematic analysis of five major deep learning optimization algorithms (SGD, Mini-batch SGD, Momentum, Adam, Lion) focusing on their advantages, limitations, and practical configuration recommendations.

Details

Motivation: To address practical configuration needs of optimization algorithms in deep learning and provide standardized references for algorithm selection, parameter tuning, and performance improvement across different model scales and training scenarios.

Method: Systematic analysis of core advantages, limitations, and key practical recommendations for each of the five optimization algorithms.

Result: Comprehensive understanding of each algorithm’s characteristics and practical guidance for their application in academic research and engineering practice.

Conclusion: The analysis provides valuable reference for solving optimization challenges in deep learning by enabling reasonable algorithm selection and parameter configuration across various training scenarios.

Abstract: Focusing on the practical configuration needs of optimization algorithms in deep learning, this article concentrates on five major algorithms: SGD, Mini-batch SGD, Momentum, Adam, and Lion. It systematically analyzes the core advantages, limitations, and key practical recommendations of each algorithm. The research aims to gain an in-depth understanding of these algorithms and provide a standardized reference for the reasonable selection, parameter tuning, and performance improvement of optimization algorithms in both academic research and engineering practice, helping to solve optimization challenges in different scales of models and various training scenarios.

[356] Learning from Risk: LLM-Guided Generation of Safety-Critical Scenarios with Prior Knowledge

Yuhang Wang, Heye Huang, Zhenhua Xu, Kailai Sun, Baoshen Guo, Jinhua Zhao

Main category: cs.LG

TL;DR: A framework combining CVAE and LLM for generating high-fidelity driving scenarios that cover rare long-tail events and complex multi-agent interactions, enabling better safety validation of autonomous systems.

Details

Motivation: Address the scarcity of rare long-tail events and complex multi-agent interactions in real-world data, which are crucial for robust safety validation of autonomous driving systems.

Method: Integrates conditional variational autoencoder (CVAE) to learn latent traffic structures from historical data and generate physically consistent base scenarios, then uses LLM as adversarial reasoning engine to parse scene descriptions into domain-specific loss functions and guide scenario generation across risk levels.

Result: Substantially increases coverage of high-risk and long-tail events, improves consistency between simulated and real-world traffic distributions, and exposes autonomous systems to more challenging interactions than existing methods in CARLA and SMARTS simulations.

Conclusion: Establishes a new pathway for principled stress-testing of autonomous systems under rare but consequential events through knowledge-driven scenario generation that balances realism with controllability.

Abstract: Autonomous driving faces critical challenges in rare long-tail events and complex multi-agent interactions, which are scarce in real-world data yet essential for robust safety validation. This paper presents a high-fidelity scenario generation framework that integrates a conditional variational autoencoder (CVAE) with a large language model (LLM). The CVAE encodes historical trajectories and map information from large-scale naturalistic datasets to learn latent traffic structures, enabling the generation of physically consistent base scenarios. Building on this, the LLM acts as an adversarial reasoning engine, parsing unstructured scene descriptions into domain-specific loss functions and dynamically guiding scenario generation across varying risk levels. This knowledge-driven optimization balances realism with controllability, ensuring that generated scenarios remain both plausible and risk-sensitive. Extensive experiments in CARLA and SMARTS demonstrate that our framework substantially increases the coverage of high-risk and long-tail events, improves consistency between simulated and real-world traffic distributions, and exposes autonomous driving systems to interactions that are significantly more challenging than those produced by existing rule- or data-driven methods. These results establish a new pathway for safety validation, enabling principled stress-testing of autonomous systems under rare but consequential events.

[357] Spatio-Temporal Trajectory Foundation Model - Recent Advances and Future Directions

Sean Bin Yang, Ying Sun, Yunyao Cheng, Yan Lin, Kristian Torp, Jilin Hu

Main category: cs.LG

TL;DR: This tutorial provides a comprehensive overview of trajectory foundation models (TFMs), a crucial subclass of spatio-temporal foundation models, addressing the current lack of systematic investigation in this area.

Details

Motivation: Foundation models have shown success across scientific fields, and researchers are now exploring spatio-temporal foundation models (STFMs) to improve adaptability and generalization in spatio-temporal tasks. However, there's a significant gap in systematic investigation of trajectory foundation models (TFMs).

Method: The tutorial offers a comprehensive overview of recent advances in TFMs, including a taxonomy of existing methodologies and critical analysis of their strengths and limitations.

Result: The tutorial identifies that despite rapid progress in STFMs, systematic investigation of TFMs is largely lacking, highlighting the need for focused research in this specific subclass.

Conclusion: The tutorial outlines open challenges and promising research directions to advance spatio-temporal general intelligence through developing robust, responsible, and transferable trajectory foundation models.

Abstract: Foundation models (FMs) have emerged as a powerful paradigm, enabling a diverse range of data analytics and knowledge discovery tasks across scientific fields. Inspired by the success of FMs, particularly large language models, researchers have recently begun to explore spatio-temporal foundation models (STFMs) to improve adaptability and generalization across a wide spectrum of spatio-temporal (ST) tasks. Despite rapid progress, a systematic investigation of trajectory foundation models (TFMs), a crucial subclass of STFMs, is largely lacking. This tutorial addresses this gap by offering a comprehensive overview of recent advances in TFMs, including a taxonomy of existing methodologies and a critical analysis of their strengths and limitations. In addition, the tutorial highlights open challenges and outlines promising research directions to advance spatio-temporal general intelligence through the development of robust, responsible, and transferable TFMs.

[358] CHiQPM: Calibrated Hierarchical Interpretable Image Classification

Thomas Norrenbrock, Timo Kaiser, Sovan Biswas, Neslihan Kose, Ramesh Manuvinakurike, Bodo Rosenhahn

Main category: cs.LG

TL;DR: CHiQPM is a globally interpretable model that provides comprehensive global and local explanations while maintaining 99% accuracy of non-interpretable models, with built-in conformal prediction.

Details

Motivation: To enable trustworthy AI in safety-critical domains by providing both global interpretability and detailed local explanations to support human experts during inference.

Method: Calibrated Hierarchical QPM (CHiQPM) that offers contrastive explanations for majority classes, hierarchical explanations similar to human reasoning, and built-in interpretable conformal prediction.

Result: Achieves state-of-the-art accuracy as point predictor (99% of non-interpretable models), competitively efficient calibrated set prediction, and provides interpretable hierarchical predictions.

Conclusion: CHiQPM demonstrates that interpretability can be incorporated without sacrificing accuracy, paving the way for human-AI complementarity in safety-critical applications.

Abstract: Globally interpretable models are a promising approach for trustworthy AI in safety-critical domains. Alongside global explanations, detailed local explanations are a crucial complement to effectively support human experts during inference. This work proposes the Calibrated Hierarchical QPM (CHiQPM) which offers uniquely comprehensive global and local interpretability, paving the way for human-AI complementarity. CHiQPM achieves superior global interpretability by contrastively explaining the majority of classes and offers novel hierarchical explanations that are more similar to how humans reason and can be traversed to offer a built-in interpretable Conformal prediction (CP) method. Our comprehensive evaluation shows that CHiQPM achieves state-of-the-art accuracy as a point predictor, maintaining 99% accuracy of non-interpretable models. This demonstrates a substantial improvement, where interpretability is incorporated without sacrificing overall accuracy. Furthermore, its calibrated set prediction is competitively efficient to other CP methods, while providing interpretable predictions of coherent sets along its hierarchical explanation.

[359] Physics Steering: Causal Control of Cross-Domain Concepts in a Physics Foundation Model

Rio Alexa Fear, Payel Mukhopadhyay, Michael McCabe, Alberto Bietti, Miles Cranmer

Main category: cs.LG

TL;DR: Physics foundation models develop internal representations of abstract physical concepts that can be manipulated to causally control model behavior and steer predictions.

Details

Motivation: To investigate whether mechanistic interpretability findings from language models extend to scientific foundation models, specifically whether they develop human-understandable abstract representations of physical principles.

Method: Extracted activation vectors from a physics foundation model during forward passes over different physical regimes, computed “delta” representations between regimes as concept directions, and injected these directions back during inference to steer predictions.

Result: Successfully demonstrated causal control over physical behaviors by inducing or removing specific physical features from simulations through concept direction manipulation.

Conclusion: Scientific foundation models learn generalized representations of physical principles rather than relying on superficial correlations, opening new avenues for understanding and controlling AI-enabled scientific discovery.

Abstract: Recent advances in mechanistic interpretability have revealed that large language models (LLMs) develop internal representations corresponding not only to concrete entities but also distinct, human-understandable abstract concepts and behaviour. Moreover, these hidden features can be directly manipulated to steer model behaviour. However, it remains an open question whether this phenomenon is unique to models trained on inherently structured data (ie. language, images) or if it is a general property of foundation models. In this work, we investigate the internal representations of a large physics-focused foundation model. Inspired by recent work identifying single directions in activation space for complex behaviours in LLMs, we extract activation vectors from the model during forward passes over simulation datasets for different physical regimes. We then compute “delta” representations between the two regimes. These delta tensors act as concept directions in activation space, encoding specific physical features. By injecting these concept directions back into the model during inference, we can steer its predictions, demonstrating causal control over physical behaviours, such as inducing or removing some particular physical feature from a simulation. These results suggest that scientific foundation models learn generalised representations of physical principles. They do not merely rely on superficial correlations and patterns in the simulations. Our findings open new avenues for understanding and controlling scientific foundation models and has implications for AI-enabled scientific discovery.

[360] Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction

Yusong Wu, Stephen Brade, Teng Ma, Tia-Jane Fowler, Enning Yang, Berker Banar, Aaron Courville, Natasha Jaques, Cheng-Zhi Anna Huang

Main category: cs.LG

TL;DR: Proposes adversarial training method to prevent reward hacking in RL post-training for live musical jamming, improving diversity while maintaining harmonic coherence.

Details

Motivation: Live jamming requires real-time coordination and diversity, but RL post-training often reduces output diversity through reward hacking, which is especially harmful for musical creativity.

Method: Adversarial training with co-evolving discriminator that separates policy trajectories from data distribution, while policy maximizes discriminator output plus coherence rewards to prevent collapse.

Result: Improved output diversity, harmonic coherence, adaptation speed and user agency in both simulation and user studies with expert musicians.

Conclusion: Demonstrates effective method to mitigate reward hacking in RL post-training of generative sequence models for real-time interactive applications.

Abstract: Most applications of generative AI involve a sequential interaction in which a person inputs a prompt and waits for a response, and where reaction time and adaptivity are not important factors. In contrast, live jamming is a collaborative interaction that requires real-time coordination and adaptation without access to the other player’s future moves, while preserving diversity to sustain a creative flow. Reinforcement learning post-training enables effective adaptation through on-policy interaction, yet it often reduces output diversity by exploiting coherence-based rewards. This collapse, known as ``reward hacking’’, affects many RL post-training pipelines, but is especially harmful in live jamming, where musical creativity relies on dynamic variation and mutual responsiveness. In this paper, we propose a novel adversarial training method on policy-generated trajectories to mitigate reward hacking in RL post-training for melody-to-chord accompaniment. A co-evolving discriminator separates policy trajectories from the data distribution, while the policy maximizes the discriminator output in addition to coherence rewards to prevent collapse to trivial outputs. We evaluate accompaniment quality and output diversity in simulation with both fixed test melodies and learned melody agents, and we conduct a user study with the model deployed in a real-time interactive system with expert musicians. Quantitative evaluation and user feedback demonstrate improved output diversity, harmonic coherence, adaptation speed and user agency. Our results demonstrate a simple yet effective method to mitigate reward hacking in RL post-training of generative sequence models.

[361] Conformal Safety Monitoring for Flight Testing: A Case Study in Data-Driven Safety Learning

Aaron O. Feldman, D. Isaiah Harp, Joseph Duncan, Mac Schwager

Main category: cs.LG

TL;DR: A data-driven approach for runtime safety monitoring in flight testing that uses offline stochastic trajectory simulation to learn calibrated statistical models of short-term safety risk, enabling pilots to abort maneuvers before safety violations occur.

Details

Motivation: Flight testing involves pilots performing maneuvers on aircraft with uncertain parameters, which can lead to unexpected safety violations. Pilots need clear, preemptive criteria to abort maneuvers before safety violations occur.

Method: Three-component approach: 1) Model to predict future state from recent observations, 2) Nearest neighbor model to classify safety of predicted state, 3) Classifier calibration via conformal prediction using offline stochastic trajectory simulation.

Result: The method reliably identifies unsafe scenarios, matches theoretical guarantees, and outperforms baseline approaches in preemptive classification of risk when evaluated on a flight dynamics model with uncertain parameters.

Conclusion: The approach provides an effective data-driven solution for runtime safety monitoring in flight testing that enables preemptive risk assessment and decision-making under uncertainty.

Abstract: We develop a data-driven approach for runtime safety monitoring in flight testing, where pilots perform maneuvers on aircraft with uncertain parameters. Because safety violations can arise unexpectedly as a result of these uncertainties, pilots need clear, preemptive criteria to abort the maneuver in advance of safety violation. To solve this problem, we use offline stochastic trajectory simulation to learn a calibrated statistical model of the short-term safety risk facing pilots. We use flight testing as a motivating example for data-driven learning/monitoring of safety due to its inherent safety risk, uncertainty, and human-interaction. However, our approach consists of three broadly-applicable components: a model to predict future state from recent observations, a nearest neighbor model to classify the safety of the predicted state, and classifier calibration via conformal prediction. We evaluate our method on a flight dynamics model with uncertain parameters, demonstrating its ability to reliably identify unsafe scenarios, match theoretical guarantees, and outperform baseline approaches in preemptive classification of risk.

[362] Effects of Initialization Biases on Deep Neural Network Training Dynamics

Nicholas Pellegrino, David Szczecina, Paul W. Fieguth

Main category: cs.LG

TL;DR: Untrained neural networks exhibit Initial Guessing Bias, favoring few classes after random initialization, which affects early training dynamics and interacts with loss function choices.

Details

Motivation: To understand how Initial Guessing Bias in untrained neural networks affects early training dynamics and how different loss functions interact with this bias.

Method: Analysis of how loss functions (including Blurry and Piecewise-zero loss) interact with Initial Guessing Bias during early training phases of neural networks.

Result: Loss function choice dramatically affects early phase training, with some loss functions becoming unable to steer training direction when exposed to Initial Guessing Bias.

Conclusion: Careful consideration of how Initial Guessing Bias interacts with training components is necessary, as loss function choice significantly impacts early training dynamics.

Abstract: Untrained large neural networks, just after random initialization, tend to favour a small subset of classes, assigning high predicted probabilities to these few classes and approximately zero probability to all others. This bias, termed Initial Guessing Bias, affects the early training dynamics, when the model is fitting to the coarse structure of the data. The choice of loss function against which to train the model has a large impact on how these early dynamics play out. Two recent loss functions, Blurry and Piecewise-zero loss, were designed for robustness to label errors but can become unable to steer the direction of training when exposed to this initial bias. Results indicate that the choice of loss function has a dramatic effect on the early phase training of networks, and highlights the need for careful consideration of how Initial Guessing Bias may interact with various components of the training scheme.

[363] Autoregressive Surrogate Modeling of the Solar Wind with Spherical Fourier Neural Operator

Reza Mansouri, Dustin Kempton, Pete Riley, Rafal Angryk

Main category: cs.LG

TL;DR: First autoregressive machine learning surrogate for solar wind velocity using Spherical Fourier Neural Operator (SFNO) that iteratively propagates solutions outward, outperforming traditional MHD models and HUX surrogate.

Details

Motivation: Traditional 3D magnetohydrodynamic models for solar wind prediction are computationally expensive, limiting rapid exploration of boundary condition uncertainties for space weather forecasting.

Method: Uses Spherical Fourier Neural Operator (SFNO) with autoregressive approach - predicts limited radial range and iteratively propagates solution outward to improve accuracy in distant regions.

Result: SFNO demonstrates superior or comparable performance to numerical HUX surrogate while providing flexible, trainable, data-driven alternative for solar wind modeling.

Conclusion: Establishes novel methodology for high-fidelity solar wind modeling with autoregressive machine learning approach that overcomes computational limitations of traditional models.

Abstract: The solar wind, a continuous outflow of charged particles from the Sun’s corona, shapes the heliosphere and impacts space systems near Earth. Accurate prediction of features such as high-speed streams and coronal mass ejections is critical for space weather forecasting, but traditional three-dimensional magnetohydrodynamic (MHD) models are computationally expensive, limiting rapid exploration of boundary condition uncertainties. We introduce the first autoregressive machine learning surrogate for steady-state solar wind radial velocity using the Spherical Fourier Neural Operator (SFNO). By predicting a limited radial range and iteratively propagating the solution outward, the model improves accuracy in distant regions compared to a single-step approach. Compared with the numerical HUX surrogate, SFNO demonstrates superior or comparable performance while providing a flexible, trainable, and data-driven alternative, establishing a novel methodology for high-fidelity solar wind modeling. The source code and additional visual results are available at https://github.com/rezmansouri/solarwind-sfno-velocity-autoregressive.

[364] Multi-Agent Cross-Entropy Method with Monotonic Nonlinear Critic Decomposition

Yan Wang, Ke Deng, Yongli Ren

Main category: cs.LG

TL;DR: Proposes MCEM with monotonic nonlinear critic decomposition to overcome centralized-decentralized mismatch in multi-agent RL, enabling nonlinear value decomposition without centralized gradients.

Details

Motivation: Address the trade-off between expressiveness and centralized-decentralized mismatch in cooperative multi-agent RL, where linear value decomposition limits representation while nonlinear approaches reintroduce CDM.

Method: Multi-agent cross-entropy method (MCEM) updates policies by increasing probability of high-value joint actions, combined with monotonic nonlinear critic decomposition and off-policy learning with modified k-step return and Retrace.

Result: MCEM outperforms state-of-the-art methods across both continuous and discrete action benchmarks in experiments.

Conclusion: MCEM with nonlinear critic decomposition effectively overcomes the centralized-decentralized mismatch trade-off in cooperative multi-agent reinforcement learning.

Abstract: Cooperative multi-agent reinforcement learning (MARL) commonly adopts centralized training with decentralized execution (CTDE), where centralized critics leverage global information to guide decentralized actors. However, centralized-decentralized mismatch (CDM) arises when the suboptimal behavior of one agent degrades others’ learning. Prior approaches mitigate CDM through value decomposition, but linear decompositions allow per-agent gradients at the cost of limited expressiveness, while nonlinear decompositions improve representation but require centralized gradients, reintroducing CDM. To overcome this trade-off, we propose the multi-agent cross-entropy method (MCEM), combined with monotonic nonlinear critic decomposition (NCD). MCEM updates policies by increasing the probability of high-value joint actions, thereby excluding suboptimal behaviors. For sample efficiency, we extend off-policy learning with a modified k-step return and Retrace. Analysis and experiments demonstrate that MCEM outperforms state-of-the-art methods across both continuous and discrete action benchmarks.

[365] Primal: A Unified Deterministic Framework for Quasi-Orthogonal Hashing and Manifold Learning

Vladimer Khasia

Main category: cs.LG

TL;DR: Primal is a deterministic feature mapping framework using prime square roots to create robust vector representations with tunable properties, offering both static temporal encodings and dynamic input-dependent projections.

Details

Motivation: To create more robust and mathematically rigorous alternatives to stochastic projection methods like Random Fourier Features, leveraging number-theoretic properties for better control and performance.

Method: Uses prime square roots’ number-theoretic independence and Besicovitch property to create irrational frequency modulations. Offers two variants: StaticPrime for temporal position encodings and DynamicPrime with tunable scaling parameter σ that enables different mathematical utilities.

Result: Superior orthogonality retention and distribution tightness compared to normalized Gaussian baselines, with the ability to function as both isometric kernel maps (low-frequency) and maximum-entropy one-way hashes (high-frequency).

Conclusion: Primal provides a computationally efficient, mathematically rigorous deterministic alternative to random matrix projections, unifying disparate mathematical utilities through a single tunable parameter.

Abstract: We present Primal, a deterministic feature mapping framework that harnesses the number-theoretic independence of prime square roots to construct robust, tunable vector representations. Diverging from standard stochastic projections (e.g., Random Fourier Features), our method exploits the Besicovitch property to create irrational frequency modulations that guarantee infinite non-repeating phase trajectories. We formalize two distinct algorithmic variants: (1) StaticPrime, a sequence generation method that produces temporal position encodings empirically approaching the theoretical Welch bound for quasi-orthogonality; and (2) DynamicPrime, a tunable projection layer for input-dependent feature mapping. A central novelty of the dynamic framework is its ability to unify two disparate mathematical utility classes through a single scaling parameter σ. In the low-frequency regime, the method acts as an isometric kernel map, effectively linearizing non-convex geometries (e.g., spirals) to enable high-fidelity signal reconstruction and compressive sensing. Conversely, the high-frequency regime induces chaotic phase wrapping, transforming the projection into a maximum-entropy one-way hash suitable for Hyperdimensional Computing and privacy-preserving Split Learning. Empirical evaluations demonstrate that our framework yields superior orthogonality retention and distribution tightness compared to normalized Gaussian baselines, establishing it as a computationally efficient, mathematically rigorous alternative to random matrix projections. The code is available at https://github.com/VladimerKhasia/primal

[366] Pre-train to Gain: Robust Learning Without Clean Labels

David Szczecina, Nicholas Pellegrino, Paul Fieguth

Main category: cs.LG

TL;DR: Self-supervised pre-training (SimCLR, Barlow Twins) improves noise robustness in deep networks by enabling better feature learning before supervised training on noisy datasets, outperforming ImageNet pre-trained models under high noise.

Details

Motivation: Training deep networks with noisy labels causes poor generalization due to overfitting to label noise. Existing methods require clean data subsets, which may not be available.

Method: Pre-train feature extractor backbone using self-supervised learning (SSL) without labels, followed by standard supervised training on the noisy dataset. Evaluated SimCLR and Barlow Twins on CIFAR-10/100 with synthetic and real-world noise.

Result: SSL pre-training consistently improves classification accuracy and label-error detection across all noise rates. Performance gap widens with increasing noise, achieving comparable results to ImageNet pre-training at low noise and substantially outperforming under high noise.

Conclusion: Self-supervised pre-training provides effective noise robustness without requiring clean labeled data, making it a practical solution for learning with noisy labels.

Abstract: Training deep networks with noisy labels leads to poor generalization and degraded accuracy due to overfitting to label noise. Existing approaches for learning with noisy labels often rely on the availability of a clean subset of data. By pre-training a feature extractor backbone without labels using self-supervised learning (SSL), followed by standard supervised training on the noisy dataset, we can train a more noise robust model without requiring a subset with clean labels. We evaluate the use of SimCLR and Barlow~Twins as SSL methods on CIFAR-10 and CIFAR-100 under synthetic and real world noise. Across all noise rates, self-supervised pre-training consistently improves classification accuracy and enhances downstream label-error detection (F1 and Balanced Accuracy). The performance gap widens as the noise rate increases, demonstrating improved robustness. Notably, our approach achieves comparable results to ImageNet pre-trained models at low noise levels, while substantially outperforming them under high noise conditions.

[367] Selecting Belief-State Approximations in Simulators with Latent States

Nan Jiang

Main category: cs.LG

TL;DR: This paper addresses the problem of state resetting in simulators with latent variables, showing it reduces to conditional distribution selection and developing algorithms for belief-state sampling under sampling-only access.

Details

Motivation: State resetting is fundamental for planning and calibration but challenging in simulators with latent variables, requiring sampling from belief states. The paper aims to provide methods for selecting among approximate belief-state samplers when only sampling access is available.

Method: The paper develops a new algorithm and analysis for belief-state selection under sampling-only access. It presents two formulations: latent state-based selection (targeting latent state distribution) and observation-based selection (targeting observation distribution), and analyzes their interaction with different roll-out methods (Single-Reset vs Repeated-Reset).

Result: The paper reveals that observation-based selection may fail under Single-Reset roll-out but enjoys guarantees under Repeated-Reset roll-out. It provides theoretical analysis of distribution shift and sampling policy choices, uncovering a rich landscape of algorithmic options.

Conclusion: The seemingly simple problem of state resetting in simulators with latent variables involves complex theoretical nuances, multiple algorithmic choices, and open questions regarding the interaction between belief-state selection methods and roll-out strategies.

Abstract: State resetting is a fundamental but often overlooked capability of simulators. It supports sample-based planning by allowing resets to previously encountered simulation states, and enables calibration of simulators using real data by resetting to states observed in real-system traces. While often taken for granted, state resetting in complex simulators can be nontrivial: when the simulator comes with latent variables (states), state resetting requires sampling from the posterior over the latent state given the observable history, a.k.a. the belief state (Silver and Veness, 2010). While exact sampling is often infeasible, many approximate belief-state samplers can be constructed, raising the question of how to select among them using only sampling access to the simulator. In this paper, we show that this problem reduces to a general conditional distribution-selection task and develop a new algorithm and analysis under sampling-only access. Building on this reduction, the belief-state selection problem admits two different formulations: latent state-based selection, which directly targets the conditional distribution of the latent state, and observation-based selection, which targets the induced distribution over the observation. Interestingly, these formulations differ in how their guarantees interact with the downstream roll-out methods: perhaps surprisingly, observation-based selection may fail under the most natural roll-out method (which we call Single-Reset) but enjoys guarantees under the less conventional alternative (which we call Repeated-Reset). Together with discussion on issues such as distribution shift and the choice of sampling policies, our paper reveals a rich landscape of algorithmic choices, theoretical nuances, and open questions, in this seemingly simple problem.

[368] Representation Integrity in Temporal Graph Learning Methods

Elahe Kooshafar

Main category: cs.LG

TL;DR: The paper introduces representation integrity as a framework to evaluate dynamic graph embeddings, proposing metrics that measure how well embedding changes reflect actual graph changes, and validates these metrics through synthetic scenarios and empirical studies.

Details

Motivation: Current benchmarks for dynamic graph learners focus on task-specific scores but don't assess whether embeddings truthfully reflect the evolving network structure, creating a need for more interpretable evaluation methods.

Method: The authors formalize representation integrity, derive a family of indexes to measure embedding-graph alignment, test 42 candidate indexes on three synthetic scenarios (Gradual Merge, Abrupt Move, Periodic Re-wiring), and validate one recommended index through theoretical and empirical analysis.

Result: The validated metric consistently ranks provably stable models (UASE and IPP) highest, reveals scenario-specific strengths of neural methods, and shows strong positive correlation with link-prediction AUC, providing task-agnostic evaluation of embedding quality.

Conclusion: The representation integrity framework offers an interpretable, task-agnostic tool for evaluating dynamic graph representation quality, providing explicit guidance for model selection and future architecture design.

Abstract: Real-world systems ranging from airline routes to cryptocurrency transfers are naturally modelled as dynamic graphs whose topology changes over time. Conventional benchmarks judge dynamic-graph learners by a handful of task-specific scores, yet seldom ask whether the embeddings themselves remain a truthful, interpretable reflection of the evolving network. We formalize this requirement as representation integrity and derive a family of indexes that measure how closely embedding changes follow graph changes. Three synthetic scenarios, Gradual Merge, Abrupt Move, and Periodic Re-wiring, are used to screen forty-two candidate indexes. Based on which we recommend one index that passes all of our theoretical and empirical tests. In particular, this validated metric consistently ranks the provably stable UASE and IPP models highest. We then use this index to do a comparative study on representation integrity of common dynamic graph learning models. This study exposes the scenario-specific strengths of neural methods, and shows a strong positive rank correlation with one-step link-prediction AUC. The proposed integrity framework, therefore, offers a task-agnostic and interpretable evaluation tool for dynamic-graph representation quality, providing more explicit guidance for model selection and future architecture design.

[369] Probabilistic Hash Embeddings for Online Learning of Categorical Features

Aodong Li, Abishek Sankararaman, Balakrishnan Narayanaswamy

Main category: cs.LG

TL;DR: Proposes probabilistic hash embeddings (PHE) for online learning with evolving categorical vocabularies, addressing order sensitivity and forgetting issues in deterministic embeddings.

Details

Motivation: Existing feature hashing methods work well in offline settings but suffer from order sensitivity and forgetting in online learning when categorical vocabularies change and grow unboundedly over time.

Method: Probabilistic hash embedding (PHE) model that treats hash embeddings as stochastic and applies Bayesian online learning with scalable inference algorithm for incremental learning.

Result: PHE achieves superior performance in classification, sequence modeling, and recommendation systems while maintaining high memory efficiency (2-4x less memory than one-hot embeddings).

Conclusion: PHE effectively handles evolving vocabularies, prevents forgetting, is order-invariant, and maintains bounded parameters, making it suitable for online learning with categorical features.

Abstract: We study streaming data with categorical features where the vocabulary of categorical feature values is changing and can even grow unboundedly over time. Feature hashing is commonly used as a pre-processing step to map these categorical values into a feature space of fixed size before learning their embeddings. While these methods have been developed and evaluated for offline or batch settings, in this paper we consider online settings. We show that deterministic embeddings are sensitive to the arrival order of categories and suffer from forgetting in online learning, leading to performance deterioration. To mitigate this issue, we propose a probabilistic hash embedding (PHE) model that treats hash embeddings as stochastic and applies Bayesian online learning to learn incrementally from data. Based on the structure of PHE, we derive a scalable inference algorithm to learn model parameters and infer/update the posteriors of hash embeddings and other latent variables. Our algorithm (i) can handle an evolving vocabulary of categorical items, (ii) is adaptive to new items without forgetting old items, (iii) is implementable with a bounded set of parameters that does not grow with the number of distinct observed values on the stream, and (iv) is invariant to the item arrival order. Experiments in classification, sequence modeling, and recommendation systems in online learning setups demonstrate the superior performance of PHE while maintaining high memory efficiency (consumes as low as 2~4 memory of a one-hot embedding table). Supplementary materials are at https://github.com/aodongli/probabilistic-hash-embeddings

[370] Evolved SampleWeights for Bias Mitigation: Effectiveness Depends on Optimization Objectives

Anil K. Saini, Jose Guadalupe Hernandez, Emily F. Wong, Debanshi Misra, Jason H. Moore

Main category: cs.LG

TL;DR: Genetic Algorithm evolved sample weights achieve better fairness-performance trade-offs than dataset-based or equal weighting methods, with optimization objective choice significantly impacting results.

Details

Motivation: Machine learning models trained on real-world data may make biased predictions that negatively impact marginalized communities, requiring methods to mitigate such bias.

Method: Compared three weighting methods: (1) Genetic Algorithm evolved weights, (2) dataset-characteristic computed weights, (3) equal weights. Used paired predictive (accuracy, AUC-ROC) and fairness metrics (demographic parity difference, subgroup false negative fairness) for evaluation and GA optimization.

Result: Evolved sample weights produced models with better fairness-performance trade-offs than alternative methods. Benefits magnitude depends on optimization objectives - optimizing with accuracy and demographic parity difference yielded most datasets where evolved weights significantly outperformed other strategies.

Conclusion: Genetic Algorithm evolved weights can effectively balance fairness and predictive performance, with objective selection being crucial for optimal results across diverse datasets.

Abstract: Machine learning models trained on real-world data may inadvertently make biased predictions that negatively impact marginalized communities. Reweighting is a method that can mitigate such bias in model predictions by assigning a weight to each data point used during model training. In this paper, we compare three methods for generating these weights: (1) evolving them using a Genetic Algorithm (GA), (2) computing them using only dataset characteristics, and (3) assigning equal weights to all data points. Model performance under each strategy was evaluated using paired predictive and fairness metrics, which also served as optimization objectives for the GA during evolution. Specifically, we used two predictive metrics (accuracy and area under the Receiver Operating Characteristic curve) and two fairness metrics (demographic parity difference and subgroup false negative fairness). Using experiments on eleven publicly available datasets (including two medical datasets), we show that evolved sample weights can produce models that achieve better trade-offs between fairness and predictive performance than alternative weighting methods. However, the magnitude of these benefits depends strongly on the choice of optimization objectives. Our experiments reveal that optimizing with accuracy and demographic parity difference metrics yields the largest number of datasets for which evolved weights are significantly better than other weighting strategies in optimizing both objectives.

[371] Exploring Time-Step Size in Reinforcement Learning for Sepsis Treatment

Yingchuan Sun, Shengpu Tang

Main category: cs.LG

TL;DR: This paper empirically compares four different time-step sizes (1, 2, 4, 8 hours) for offline RL in sepsis management, finding that finer time-step sizes (1-2 hours) generally achieve better performance than the conventional 4-hour setup.

Details

Motivation: Existing RL approaches for sepsis management use 4-hour time steps, but concerns exist about whether this coarse granularity distorts patient dynamics and leads to suboptimal treatment policies. The practical impact of time-step size remains unexplored.

Method: Used an identical offline RL pipeline with four time-step sizes (1, 2, 4, 8 hours). Designed action re-mapping methods for fair cross-time-step evaluation and conducted cross-time-step model selection under two policy learning setups.

Result: Performance trends vary with learning setups, but policies learned at finer time-step sizes (1-2 hours) using a static behavior policy achieve the overall best performance and stability.

Conclusion: Time-step size is a core design choice in offline RL for healthcare, and evidence supports using finer time-step sizes (1-2 hours) as alternatives to the conventional 4-hour setup.

Abstract: Existing studies on reinforcement learning (RL) for sepsis management have mostly followed an established problem setup, in which patient data are aggregated into 4-hour time steps. Although concerns have been raised regarding the coarseness of this time-step size, which might distort patient dynamics and lead to suboptimal treatment policies, the extent to which this is a problem in practice remains unexplored. In this work, we conducted empirical experiments for a controlled comparison of four time-step sizes ($Δt!=!1,2,4,8$ h) on this domain, following an identical offline RL pipeline. To enable a fair comparison across time-step sizes, we designed action re-mapping methods that allow for evaluation of policies on datasets with different time-step sizes, and conducted cross-$Δt$ model selections under two policy learning setups. Our goal was to quantify how time-step size influences state representation learning, behavior cloning, policy training, and off-policy evaluation. Our results show that performance trends across $Δt$ vary as learning setups change, while policies learned at finer time-step sizes ($Δt = 1$ h and $2$ h) using a static behavior policy achieve the overall best performance and stability. Our work highlights time-step size as a core design choice in offline RL for healthcare and provides evidence supporting alternatives beyond the conventional 4-hour setup.

[372] Operationalizing Quantized Disentanglement

Vitoria Barin-Pacela, Kartik Ahuja, Simon Lacoste-Julien, Pascal Vincent

Main category: cs.LG

TL;DR: The paper proposes Cliff, a method for unsupervised disentanglement by encouraging axis-aligned discontinuities in latent factor densities, outperforming existing baselines on disentanglement benchmarks.

Details

Motivation: While theoretical work established identifiability of quantized factors under diffeomorphisms, translating this principle into practical criteria remains challenging, especially for nonlinear maps.

Method: Develop a criterion that encourages axis-aligned discontinuities (cliffs) in the estimated density of factors, ensuring cliff locations along one factor are independent of other factors’ values.

Result: Cliff outperforms all baseline methods on disentanglement benchmarks, demonstrating superior effectiveness in unsupervised disentanglement.

Conclusion: The proposed method successfully translates theoretical principles into practical disentanglement by leveraging axis-aligned discontinuities, achieving state-of-the-art performance.

Abstract: Recent theoretical work established the unsupervised identifiability of quantized factors under any diffeomorphism. The theory assumes that quantization thresholds correspond to axis-aligned discontinuities in the probability density of the latent factors. By constraining a learned map to have a density with axis-aligned discontinuities, we can recover the quantization of the factors. However, translating this high-level principle into an effective practical criterion remains challenging, especially under nonlinear maps. Here, we develop a criterion for unsupervised disentanglement by encouraging axis-aligned discontinuities. Discontinuities manifest as sharp changes in the estimated density of factors and form what we call cliffs. Following the definition of independent discontinuities from the theory, we encourage the location of the cliffs along a factor to be independent of the values of the other factors. We show that our method, Cliff, outperforms the baselines on all disentanglement benchmarks, demonstrating its effectiveness in unsupervised disentanglement.

[373] Semantic Superiority vs. Forensic Efficiency: A Comparative Analysis of Deep Learning and Psycholinguistics for Business Email Compromise Detection

Yaw Osei Adjei

Main category: cs.LG

TL;DR: This paper compares two BEC detection approaches: CatBoost for psycholinguistic analysis (fast, interpretable) and DistilBERT for semantic understanding (high accuracy, higher cost). Both achieve excellent performance with >99.96% ROI when optimized through cost-sensitive learning.

Details

Motivation: BEC causes massive financial losses ($2.9B annually) with extreme cost asymmetry where false negatives (fraud losses) are 5,480x more expensive than false positives (manual reviews).

Method: Two detection streams: 1) Forensic Psycholinguistic Stream using CatBoost for psycholinguistic cues, 2) Semantic Stream using DistilBERT for contextual language understanding. Evaluated on adversarially poisoned dataset (N=7,990) using Black Hole protocol on Tesla T4 GPU.

Result: DistilBERT achieved perfect detection (AUC=1.0000, F1=0.9981) with 7.403ms latency. CatBoost achieved competitive performance (AUC=0.9905, F1=0.9486) with 8.4x lower latency (0.885ms) and negligible resource consumption.

Conclusion: DistilBERT is optimal for GPU-equipped organizations requiring maximum accuracy, while CatBoost is better for edge deployments or cost-sensitive environments. Both approaches provide >99.96% ROI when optimized through cost-sensitive learning.

Abstract: Business Email Compromise (BEC) is a sophisticated social engineering threat that manipulates organizational hierarchies and exploits psychological vulnerabilities, leading to significant financial damage. According to the 2024 FBI Internet Crime Report, BEC accounts for over $2.9 billion in annual adjusted losses, presenting significant economic asymmetry: the cost of a False Negative (fraud loss) exceeds the cost of a False Positive (manual review) by orders of magnitude (approximately 1 to 5,480). This paper examines two detection paradigms for BEC: the Forensic Psycholinguistic Stream, which utilizes CatBoost to analyze psycholinguistic cues with high interpretability and low latency, and the Semantic Stream, which employs DistilBERT for deep learning-based contextual language understanding, offering superior accuracy at higher computational cost. We evaluated DistilBERT on an adversarially poisoned dataset (N = 7,990) generated via our Black Hole protocol, benchmarked on Tesla T4 GPU infrastructure, achieving superior detection (AUC = 1.0000, F1 = 0.9981) with acceptable real-time latency (7.403 milliseconds). CatBoost achieves competitive detection (AUC = 0.9905, F1 = 0.9486) at 8.4x lower latency (0.885 milliseconds), consuming negligible computational resources. For organizations with GPU infrastructure, DistilBERT offers superior accuracy. CatBoost is preferable for edge deployments or cost-sensitive environments due to comparable security and lower operational costs. Both approaches demonstrate return on investment exceeding 99.96% when optimized through cost-sensitive learning, by significantly reducing false negatives and associated financial losses.

[374] Dataset Poisoning Attacks on Behavioral Cloning Policies

Akansha Kalra, Soumil Datta, Ethan Gilmore, Duc La, Guanhong Tao, Daniel S. Brown

Main category: cs.LG

TL;DR: First analysis of clean-label backdoor attacks on Behavior Cloning policies, showing they remain vulnerable even with minimal dataset poisoning while maintaining deceptive baseline performance.

Details

Motivation: As Behavior Cloning policies are increasingly deployed in real-world systems, understanding their robustness and vulnerabilities to backdoor attacks is crucial for safety and security.

Method: Inject visual triggers into demonstration datasets to create spurious correlations, and introduce novel entropy-based test-time trigger attacks that identify critical states for maximum performance degradation.

Result: BC policies trained on minimally poisoned datasets show near-baseline task performance but are highly vulnerable to backdoor attacks during deployment, with effectiveness scaling with poison fraction and trigger strength.

Conclusion: Urgent need for more research into BC policy robustness, especially as large-scale datasets are used for real-world cyber-physical systems, due to deceptive vulnerability to backdoor attacks.

Abstract: Behavior Cloning (BC) is a popular framework for training sequential decision policies from expert demonstrations via supervised learning. As these policies are increasingly being deployed in the real world, their robustness and potential vulnerabilities are an important concern. In this work, we perform the first analysis of the efficacy of clean-label backdoor attacks on BC policies. Our backdoor attacks poison a dataset of demonstrations by injecting a visual trigger to create a spurious correlation that can be exploited at test time. We evaluate how policy vulnerability scales with the fraction of poisoned data, the strength of the trigger, and the trigger type. We also introduce a novel entropy-based test-time trigger attack that substantially degrades policy performance by identifying critical states where test-time triggering of the backdoor is expected to be most effective at degrading performance. We empirically demonstrate that BC policies trained on even minimally poisoned datasets exhibit deceptively high, near-baseline task performance despite being highly vulnerable to backdoor trigger attacks during deployment. Our results underscore the urgent need for more research into the robustness of BC policies, particularly as large-scale datasets are increasingly used to train policies for real-world cyber-physical systems. Videos and code are available at https://sites.google.com/view/dataset-poisoning-in-bc.

[375] Subgoal Graph-Augmented Planning for LLM-Guided Open-World Reinforcement Learning

Shanwei Fan

Main category: cs.LG

TL;DR: SGA-ACR framework addresses LLM planning-execution misalignment in RL by integrating environment-specific subgoal graphs with multi-LLM planning pipeline that separates generation, critique, and refinement.

Details

Motivation: LLMs have strong high-level planning for RL but suffer from poor planning-execution alignment due to semantically plausible but infeasible subgoals and conflated generation-verification processes.

Method: Proposes SGA-ACR framework with environment-specific subgoal graph, structured entity knowledge, multi-LLM planning pipeline (generation, critique, refinement), and subgoal tracker for execution monitoring and adaptive graph updates.

Result: Experimental results on 22 diverse tasks in ‘Crafter’ game demonstrate effectiveness of the proposed method.

Conclusion: The framework successfully addresses LLM planning-execution misalignment by producing executable and verifiable subgoals through structured multi-stage planning and environment-aware adaptation.

Abstract: Large language models (LLMs) offer strong high-level planning capabilities for reinforcement learning (RL) by decomposing tasks into subgoals. However, their practical utility is limited by poor planning-execution alignment, which reflects a critical gap between abstract plans and actionable, environment-compatible behaviors. This misalignment arises from two interrelated limitations: (1) LLMs often produce subgoals that are semantically plausible but infeasible or irrelevant in the target environment due to insufficient grounding in environment-specific knowledge, and (2) single-LLM planning conflates generation with self-verification, resulting in overconfident yet unreliable subgoals that frequently fail during execution. To address these challenges, we propose Subgoal Graph-Augmented Actor-Critic-Refiner (SGA-ACR), a framework that integrates an environment-specific subgoal graph and structured entity knowledge with a multi-LLM planning pipeline that explicitly separates generation, critique, and refinement to produce executable and verifiable subgoals. A subgoal tracker further monitors execution progress, provides auxiliary rewards, and adaptively updates the subgoal graph to maintain alignment between plans and actions. Experimental results on 22 diverse tasks in the open-world game “Crafter” demonstrate the effectiveness of our proposed method.

[376] FANoise: Singular Value-Adaptive Noise Modulation for Robust Multimodal Representation Learning

Jiaoyang Li, Jun Fang, Tianhao Gao, Xiaohui Zhang, Zhiyuan Liu, Chao Liu, Pengzhang Liu, Qixia Jiang

Main category: cs.LG

TL;DR: FANoise is a feature-adaptive noise injection strategy for multimodal representation learning that dynamically adjusts noise based on feature distributions during training, improving performance over static noise methods.

Details

Motivation: Existing noise injection methods for representation learning rely on heuristic or static noise, failing to account for the dynamic nature of feature distributions during training, which limits their effectiveness.

Method: Proposed FANoise, a feature-adaptive noise injection strategy that leverages contrastive learning dynamics to mitigate negative noise impacts while preserving benefits, using InfoNCE loss as the foundation.

Result: Comprehensive experiments show FANoise consistently improves overall performance on multimodal tasks across various base VLM models compared to static noise methods.

Conclusion: Feature-adaptive noise injection through FANoise provides a theoretically grounded framework that effectively enhances representation learning by dynamically adjusting to training dynamics.

Abstract: Representation learning is fundamental to modern machine learning, powering applications such as text retrieval and multimodal understanding. However, learning robust and generalizable representations remains challenging. While prior work has demonstrated that active noise injection, a form of data augmentation, can enhance encoding performance, most existing methods rely on heuristic or static noise, overlooking the dynamic nature of feature distributions during training. In this work, we systematically study the role of noise in representation learning from both gradient-based and feature distribution perspectives, using InfoNCE loss as a representative example. Focusing on multimodal representation learning, we propose FANoise, a novel feature-adaptive noise injection strategy. By leveraging the dynamics of contrastive learning, FANoise effectively mitigates the negative impacts of noise while preserving its benefits. Under this theoretically grounded framework, comprehensive experiments demonstrate that FANoise consistently improves overall performance on multimodal tasks across various base VLM models.

[377] Estimating Ising Models in Total Variation Distance

Constantinos Daskalakis, Vardis Kandiros, Rui Yao

Main category: cs.LG

TL;DR: The paper provides a unified analysis of Maximum Pseudo-Likelihood Estimator (MPLE) for Ising models, achieving polynomial-time estimation in Total Variation distance for two general classes: models with bounded operator norm satisfying Modified Log-Sobolev Inequality, and models with bounded infinity norm.

Details

Motivation: While statistical complexity of Ising model estimation is understood, finding computationally and statistically efficient algorithms has been challenging. Previous work focused on specific cases like trees, Gaussian interactions, or special eigenvalue distributions, but no unified polynomial-time framework existed for general TV distance estimation.

Method: The authors use Maximum Pseudo-Likelihood Estimator (MPLE) with a unified analysis approach. They employ tools including tensorization inequalities, measure decompositions, and concentration bounds to analyze MPLE for two general classes of Ising models.

Result: The analysis yields polynomial-time algorithms with optimal or near-optimal sample complexity guarantees for various settings. The framework works for models with bounded operator norm satisfying MLSI, and models with bounded infinity norm (bounded width).

Conclusion: The paper provides a unified framework for polynomial-time estimation of Ising models in TV distance, addressing a long-standing challenge by covering general classes beyond previously studied special cases.

Abstract: We consider the problem of estimating Ising models over $n$ variables in Total Variation (TV) distance, given $l$ independent samples from the model. While the statistical complexity of the problem is well-understood [DMR20], identifying computationally and statistically efficient algorithms has been challenging. In particular, remarkable progress has occurred in several settings, such as when the underlying graph is a tree [DP21, BGPV21], when the entries of the interaction matrix follow a Gaussian distribution [GM24, CK24], or when the bulk of its eigenvalues lie in a small interval [AJK+24, KLV24], but no unified framework for polynomial-time estimation in TV exists so far. Our main contribution is a unified analysis of the Maximum Pseudo-Likelihood Estimator (MPLE) for two general classes of Ising models. The first class includes models that have bounded operator norm and satisfy the Modified Log-Sobolev Inequality (MLSI), a functional inequality that was introduced to study the convergence of the associated Glauber dynamics to stationarity. In the second class of models, the interaction matrix has bounded infinity norm (or bounded width), which is the most common assumption in the literature for structure learning of Ising models. We show how our general results for these classes yield polynomial-time algorithms and optimal or near-optimal sample complexity guarantees in a variety of settings. Our proofs employ a variety of tools from tensorization inequalities to measure decompositions and concentration bounds.

[378] ChatGpt Content detection: A new approach using xlm-roberta alignment

Md Tasnin Tanvir, Dr Santanu Kumar Dash, Ishan Shahnan, Nafis Fuad, Tanvir Rahman, Abdullah Al Faisal, Asadullah Al Mamun

Main category: cs.LG

TL;DR: This paper presents an AI-generated text detection system using XLM-RoBERTa that achieves high accuracy in distinguishing human and AI content, with feature analysis showing perplexity and attention features as key differentiators.

Details

Motivation: The urgent need to detect AI-generated text as generative AI technologies like ChatGPT become more widely available, to maintain academic integrity and promote transparency in AI systems.

Method: Used XLM-RoBERTa transformer model with rigorous preprocessing and feature extraction (perplexity, semantic, readability features), fine-tuned on balanced human/AI text dataset.

Result: The model demonstrated high accuracy and robust performance across various text genres, with feature analysis revealing perplexity and attention-based features as critical differentiators.

Conclusion: Provides a valuable tool for academic integrity and AI ethics, with future research directions including exploring other advanced models and expanding datasets for better generalizability.

Abstract: The challenge of separating AI-generated text from human-authored content is becoming more urgent as generative AI technologies like ChatGPT become more widely available. In this work, we address this issue by looking at both the detection of content that has been entirely generated by AI and the identification of human text that has been reworded by AI. In our work, a comprehensive methodology to detect AI- generated text using XLM-RoBERTa, a state-of-the-art multilingual transformer model. Our approach includes rigorous preprocessing, and feature extraction involving perplexity, semantic, and readability features. We fine-tuned the XLM-RoBERTa model on a balanced dataset of human and AI-generated texts and evaluated its performance. The model demonstrated high accuracy and robust performance across various text genres. Additionally, we conducted feature analysis to understand the model’s decision-making process, revealing that perplexity and attention-based features are critical in differentiating between human and AI-generated texts. Our findings offer a valuable tool for maintaining academic integrity and contribute to the broader field of AI ethics by promoting transparency and accountability in AI systems. Future research directions include exploring other advanced models and expanding the dataset to enhance the model’s generalizability.

[379] Staggered Environment Resets Improve Massively Parallel On-Policy Reinforcement Learning

Sid Bharthulwar, Stone Tao, Hao Su

Main category: cs.LG

TL;DR: Staggered resets reduce harmful nonstationarity in massively parallel GPU RL training by initializing environments at varied points in the task horizon, improving sample efficiency and performance.

Details

Motivation: Standard synchronous resets in massively parallel RL training with high update-to-data ratios introduce harmful nonstationarity that skews learning signals and destabilizes training.

Method: Introduce staggered resets where environments are initialized and reset at varied points within the task horizon, creating training batches with greater temporal diversity.

Result: Achieved significantly higher sample efficiency, faster wall-clock convergence, and stronger final performance in challenging high-dimensional robotics environments, with better scaling to more parallel environments.

Conclusion: Staggered resets are a simple yet effective technique that reduces nonstationarity in parallel RL training, leading to improved training stability and performance.

Abstract: Massively parallel GPU simulation environments have accelerated reinforcement learning (RL) research by enabling fast data collection for on-policy RL algorithms like Proximal Policy Optimization (PPO). To maximize throughput, it is common to use short rollouts per policy update, increasing the update-to-data (UTD) ra- tio. However, we find that, in this setting, standard synchronous resets introduce harmful nonstationarity, skewing the learning signal and destabilizing training. We introduce staggered resets, a simple yet effective technique where environments are initialized and reset at varied points within the task horizon. This yields training batches with greater temporal diversity, reducing the nonstationarity induced by synchronized rollouts. We characterize dimensions along which RL environments can benefit significantly from staggered resets through illustrative toy environ- ments. We then apply this technique to challenging high-dimensional robotics environments, achieving significantly higher sample efficiency, faster wall-clock convergence, and stronger final performance. Finally, this technique scales better with more parallel environments compared to naive synchronized rollouts.

[380] Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression

Liangzu Peng, Aditya Chattopadhyay, Luca Zancato, Elvis Nunez, Wei Xia, Stefano Soatto

Main category: cs.LG

TL;DR: Gated KalmaNet (GKA) is a linear state-space model layer that solves online ridge regression to maintain full past information efficiently, achieving constant memory and linear compute while outperforming other SSMs on both short and long-context tasks.

Details

Motivation: Linear state-space models are efficient but maintain only lossy summaries of the past, leading to inferior performance in recall-oriented tasks. GKA aims to bridge this gap by accounting for the full past while maintaining efficiency.

Method: GKA solves online ridge regression using Kalman Filter inspiration with two key innovations: adaptive regularization with input-dependent gating for numerical stability, and Chebyshev Iteration for stable low-precision computation. Includes hardware-aware chunk-wise implementation and custom backpropagation kernels.

Result: GKA outperforms existing SSM layers (Mamba2, GLA, Gated DeltaNet) on short-context language understanding tasks. On long-context tasks up to 128k tokens, achieves >10% relative improvement over fading memory baselines in RAG and LongQA tasks.

Conclusion: GKA successfully bridges the performance gap between efficient SSMs and full-context models by maintaining complete past information through stable online ridge regression, demonstrating strong capabilities across both short and long-context scenarios.

Abstract: As efficient alternatives to softmax Attention, linear state-space models (SSMs) achieve constant memory and linear compute, but maintain only a lossy, fading summary of the past, often leading to inferior performance in recall oriented tasks. We propose Gated KalmaNet (GKA), a layer that reduces this gap by accounting for the full past when predicting the next token, while maintaining SSM-style efficiency. GKA achieves this by solving an online ridge regression problem at test time, with constant memory and linear compute cost in the sequence length. Drawing inspiration from the Kalman Filter, we iteratively solve the online ridge regression problem. However, a critical insight is that standard Kalman filter equations are numerically unstable in low-precision environments (like bfloat16) and difficult to parallelize in modern hardware. We address both challenges through two key innovations: (1) an adaptive regularization strategy with input-dependent gating that controls the condition number of the ridge regression, ensuring numerical stability while balancing memory retention. And (2) the use of Chebyshev Iteration instead of other conventional iterative solvers, which we demonstrate to be more stable in low-precision settings. To further improve scalability, we develop a hardware-aware chunk-wise implementation of Chebyshev Iteration along with custom kernels for backpropagating through our adaptive regularization and gating mechanisms. Empirically, GKA shows strong language understanding capabilites on short-context tasks outperforming existing SSM layers (like Mamba2, GLA and Gated DeltaNet). On long-context, GKA excels at real-world RAG and LongQA tasks up to 128k tokens, achieving more than $10$% relative improvement over other fading memory baselines.

[381] Probabilistic Wildfire Spread Prediction Using an Autoregressive Conditional Generative Adversarial Network

Taehoon Kang, Taeyong Kim

Main category: cs.LG

TL;DR: Proposes an autoregressive conditional GAN for probabilistic wildfire spread prediction that outperforms conventional deep learning models in accuracy and boundary delineation while capturing nonlinear dynamics.

Details

Motivation: Climate change intensifies wildfires, requiring rapid prediction. Physics-based simulators are computationally intensive, while existing deep learning models produce overly smooth predictions that miss complex wildfire dynamics.

Method: Autoregressive conditional generative adversarial network (CGAN) that learns sequential state transitions for long-term prediction stability, formulated as an autoregressive problem.

Result: Outperforms conventional deep learning models in both overall predictive accuracy and boundary delineation of fire perimeters, capturing strong nonlinearity and uncertainty of wildfire spread.

Conclusion: The CGAN-based autoregressive framework enhances accuracy and physical interpretability of wildfire spread prediction, providing a foundation for time-sensitive response and evacuation planning.

Abstract: Climate change has intensified the frequency and severity of wildfires, making rapid and accurate prediction of fire spread essential for effective mitigation and response. Physics-based simulators such as FARSITE offer high-fidelity predictions but are computationally intensive, limiting their applicability in real-time decision-making, while existing deep learning models often yield overly smooth predictions that fail to capture the complex, nonlinear dynamics of wildfire propagation. This study proposes an autoregressive conditional generative adversarial network (CGAN) for probabilistic wildfire spread prediction. By formulating the prediction task as an autoregressive problem, the model learns sequential state transitions, ensuring long-term prediction stability. Experimental results demonstrate that the proposed CGAN-based model outperforms conventional deep learning models in both overall predictive accuracy and boundary delineation of fire perimeters. These results demonstrate that adversarial learning allows the model to capture the strong nonlinearity and uncertainty of wildfire spread, instead of simply fitting the pixel average. Furthermore, the autoregressive framework facilitates systematic temporal forecasting of wildfire evolution. The proposed CGAN-based autoregressive framework enhances both the accuracy and physical interpretability of wildfire spread prediction, offering a promising foundation for time-sensitive response and evacuation planning.

[382] A Probabilistic Framework for Temporal Distribution Generalization in Industry-Scale Recommender Systems

Yuxuan Zhu, Cong Fu, Yabo Ni, Anxiang Zeng, Yuan Fang

Main category: cs.LG

TL;DR: ELBO_TDS is a probabilistic framework that addresses temporal distribution shift in recommender systems through causal modeling and data augmentation, achieving 2.33% GMV uplift and deployed in Shopee Product Search.

Details

Motivation: Temporal distribution shift erodes recommender system accuracy, and existing methods like invariant learning and self-supervised learning suffer from unstable generalization, representation collapse, or inefficient data utilization.

Method: Statistical analysis identifies shifting factors, followed by data augmentation resampling time-varying factors. A causal graph models the temporal recommendation scenario, deriving a self-supervised variational objective ELBO_TDS.

Result: Superior temporal generalization with 2.33% uplift in GMV per user, successfully deployed in Shopee Product Search.

Conclusion: ELBO_TDS effectively addresses temporal distribution shift through causal modeling and data augmentation, providing stable temporal generalization for industrial recommender systems.

Abstract: Temporal distribution shift (TDS) erodes the long-term accuracy of recommender systems, yet industrial practice still relies on periodic incremental training, which struggles to capture both stable and transient patterns. Existing approaches such as invariant learning and self-supervised learning offer partial solutions but often suffer from unstable temporal generalization, representation collapse, or inefficient data utilization. To address these limitations, we propose ELBO$\text{TDS}$, a probabilistic framework that integrates seamlessly into industry-scale incremental learning pipelines. First, we identify key shifting factors through statistical analysis of real-world production data and design a simple yet effective data augmentation strategy that resamples these time-varying factors to extend the training support. Second, to harness the benefits of this extended distribution while preventing representation collapse, we model the temporal recommendation scenario using a causal graph and derive a self-supervised variational objective, ELBO$\text{TDS}$, grounded in the causal structure. Extensive experiments supported by both theoretical and empirical analysis demonstrate that our method achieves superior temporal generalization, yielding a 2.33% uplift in GMV per user and has been successfully deployed in Shopee Product Search. Code is available at https://github.com/FuCongResearchSquad/ELBO4TDS.

[383] Prediction of Herd Life in Dairy Cows Using Multi-Head Attention Transformers

Mahdi Saki, Justin Lipman

Main category: cs.LG

TL;DR: AI model predicts cow longevity using historical data with 83% accuracy to help farmers make better culling decisions.

Details

Motivation: Dairy farmers need objective tools to identify resilient cows that can complete more lactations, as current decision-making is complex with significant economic and environmental implications.

Method: Used Multi-Head Attention Transformers to analyze 780,000 records from 19,000 cows across 7 Australian farms, leveraging historical multivariate time-series data from birth.

Result: The model achieved an overall determination coefficient of 83% in predicting herd life across the studied farms.

Conclusion: The AI-driven model shows strong potential for practical application in dairy herd management to improve culling decisions and identify more resilient cows.

Abstract: Dairy farmers should decide to keep or cull a cow based on an objective assessment of her likely performance in the herd. For this purpose, farmers need to identify more resilient cows, which can cope better with farm conditions and complete more lactations. This decision-making process is inherently complex, with significant environmental and economic implications. In this study, we develop an AI-driven model to predict cow longevity using historical multivariate time-series data recorded from birth. Leveraging advanced AI techniques, specifically Multi-Head Attention Transformers, we analysed approximately 780,000 records from 19,000 unique cows across 7 farms in Australia. The results demonstrate that our model achieves an overall determination coefficient of 83% in predicting herd life across the studied farms, highlighting its potential for practical application in dairy herd management.

[384] RAVQ-HoloNet: Rate-Adaptive Vector-Quantized Hologram Compression

Shima Rafiei, Zahra Nabizadeh Shahr Babak, Shadrokh Samavi, Shahram Shirani

Main category: cs.LG

TL;DR: RAVQ-HoloNet is a rate-adaptive vector quantization framework for holography compression that achieves superior performance at low bit rates compared to existing methods.

Details

Motivation: Holography has great potential for AR/VR but faces adoption challenges due to high data compression demands, and current deep learning approaches lack rate adaptivity within single networks.

Method: A rate-adaptive vector quantization framework called RAVQ-HoloNet that enables high-fidelity reconstructions at various bit rates.

Result: Outperforms state-of-the-art methods with -33.91% BD-Rate improvement and 1.02 dB BD-PSNR gain in low bit rate scenarios, as shown by rate-distortion curves.

Conclusion: The proposed framework successfully addresses the rate adaptivity limitation in holography compression and delivers superior performance at low and ultra-low bit rates.

Abstract: Holography offers significant potential for AR/VR applications, yet its adoption is limited by the high demands of data compression. Existing deep learning approaches generally lack rate adaptivity within a single network. We present RAVQ-HoloNet, a rate-adaptive vector quantization framework that achieves high-fidelity reconstructions at low and ultra-low bit rates, outperforming current state-of-the-art methods. In low bit, our method exceeds by -33.91% in BD-Rate and achieves a BD-PSNR of 1.02 dB from the best existing method demonstrated by the rate-distortion curve.

[385] CNN-LSTM Hybrid Architecture for Over-the-Air Automatic Modulation Classification Using SDR

Dinanath Padhya, Krishna Acharya, Bipul Kumar Dahal, Dinesh Baniya Kshatri

Main category: cs.LG

TL;DR: A hybrid CNN-LSTM architecture for automatic modulation classification achieves 93.48% accuracy by combining spatial feature extraction with temporal dependency modeling, validated through over-the-air testing.

Details

Motivation: AMC is essential for cognitive radio, spectrum monitoring, and intelligent communication networks to identify modulation schemes without prior knowledge.

Method: Hybrid CNN-LSTM architecture integrated with SDR platform, using CNN for spatial features and LSTM for temporal dependencies, trained on hybrid dataset (RadioML2018 + custom) with SNRs 0-30dB.

Result: Achieved 93.48% accuracy, 93.53% precision, 93.48% recall, 93.45% F1 score, with AUC-ROC confirming discriminative power in noisy conditions. Successfully identified OTA signals from FM transmitter.

Conclusion: The hybrid CNN-LSTM architecture is effective for AMC and has potential applications in adaptive spectrum management and advanced cognitive radio systems.

Abstract: Automatic Modulation Classification (AMC) is a core technology for future wireless communication systems, enabling the identification of modulation schemes without prior knowledge. This capability is essential for applications in cognitive radio, spectrum monitoring, and intelligent communication networks. We propose an AMC system based on a hybrid Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) architecture, integrated with a Software Defined Radio (SDR) platform. The proposed architecture leverages CNNs for spatial feature extraction and LSTMs for capturing temporal dependencies, enabling efficient handling of complex, time-varying communication signals. The system’s practical ability was demonstrated by identifying over-the-air (OTA) signals from a custom-built FM transmitter alongside other modulation schemes. The system was trained on a hybrid dataset combining the RadioML2018 dataset with a custom-generated dataset, featuring samples at Signal-to-Noise Ratios (SNRs) from 0 to 30dB. System performance was evaluated using accuracy, precision, recall, F1 score, and the Area Under the Receiver Operating Characteristic Curve (AUC-ROC). The optimized model achieved 93.48% accuracy, 93.53% precision, 93.48% recall, and an F1 score of 93.45%. The AUC-ROC analysis confirmed the model’s discriminative power, even in noisy conditions. This paper’s experimental results validate the effectiveness of the hybrid CNN-LSTM architecture for AMC, suggesting its potential application in adaptive spectrum management and advanced cognitive radio systems.

[386] FedAPA: Federated Learning with Adaptive Prototype Aggregation Toward Heterogeneous Wi-Fi CSI-based Crowd Counting

Jingtao Guo, Yuyi Mao, Ivan Wang-Hei Ho

Main category: cs.LG

TL;DR: FedAPA is a federated learning approach for Wi-Fi CSI-based sensing that uses adaptive prototype aggregation to handle data heterogeneity and reduce communication overhead while improving accuracy in crowd counting tasks.

Details

Motivation: Large-scale deployment of Wi-Fi CSI-based sensing is limited by the need for extensive site-specific training data. Federated learning can help but faces challenges with heterogeneous sensing data and device resources.

Method: Uses adaptive prototype aggregation (APA) strategy with similarity-based weights for peer prototypes, creating personalized global prototypes. Combines classification learning with representation contrastive learning during local training to align local and global knowledge.

Result: Outperforms baselines in real-world distributed Wi-Fi crowd counting with 6 environments and up to 20 people: 9.65% accuracy increase, 9% F1 score gain, 0.29 MAE reduction, and 95.94% communication overhead reduction.

Conclusion: FedAPA effectively addresses data heterogeneity in federated Wi-Fi sensing, providing significant improvements in accuracy and communication efficiency for practical deployment.

Abstract: Wi-Fi channel state information (CSI)-based sensing provides a non-invasive, device-free approach for tasks such as human activity recognition and crowd counting, but large-scale deployment is hindered by the need for extensive site-specific training data. Federated learning (FL) offers a way to avoid raw data sharing but is challenged by heterogeneous sensing data and device resources. This paper proposes FedAPA, a collaborative Wi-Fi CSI-based sensing algorithm that uses adaptive prototype aggregation (APA) strategy to assign similarity-based weights to peer prototypes, enabling adaptive client contributions and yielding a personalized global prototype for each client instead of a fixed-weight aggregation. During local training, we adopt a hybrid objective that combines classification learning with representation contrastive learning to align local and global knowledge. We provide a convergence analysis of FedAPA and evaluate it in a real-world distributed Wi-Fi crowd counting scenario with six environments and up to 20 people. The results show that our method outperform multiple baselines in terms of accuracy, F1 score, mean absolute error (MAE), and communication overhead, with FedAPA achieving at least a 9.65% increase in accuracy, a 9% gain in F1 score, a 0.29 reduction in MAE, and a 95.94% reduction in communication overhead.

[387] Breaking the Safety-Capability Tradeoff: Reinforcement Learning with Verifiable Rewards Maintains Safety Guardrails in LLMs

Dongkyu Derek Cho, Huan Song, Arijit Ghosh Chowdhury, Haotian An, Yawei Wang, Rohit Thekkanal, Negin Sokhandan, Sharlina Keshava, Hannah Marlowe

Main category: cs.LG

TL;DR: RLVR (reinforcement learning with verifiable rewards) can enhance reasoning capabilities while maintaining or improving safety, challenging the conventional safety-capability tradeoff in LLM fine-tuning.

Details

Motivation: Current fine-tuning methods (SFT, RLHF) for LLMs show a fundamental safety-capability tradeoff where improved task performance degrades safety alignment, even on benign datasets. The safety implications of RLVR remain unexplored.

Method: Comprehensive theoretical analysis with safety drift bounds under KL-constrained optimization, plus extensive empirical experiments across five adversarial safety benchmarks, examining optimization algorithms, model scale, and task domains.

Result: RLVR can simultaneously enhance reasoning capabilities while maintaining or improving safety guardrails, with theoretical conditions identified where safety degradation is eliminated.

Conclusion: The findings challenge the prevailing assumption of an inevitable safety-capability trade-off, showing that specific training methodologies can achieve both objectives simultaneously, providing insights for safe deployment of reasoning-capable LLMs.

Abstract: Fine-tuning large language models (LLMs) for downstream tasks typically exhibit a fundamental safety-capability tradeoff, where improving task performance degrades safety alignment even on benign datasets. This degradation persists across standard approaches including supervised finetuning (SFT) and reinforcement learning from human feedback (RLHF). While reinforcement learning with verifiable rewards (RLVR) has emerged as a promising alternative that optimizes models on objectively measurable tasks, its safety implications remain unexplored. We present the first comprehensive theoretical and empirical analysis of safety properties in RLVR. Theoretically, we derive upper bounds on safety drift under KL-constrained optimization and prove conditions under which safety degradation is eliminated. Empirically, we conduct extensive experiments across five adversarial safety benchmarks, demonstrating that RLVR can simultaneously enhance reasoning capabilities while maintaining or improving safety guardrails. Our comprehensive ablation studies examine the effects of optimization algorithms, model scale, and task domains. Our findings challenge the prevailing assumption of an inevitable safety capability trade-off, and establish that a specific training methodology can achieve both objectives simultaneously, providing insights for the safe deployment of reasoning-capable LLMs.

[388] Efficient Diffusion Planning with Temporal Diffusion

Jiaming Guo, Rui Zhang, Zerun Li, Yunkai Gao, Shaohui Peng, Siming Lan, Xing Hu, Zidong Du, Xishan Zhang, Ling Li

Main category: cs.LG

TL;DR: Temporal Diffusion Planner (TDP) improves decision efficiency in diffusion planning by distributing denoising steps across time, reducing computational overhead while maintaining performance.

Details

Motivation: Previous diffusion planning methods generate new plans at each time step, causing significant computational overhead and lower decision frequencies, while humans create detailed short-term and vague long-term plans that adjust over time.

Method: TDP generates initial plans that become progressively more vague over time, then updates previous plans with small denoising steps at each time step rather than generating entirely new plans, plus automated replanning to prevent plan-reality deviations.

Result: On D4RL benchmarks, TDP improves decision-making frequency by 11-24.8 times compared to previous methods while achieving higher or comparable performance.

Conclusion: TDP successfully addresses computational inefficiency in diffusion planning by mimicking human planning strategies, enabling much higher decision frequencies without sacrificing performance.

Abstract: Diffusion planning is a promising method for learning high-performance policies from offline data. To avoid the impact of discrepancies between planning and reality on performance, previous works generate new plans at each time step. However, this incurs significant computational overhead and leads to lower decision frequencies, and frequent plan switching may also affect performance. In contrast, humans might create detailed short-term plans and more general, sometimes vague, long-term plans, and adjust them over time. Inspired by this, we propose the Temporal Diffusion Planner (TDP) which improves decision efficiency by distributing the denoising steps across the time dimension. TDP begins by generating an initial plan that becomes progressively more vague over time. At each subsequent time step, rather than generating an entirely new plan, TDP updates the previous one with a small number of denoising steps. This reduces the average number of denoising steps, improving decision efficiency. Additionally, we introduce an automated replanning mechanism to prevent significant deviations between the plan and reality. Experiments on D4RL show that, compared to previous works that generate new plans every time step, TDP improves the decision-making frequency by 11-24.8 times while achieving higher or comparable performance.

[389] A Unified Understanding of Offline Data Selection and Online Self-refining Generation for Post-training LLMs

Quan Xiao, Tianyi Chen

Main category: cs.LG

TL;DR: A unified framework for offline data selection and online self-refining generation in LLM fine-tuning, using bilevel optimization to assign learned weights to data samples.

Details

Motivation: To improve LLM adaptation to specific tasks by enhancing data quality through systematic offline selection and online refinement processes.

Method: Bilevel data selection for offline filtering and treating online self-refining generation as model adaptation, with learned data weights for questions and responses.

Result: Theoretical demonstration of bilevel data selection effectiveness and performance gains over unfiltered baselines, validated on quality enhancement and safety-aware fine-tuning.

Conclusion: Combining offline data selection with validation-weighted online generations enhances fine-tuning performance for LLMs.

Abstract: Offline data selection and online self-refining generation, which enhance the data quality, are crucial steps in adapting large language models (LLMs) to specific downstream tasks. We tackle offline data selection and online self-refining generations through an optimization perspective. Specifically, bilevel data selection is used for offline data selection with respect to the validation dataset, and we treat online self-refining generation as a model adaptation step of selecting the model trained on current responses that best fits the validation data. Our framework offers a unified understanding of offline data selection and self-refining generation by assigning a learned data weight to each question and response, either explicitly or implicitly. For the first time, we theoretically demonstrate the effectiveness of the bilevel data selection framework and demonstrate its performance gains over unfiltered direct mixing baselines. By combining offline data with validation-weighted online generations, our method enhances fine-tuning performance. Experiments on quality enhancement and safety-aware LLM fine-tuning validate its effectiveness.

[390] G-Net: A Provably Easy Construction of High-Accuracy Random Binary Neural Networks

Alireza Aghasi, Nicholas Marshall, Saeid Pourmand, Wyatt Whiting

Main category: cs.LG

TL;DR: Proposes G-Nets, a novel family of randomized binary neural networks inspired by hyperdimensional computing that achieve competitive accuracy with theoretical guarantees.

Details

Motivation: To bridge neural networks with randomized binary neural networks using hyperdimensional computing principles, offering efficient hardware implementation and robustness to model corruptions.

Method: Uses binary embeddings in hypercube with Hamming distance, creates floating-point G-Nets that can mimic standard layers, and provides randomized binary embeddings (EHD G-Nets) with theoretical guarantees via concentration of measure.

Result: Binary models match CNN accuracies and outperform prior HDC models by large margins (e.g., ~30% higher accuracy on CIFAR-10).

Conclusion: G-Nets provide a theoretically justified bridge between neural networks and randomized binary neural networks, opening new directions for robust binary/quantized deep learning models.

Abstract: We propose a novel randomized algorithm for constructing binary neural networks with tunable accuracy. This approach is motivated by hyperdimensional computing (HDC), which is a brain-inspired paradigm that leverages high-dimensional vector representations, offering efficient hardware implementation and robustness to model corruptions. Unlike traditional low-precision methods that use quantization, we consider binary embeddings of data as points in the hypercube equipped with the Hamming distance. We propose a novel family of floating-point neural networks, G-Nets, which are general enough to mimic standard network layers. Each floating-point G-Net has a randomized binary embedding, an embedded hyperdimensional (EHD) G-Net, that retains the accuracy of its floating-point counterparts, with theoretical guarantees, due to the concentration of measure. Empirically, our binary models match convolutional neural network accuracies and outperform prior HDC models by large margins, for example, we achieve almost 30% higher accuracy on CIFAR-10 compared to prior HDC models. G-Nets are a theoretically justified bridge between neural networks and randomized binary neural networks, opening a new direction for constructing robust binary/quantized deep learning models. Our implementation is available at https://github.com/GNet2025/GNet.

[391] Aligning LLMs with Biomedical Knowledge using Balanced Fine-Tuning

Zhenchao Tang, Fang Wang, Haohuai He, Jiale Zhou, Tianxu Lv, Jun Zhu, Shouzhi Chen, Minghao Yang, Yu Wang, Jiayang Wu, Yidong Song, Jianhua Yao

Main category: cs.LG

TL;DR: BFT is an efficient post-training method that enables LLMs to learn complex biomedical reasoning from sparse data without external rewards, outperforming SFT and specialized agents through token-level and sample-level weighting mechanisms.

Details

Motivation: Current approaches for aligning LLMs with biomedical knowledge face limitations: SFT overfits to surface patterns without internalizing fragmented scientific knowledge, while RL is impractical due to the prohibitive cost of experimental validation for reward signals.

Method: BFT uses a two-layer weighting mechanism: 1) token-level scaling of loss via prediction probabilities to stabilize gradients and prevent overfitting; 2) sample-level “minimum group confidence” to adaptively enhance learning of hard samples.

Result: BFT significantly outperforms SFT in medical tasks, enabling LLMs to acquire knowledge that SFT misses. In biological tasks, BFT-based LLMs surpass GeneAgent in biological process reasoning. BFT embeddings can be directly applied to downstream tasks like gene interaction and single-cell perturbation prediction.

Conclusion: BFT facilitates broad applications of LLMs in biomedical research by enabling effective learning from sparse data without external rewards, making it a practical solution for aligning LLMs with specialized biomedical knowledge.

Abstract: Effective post-training is essential to align Large Language Models (LLMs) with specialized biomedical knowledge to accelerate life science research. However, current approaches face significant limitations. First, biomedical reasoning involves intricate mechanisms often represented by sparse textual data. Standard Supervised Fine-Tuning (SFT) tends to overfit to surface-level instruction patterns without effectively internalizing this fragmented scientific knowledge. Second, Reinforcement Learning (RL) is impractical for this domain, as defining meaningful rewards often necessitates prohibitive experimental validation (e.g., wet-lab verification of drug responses), rendering real-time feedback unfeasible. We propose Balanced Fine-Tuning (BFT), an efficient post-training method designed to learn complex reasoning from sparse data without external reward signals. BFT operates through a two-layer weighting mechanism: 1. At the token level, it scales loss via prediction probabilities to stabilize gradients and prevent overfitting; 2. At the sample level, it uses “minimum group confidence” to adaptively enhance the learning of hard samples. Experiments demonstrate that BFT significantly outperforms SFT. In medical tasks, it enables LLMs to acquire knowledge that SFT misses. In biological tasks, BFT-based LLMs surpass GeneAgent (an accurate agent for biology analysis) in biological process reasoning. Moreover, the text embeddings generated by BFT can be directly applied to downstream tasks, such as gene interaction and single-cell perturbation response prediction. These results indicate that BFT facilitates broad applications of LLMs in biomedical research.

[392] Deceptron: Learned Local Inverses for Fast and Stable Physics Inversion

Aaditya L. Kachhadiya

Main category: cs.LG

TL;DR: The Deceptron is a lightweight bidirectional module that learns local inverses for ill-conditioned inverse problems, enabling faster convergence through output-space gradient descent with inverse preconditioning.

Details

Motivation: Inverse problems in physical sciences are often ill-conditioned, making progress step-size sensitive and requiring many iterations for convergence.

Method: Proposes Deceptron module with training combining supervised fit, forward-reverse consistency, spectral penalty, soft bias tie, and Jacobian Composition Penalty (JCP). Uses D-IPG (Deceptron Inverse-Preconditioned Gradient) that takes descent steps in output space and pulls back through learned inverse.

Result: Achieves ~20x fewer iterations on Heat-1D and ~2-3x fewer on Damped Oscillator problems compared to projected gradient, competitive with Gauss-Newton. JCP reduces composition error and tracks iteration gains.

Conclusion: Deceptron enables efficient inverse problem solving with significantly fewer iterations, and DeceptronNet shows promise for fast convergence in 2D problems under strict fairness protocols.

Abstract: Inverse problems in the physical sciences are often ill-conditioned in input space, making progress step-size sensitive. We propose the Deceptron, a lightweight bidirectional module that learns a local inverse of a differentiable forward surrogate. Training combines a supervised fit, forward-reverse consistency, a lightweight spectral penalty, a soft bias tie, and a Jacobian Composition Penalty (JCP) that encourages $J_g(f(x)),J_f(x)!\approx!I$ via JVP/VJP probes. At solve time, D-IPG (Deceptron Inverse-Preconditioned Gradient) takes a descent step in output space, pulls it back through $g$, and projects under the same backtracking and stopping rules as baselines. On Heat-1D initial-condition recovery and a Damped Oscillator inverse problem, D-IPG reaches a fixed normalized tolerance with $\sim$20$\times$ fewer iterations on Heat and $\sim$2-3$\times$ fewer on Oscillator than projected gradient, competitive in iterations and cost with Gauss-Newton. Diagnostics show JCP reduces a measured composition error and tracks iteration gains. We also preview a single-scale 2D instantiation, DeceptronNet (v0), that learns few-step corrections under a strict fairness protocol and exhibits notably fast convergence.

[393] MLPMoE: Zero-Shot Architectural Metamorphosis of Dense LLM MLPs into Static Mixture-of-Experts

Ivan Novikov

Main category: cs.LG

TL;DR: MLPMoE is a training-free method that converts dense transformer MLPs into static mixture-of-experts without requiring calibration data or router training, using tensor slicing and structured sparsity techniques.

Details

Motivation: Dense transformer deployment is computationally inefficient as all parameters are activated for every token. Existing upcycling methods require clustering, profiling, or calibration data, which limits their practicality.

Method: Uses tensor slicing and summation to reinterpret tensor parallelism algebra as topological conversion. Implements Fractal Fade (differential branch sparsity) and Compensated Pruning (variance-preserving branch reduction) for structured sparsity.

Result: On Qwen2.5-0.5B and DeepSeek-R1-Distill-Llama-8B, MLPMoE changes perplexity by <0.05% while keeping parameters constant. Differential sparsity removes ~20% of MLP parameters with perplexity within ~2% of dense baseline.

Conclusion: MLPMoE provides an efficient post-hoc transformation for existing checkpoints without requiring gradients, calibration sets, or router training, enabling computational efficiency while maintaining performance.

Abstract: Large Language Models (LLMs) are predominantly deployed as dense transformers, where every parameter in every feed-forward block is activated for every token. While architecturally simple, this is computationally inefficient, since inference costs scale linearly with parameter count. Recent upcycling methods such as MoEfication, CMoE, ToMoE, and MoORE reveal that much of the useful computation lives in sparse, semi-modular substructures inside dense feed-forward networks, but these approaches typically rely on clustering, activation profiling, singular value decomposition, or custom routing that requires calibration data. This paper introduces MLPMoE (MLP Mixture-of-Experts), a training-free, deterministic transformation that restructures the dense MLP in transformer blocks into a static, high-cardinality mixture of experts. The transformation uses simple tensor slicing and summation, reinterpreting the algebra of tensor parallelism as a topological conversion rather than a distributed training pattern. We further introduce Fractal Fade (differential branch sparsity) and Compensated Pruning (variance-preserving branch reduction) as lightweight mechanisms for structured sparsity. On Qwen2.5-0.5B-Instruct and DeepSeek-R1-Distill-Llama-8B, the zero-shot MLPMoE transform changes a proxy perplexity metric by less than 0.05 percent while keeping the parameter count effectively constant. On the 8B model, differential sparsity removes about 20 percent of MLP parameters while keeping perplexity within about 2 percent of the dense baseline. The method operates entirely post hoc on existing checkpoints and does not require gradients, calibration sets, or router training. Code is available at https://gist.github.com/iwallarm/fc2ef1eddf226ca7814f9e5e2ae9bad1

[394] MNM : Multi-level Neuroimaging Meta-analysis with Hyperbolic Brain-Text Representations

Seunghun Baek, Jaejin Lee, Jaeyoon Sim, Minjae Jeong, Won Hwa Kim

Main category: cs.LG

TL;DR: A hyperbolic geometry framework that bridges neuroscience literature and brain activation maps by embedding both text and images into a shared hyperbolic space, enabling multi-level neuroimaging meta-analysis.

Details

Motivation: Traditional meta-analysis approaches in neuroimaging often overlook the hierarchical structure of the brain and rely on linear mappings or keyword retrieval, limiting their ability to capture complex brain organization patterns.

Method: Proposes a framework using Lorentz model hyperbolic geometry to embed research article text and corresponding brain images into a shared hyperbolic space, enabling semantic alignment and hierarchical relationship preservation between text and brain activations.

Result: Experimental results show the model outperforms baseline methods, providing a robust and interpretable paradigm for multi-level neuroimaging meta-analysis.

Conclusion: Hyperbolic brain-text representation offers an effective approach for capturing both semantic similarity and hierarchical organization in neuroimaging data, advancing neuroimaging meta-analysis capabilities.

Abstract: Various neuroimaging studies suffer from small sample size problem which often limit their reliability. Meta-analysis addresses this challenge by aggregating findings from different studies to identify consistent patterns of brain activity. However, traditional approaches based on keyword retrieval or linear mappings often overlook the rich hierarchical structure in the brain. In this work, we propose a novel framework that leverages hyperbolic geometry to bridge the gap between neuroscience literature and brain activation maps. By embedding text from research articles and corresponding brain images into a shared hyperbolic space via the Lorentz model, our method captures both semantic similarity and hierarchical organization inherent in neuroimaging data. In the hyperbolic space, our method performs multi-level neuroimaging meta-analysis (MNM) by 1) aligning brain and text embeddings for semantic correspondence, 2) guiding hierarchy between text and brain activations, and 3) preserving the hierarchical relationships within brain activation patterns. Experimental results demonstrate that our model outperforms baselines, offering a robust and interpretable paradigm of multi-level neuroimaging meta-analysis via hyperbolic brain-text representation.

[395] Generative Early Stage Ranking

Juhee Hong, Meng Liu, Shengzhi Wang, Xiaoheng Mao, Huihui Cheng, Leon Gao, Christopher Leung, Jin Zhou, Chandra Mouli Sekar, Zhao Zhu, Ruochen Liu, Tuan Trieu, Dawei Sun, Jeet Kanjani, Rui Li, Jing Qian, Xuan Cao, Minjie Fan, Mingze Gao

Main category: cs.LG

TL;DR: Proposes Generative Early Stage Ranking (GESR) with Mixture of Attention (MoA) modules to bridge effectiveness gap in early stage ranking systems, achieving substantial improvements in recommendation metrics.

Details

Motivation: Early Stage Ranking systems using user-item decoupling are efficient but limited in capturing fine-grained user-item affinities and cross-signals, creating an effectiveness gap.

Method: Introduces GESR paradigm with MoA modules: Hard Matching Attention for explicit cross-signals, Target-Aware Self Attention for personalized user representations, and Cross Attention for enriched user-item interactions, refined by Multi-Logit Parameterized Gating.

Result: Substantial improvements in topline metrics, engagement, and consumption tasks validated by offline and online experiments. First successful deployment of full target-aware attention sequence modeling at ESR scale.

Conclusion: GESR paradigm successfully bridges the effectiveness-efficiency trade-off in large-scale recommendation systems through specialized attention mechanisms and optimization techniques.

Abstract: Large-scale recommendations commonly adopt a multi-stage cascading ranking system paradigm to balance effectiveness and efficiency. Early Stage Ranking (ESR) systems utilize the “user-item decoupling” approach, where independently learned user and item representations are only combined at the final layer. While efficient, this design is limited in effectiveness, as it struggles to capture fine-grained user-item affinities and cross-signals. To address these, we propose the Generative Early Stage Ranking (GESR) paradigm, introducing the Mixture of Attention (MoA) module which leverages diverse attention mechanisms to bridge the effectiveness gap: the Hard Matching Attention (HMA) module encodes explicit cross-signals by computing raw match counts between user and item features; the Target-Aware Self Attention module generates target-aware user representations conditioned on the item, enabling more personalized learning; and the Cross Attention modules facilitate early and more enriched interactions between user-item features. MoA’s specialized attention encodings are further refined in the final layer through a Multi-Logit Parameterized Gating (MLPG) module, which integrates the newly learned embeddings via gating and produces secondary logits that are fused with the primary logit. To address the efficiency and latency challenges, we have introduced a comprehensive suite of optimization techniques. These span from custom kernels that maximize the capabilities of the latest hardware to efficient serving solutions powered by caching mechanisms. The proposed GESR paradigm has shown substantial improvements in topline metrics, engagement, and consumption tasks, as validated by both offline and online experiments. To the best of our knowledge, this marks the first successful deployment of full target-aware attention sequence modeling within an ESR stage at such a scale.

[396] How to Correctly Report LLM-as-a-Judge Evaluations

Chungpa Lee, Thomas Zeng, Jongwon Jeong, Jy-yong Sohn, Kangwook Lee

Main category: cs.LG

TL;DR: A framework for bias correction and confidence interval construction in LLM-based evaluation, addressing noise from imperfect specificity and sensitivity.

Details

Motivation: LLMs are increasingly used as evaluators but their judgments are noisy due to imperfect specificity and sensitivity, leading to biased accuracy estimates. Existing bias-correction methods are underutilized and assume exact knowledge of specificity/sensitivity.

Method: Proposes a simple plug-in framework that corrects bias and constructs confidence intervals reflecting uncertainty from both test and calibration datasets. Also introduces an adaptive algorithm for efficient calibration sample size allocation.

Result: The framework enables practical and statistically sound LLM-based evaluation by properly handling uncertainty in specificity and sensitivity estimates.

Conclusion: The proposed approach provides a statistically rigorous method for LLM-based evaluation that corrects bias and properly accounts for uncertainty, with an adaptive algorithm to optimize calibration efficiency.

Abstract: Large language models (LLMs) are increasingly used as evaluators in lieu of humans. While scalable, their judgments are noisy due to imperfect specificity and sensitivity of LLMs, leading to biased accuracy estimates. Although bias-correction methods exist, they are underutilized in LLM research and typically assume exact knowledge of the model’s specificity and sensitivity. Furthermore, in general we only have estimates of these values and it is not well known how to properly construct confidence intervals using only estimates. This work presents a simple plug-in framework that corrects such bias and constructs confidence intervals reflecting uncertainty from both test and calibration dataset, enabling practical and statistically sound LLM-based evaluation. Additionally, to reduce uncertainty in the accuracy estimate, we introduce an adaptive algorithm that efficiently allocates calibration sample sizes.

[397] From Bits to Rounds: Parallel Decoding with Exploration for Diffusion Language Models

Hengyu Fu, Baihe Huang, Virginia Adams, Charles Wang, Venkat Srinivasan, Jiantao Jiao

Main category: cs.LG

TL;DR: DLMs face an information bottleneck when relying on high-confidence tokens. The paper proposes Explore-Then-Exploit (ETE) decoding to maximize information throughput by exploring uncertain tokens first.

Details

Motivation: Standard DLM decoding strategies encounter an inherent information-theoretic bottleneck that restricts decoding progress and slows generation by prioritizing high-confidence tokens that carry negligible information.

Method: Proposes Explore-Then-Exploit (ETE) - a training-free decoding strategy that combines cross-block decoding with targeted exploration of high-uncertainty tokens to reshape conditional distributions and trigger confident predictions.

Result: ETE consistently reduces the required number of decoding rounds compared to confidence-only baselines without compromising generation quality, verifying the theoretical bounds about information throughput.

Conclusion: Prioritizing high-confidence tokens is inherently inefficient for DLMs, and the proposed ETE strategy effectively overcomes this bottleneck by maximizing information throughput through strategic exploration of uncertain tokens.

Abstract: Diffusion Language Models (DLMs) have recently emerged as a strong alternative to autoregressive language models (LMs). DLMs offer comparable accuracy with faster inference speed via parallel decoding. However, standard DLM decoding strategies relying on high-confidence tokens encounter an inherent information-theoretic bottleneck that restricts decoding progress and ultimately slows generation. We demonstrate both theoretically and empirically that prioritizing high-confidence tokens is inherently inefficient. High-probability tokens carry negligible information and strictly relying on them limits the effective progress made in each decoding round. We prove that the number of decoding rounds must grow linearly with the sample’s total information (negative log-likelihood) and inversely with the per-round information budget, establishing a bits-to-rounds principle. We also propose Explore-Then-Exploit (ETE), a training-free decoding strategy that maximizes information throughput and decoding efficiency. ETE combines cross-block decoding with targeted exploration of high-uncertainty tokens to reshape the conditional distribution and trigger cascades of confident predictions. Experiments verify our theoretical bounds and demonstrate that ETE consistently reduces the required number of decoding rounds compared to confidence-only baselines without compromising generation quality.

[398] BRIDGE: Building Representations In Domain Guided Program Verification

Robert Joseph George, Carson Eisenach, Udaya Ghai, Dominique Perrault-Joncas, Anima Anandkumar, Dean Foster

Main category: cs.LG

TL;DR: BRIDGE introduces structured prompting for verified program generation by decomposing verification into Code, Specifications, and Proofs domains, improving accuracy and efficiency in formal language programming.

Details

Motivation: Large language models struggle with program verification in interactive proof frameworks like Lean4, particularly with scalability across code, specifications, and proofs simultaneously.

Method: BRIDGE decomposes verification into three interconnected domains (Code, Specifications, Proofs) and uses structured prompting to elicit distinct reasoning behaviors as intermediate representations that preserve semantic structure.

Result: Functional reasoning improves Lean4 code correctness by nearly 1.5x (pass@5) over baselines and is 2x more efficient. Specification-driven prompting boosts Python coding pass rates by up to 17.5%.

Conclusion: Structured domain alignment is a promising direction for advancing verified synthesis, establishing a foundation for training via expert iteration or RLVR to internalize reasoning strategies across code, specifications, and proofs.

Abstract: Large language models (LLMs) have achieved impressive results in code generation, yet struggle with program verification, especially in interactive proof frameworks such as Lean4. A central challenge is scalability: verified synthesis requires not just code, but also precise specifications and correctness proofs, and existing approaches rarely span all three domains. We present BRIDGE, the first systematic study of structured prompting for scalable verified program generation. BRIDGE decomposes verification into three interconnected domains: Code (executable implementations), Specifications (formal intent statements), and Proofs (constructive correctness arguments). Our key idea is to elicit distinct reasoning behaviors functional, specification-driven, and proof-oriented as intermediate representations that preserve semantic structure and connect these domains. Through systematic ablations, we show that this approach substantially improves both accuracy and efficiency beyond standard error feedback methods. For example, functional reasoning improves correctness of code in formal languages (Lean4) by nearly 1.5x (pass@5) over direct baselines. In inference-time compute, functional reasoning is also 2x more efficient, achieving higher pass rates with fewer generations and lower total sampling budgets. Similarly, we find that specification-driven prompting boosts Python coding pass rates by up to 17.5%. These findings suggest that structured domain alignment is a promising direction for advancing verified synthesis. BRIDGE establishes a foundation for training via expert iteration or RLVR, enabling models to internalize these reasoning strategies across code, specifications, and proofs.

[399] Dynamic Stratified Contrastive Learning with Upstream Augmentation for MILP Branching

Tongkai Lu, Shuai Ma, Chongyang Tao

Main category: cs.LG

TL;DR: Dynamic Stratified Contrastive Training Framework (SCT) for MILP branching that addresses semantic variation across depths, upstream node scarcity, and costly strong branching sample collection through stratified grouping and upstream-augmented instance generation.

Details

Motivation: Existing neural-based branching methods struggle with semantic variation across B&B tree depths, scarcity of upstream nodes, and expensive collection of strong branching samples, limiting their effectiveness in solving Mixed Integer Linear Programming problems.

Method: Proposes SCT framework that groups B&B nodes by feature distributions, trains GCNN model to progressively separate nodes across groups, and introduces upstream-augmented MILP derivation to generate equivalent and perturbed instances for addressing data scarcity.

Result: Extensive experiments on standard MILP benchmarks show enhanced branching accuracy, reduced solving time, and effective generalization to unseen instances, particularly improving performance for upstream nodes.

Conclusion: SCT effectively models subtle semantic differences between nodes, significantly enhancing branching accuracy and solving efficiency in MILP problems while addressing key limitations of existing neural-based branching methods.

Abstract: Mixed Integer Linear Programming (MILP) is a fundamental class of NP-hard problems that has garnered significant attention from both academia and industry. The Branch-and-Bound (B&B) method is the dominant approach for solving MILPs and the branching plays an important role in B&B methods. Neural-based learning frameworks have recently been developed to enhance branching policies and the efficiency of solving MILPs. However, these methods still struggle with semantic variation across depths, the scarcity of upstream nodes, and the costly collection of strong branching samples. To address these issues, we propose \ours, a Dynamic \underline{\textbf{S}}tratified \underline{\textbf{C}}ontrastive Training Framework for \underline{\textbf{MILP}} Branching. It groups branch-and-bound nodes based on their feature distributions and trains a GCNN-based discriminative model to progressively separate nodes across groups, learning finer-grained node representations throughout the tree. To address data scarcity and imbalance at upstream nodes, we introduce an upstream-augmented MILP derivation procedure that generates both theoretically equivalent and perturbed instances. \ours~effectively models subtle semantic differences between nodes, significantly enhancing branching accuracy and solving efficiency, particularly for upstream nodes. Extensive experiments on standard MILP benchmarks demonstrate that our method enhances branching accuracy, reduces solving time, and generalizes effectively to unseen instances.

[400] BanglaASTE: A Novel Framework for Aspect-Sentiment-Opinion Extraction in Bangla E-commerce Reviews Using Ensemble Deep Learning

Ariful Islam, Md Rifat Hossen, Abir Ahmed, B M Taslimul Haque

Main category: cs.LG

TL;DR: BanglaASTE is a novel framework for Aspect Sentiment Triplet Extraction in Bangla that creates the first annotated dataset and uses ensemble models to achieve state-of-the-art performance.

Details

Motivation: Aspect-Based Sentiment Analysis for Bangla is significantly underexplored due to lack of comprehensive datasets and specialized frameworks for triplet extraction in this low-resource language.

Method: Created first annotated Bangla ASTE dataset (3,345 reviews), developed hybrid classification with graph-based aspect-opinion matching and semantic similarity, implemented ensemble model combining BanglaBERT embeddings with XGBoost.

Result: Ensemble approach achieved 89.9% accuracy and 89.1% F1-score, significantly outperforming baseline models across all evaluation metrics.

Conclusion: The research advances state-of-the-art in low-resource language sentiment analysis and provides scalable solution for Bangla e-commerce analytics, effectively addressing challenges like informal expressions and spelling variations.

Abstract: Aspect-Based Sentiment Analysis (ABSA) has emerged as a critical tool for extracting fine-grained sentiment insights from user-generated content, particularly in e-commerce and social media domains. However, research on Bangla ABSA remains significantly underexplored due to the absence of comprehensive datasets and specialized frameworks for triplet extraction in this language. This paper introduces BanglaASTE, a novel framework for Aspect Sentiment Triplet Extraction (ASTE) that simultaneously identifies aspect terms, opinion expressions, and sentiment polarities from Bangla product reviews. Our contributions include: (1) creation of the first annotated Bangla ASTE dataset containing 3,345 product reviews collected from major e-commerce platforms including Daraz, Facebook, and Rokomari; (2) development of a hybrid classification framework that employs graph-based aspect-opinion matching with semantic similarity techniques; and (3) implementation of an ensemble model combining BanglaBERT contextual embeddings with XGBoost boosting algorithms for enhanced triplet extraction performance. Experimental results demonstrate that our ensemble approach achieves superior performance with 89.9% accuracy and 89.1% F1-score, significantly outperforming baseline models across all evaluation metrics. The framework effectively addresses key challenges in Bangla text processing including informal expressions, spelling variations, and data sparsity. This research advances the state-of-the-art in low-resource language sentiment analysis and provides a scalable solution for Bangla e-commerce analytics applications.

[401] Interpretable Fair Clustering

Mudi Jiang, Jiahui Zhou, Xinying Liu, Zengyou He, Zhikui Chen

Main category: cs.LG

TL;DR: Proposes an interpretable fair clustering framework using decision trees with fairness constraints, including a variant that eliminates fairness hyperparameter tuning through post-pruning.

Details

Motivation: Existing fair clustering methods lack interpretability, limiting their use in high-stakes scenarios where understanding clustering decisions is essential.

Method: Integrates fairness constraints into decision tree structure for clustering, with a variant that post-prunes trees constructed without fairness constraints to avoid hyperparameter tuning.

Result: Extensive experiments show competitive clustering performance, improved fairness, interpretability, and ability to handle multiple sensitive attributes under complex fairness constraints.

Conclusion: The method enables equitable and transparent clustering with robust performance under complex fairness requirements, opening new possibilities for interpretable fair clustering.

Abstract: Fair clustering has gained increasing attention in recent years, especially in applications involving socially sensitive attributes. However, existing fair clustering methods often lack interpretability, limiting their applicability in high-stakes scenarios where understanding the rationale behind clustering decisions is essential. In this work, we address this limitation by proposing an interpretable and fair clustering framework, which integrates fairness constraints into the structure of decision trees. Our approach constructs interpretable decision trees that partition the data while ensuring fair treatment across protected groups. To further enhance the practicality of our framework, we also introduce a variant that requires no fairness hyperparameter tuning, achieved through post-pruning a tree constructed without fairness constraints. Extensive experiments on both real-world and synthetic datasets demonstrate that our method not only delivers competitive clustering performance and improved fairness, but also offers additional advantages such as interpretability and the ability to handle multiple sensitive attributes. These strengths enable our method to perform robustly under complex fairness constraints, opening new possibilities for equitable and transparent clustering.

[402] Trustless Federated Learning at Edge-Scale: A Compositional Architecture for Decentralized, Verifiable, and Incentive-Aligned Coordination

Pius Onobhayedo, Paul Osemudiame Oamen

Main category: cs.LG

TL;DR: This paper addresses key gaps in federated learning by proposing cryptographic proofs for aggregation correctness, geometric novelty measurement to prevent gaming, parallel object ownership for scalability, and time-locked policies against retroactive manipulation.

Details

Motivation: The motivation is to realize the democratic vision of distributed AI where edge devices can collectively improve models without surrendering raw data, overcoming current limitations in accountability, economic mechanisms, scalability, and governance.

Method: The method uses cryptographic receipts to prove aggregation correctness, geometric novelty measurement to prevent incentive gaming, parallel object ownership to achieve linear scalability, and time-locked policies to check retroactive manipulation.

Result: The proposed approach addresses the compositional gaps in federated learning systems, enabling more secure, scalable, and accountable distributed AI model training.

Conclusion: The work successfully bridges key gaps in federated learning systems, paving the way for realizing the democratic vision of distributed AI creation and improvement through edge devices.

Abstract: Artificial intelligence is retracing the Internet’s path from centralized provision to distributed creation. Initially, resource-intensive computation concentrates within institutions capable of training and serving large models.Eventually, as federated learning matures, billions of edge devices holding sensitive data will be able to collectively improve models without surrendering raw information, enabling both contribution and consumption at scale. This democratic vision remains unrealized due to certain compositional gaps; aggregators handle updates without accountability, economic mechanisms are lacking and even when present remain vulnerable to gaming, coordination serializes state modifications limiting scalability, and governance permits retroactive manipulation. This work addresses these gaps by leveraging cryptographic receipts to prove aggregation correctness, geometric novelty measurement to prevent incentive gaming, parallel object ownership to achieve linear scalability, and time-locked policies to check retroactive manipulation.

[403] Subjective Depth and Timescale Transformers: Learning Where and When to Compute

Frederico Wieser, Martin Benfeghoul, Haitham Bou Ammar, Jun Wang, Zafeirios Fountas

Main category: cs.LG

TL;DR: SDT and STT architectures use Bayesian surprise signals to dynamically route computation in Transformers, reducing computation by 75% and KV-cache by 50% through conditional computation.

Details

Motivation: Standard Transformers have rigid, uniform computation allocation that limits efficiency and scalability for large models and long sequences.

Method: SDT uses alternating Decision and Dynamic layers with Bayesian surprise-based routing. STT extends to temporal domain with transition networks predicting residual updates and dynamic block execution.

Result: Both architectures show shift from novelty to prediction-driven gating during training, reduce self-attention computation by 75% and KV-cache requirements by 50%.

Conclusion: The architectures provide a flexible framework for efficient conditional computation, offering insights into compute-accuracy trade-offs and setting a pathway for more efficient models.

Abstract: The rigid, uniform allocation of computation in standard Transformer (TF) architectures can limit their efficiency and scalability, particularly for large-scale models and long sequences. Addressing this, we introduce Subjective Depth Transformers (SDT) and Subjective Timescale Transformers (STT), two distinct architectures that leverage Bayesian surprise signals to dynamically route computation, learning where and when to compute within decoder-only TFs. SDT augments a decoder-only stack with alternating Decision and Dynamic layers: a Decision layer computes a full block ‘posterior’ and a lightweight ‘prior,’ while a Dynamic layer employs fixed-capacity Top-K routing based on Bayesian surprise (Expected and Unexpected Change), maintaining a static compute graph. STT extends this conditional computation to the temporal domain: a transition network predicts residual updates, forming a temporal ‘change hypothesis’ that informs a router to dynamically execute or bypass TF blocks for each token, managing KV-cache contributions. Both architectures exhibit the predicted shift from novelty to prediction driven gating over training, suggesting alignment with surprise based principles. While operating at reduced capacity, they offer preliminary insights into the compute-accuracy trade-offs of conditional computation. The proposed architectures establish a flexible framework for efficiency, reducing self-attention computation by 75% and KV-cache requirements by 50% within each compute skipping layer, setting a pathway for more efficient models.

Mengran Li, Zelin Zang, Wenbin Xing, Junzhou Chen, Ronghui Zhang, Jiebo Luo, Stan Z. Li

Main category: cs.LG

TL;DR: CHMR is a robust framework that models hierarchical dependencies between molecules and cellular responses using tree-structured vector quantization, achieving significant improvements in molecular property prediction.

Details

Motivation: Existing methods focus only on chemical structures, ignoring cellular responses, and current cell-aware approaches suffer from modality incompleteness and insufficient modeling of hierarchical dependencies across molecular, cellular, and genomic levels.

Method: CHMR jointly models local-global dependencies between molecules and cellular responses and captures latent biological hierarchies via a novel tree-structured vector quantization module.

Result: Evaluated on nine public benchmarks spanning 728 tasks, CHMR outperforms state-of-the-art baselines with average improvements of 3.6% on classification and 17.2% on regression tasks.

Conclusion: CHMR demonstrates the advantage of hierarchy-aware, multimodal learning for reliable and biologically grounded molecular representations, offering a generalizable framework for integrative biomedical modeling.

Abstract: Understanding how chemical perturbations propagate through biological systems is essential for robust molecular property prediction. While most existing methods focus on chemical structures alone, recent advances highlight the crucial role of cellular responses such as morphology and gene expression in shaping drug effects. However, current cell-aware approaches face two key limitations: (1) modality incompleteness in external biological data, and (2) insufficient modeling of hierarchical dependencies across molecular, cellular, and genomic levels. We propose CHMR (Cell-aware Hierarchical Multi-modal Representations), a robust framework that jointly models local-global dependencies between molecules and cellular responses and captures latent biological hierarchies via a novel tree-structured vector quantization module. Evaluated on nine public benchmarks spanning 728 tasks, CHMR outperforms state-of-the-art baselines, yielding average improvements of 3.6% on classification and 17.2% on regression tasks. These results demonstrate the advantage of hierarchy-aware, multimodal learning for reliable and biologically grounded molecular representations, offering a generalizable framework for integrative biomedical modeling. The code is in https://github.com/limengran98/CHMR.

[405] Privacy in Federated Learning with Spiking Neural Networks

Dogukan Aksu, Jesus Martinez del Rincon, Ihsen Alouani

Main category: cs.LG

TL;DR: SNNs with surrogate gradients show significantly reduced gradient leakage vulnerability compared to ANNs, making them inherently more privacy-preserving for federated learning.

Details

Motivation: To investigate the privacy implications of gradient inversion attacks in SNNs, which remain unexplored despite SNNs' growing use in edge AI and federated learning scenarios.

Method: Adapted gradient leakage attacks to the spike domain and conducted comprehensive empirical study across diverse data domains, comparing SNNs with surrogate gradients against conventional ANNs.

Result: SNN gradients yield noisy, temporally inconsistent reconstructions that fail to recover meaningful spatial or temporal structure, unlike ANN gradients which reliably expose salient input content.

Conclusion: The combination of event-driven dynamics and surrogate-gradient training in SNNs substantially reduces gradient informativeness, highlighting their inherent privacy-preserving potential for neuromorphic computation in federated learning.

Abstract: Spiking neural networks (SNNs) have emerged as prominent candidates for embedded and edge AI. Their inherent low power consumption makes them far more efficient than conventional ANNs in scenarios where energy budgets are tightly constrained. In parallel, federated learning (FL) has become the prevailing training paradigm in such settings, enabling on-device learning while limiting the exposure of raw data. However, gradient inversion attacks represent a critical privacy threat in FL, where sensitive training data can be reconstructed directly from shared gradients. While this vulnerability has been widely investigated in conventional ANNs, its implications for SNNs remain largely unexplored. In this work, we present the first comprehensive empirical study of gradient leakage in SNNs across diverse data domains. SNNs are inherently non-differentiable and are typically trained using surrogate gradients, which we hypothesized would be less correlated with the original input and thus less informative from a privacy perspective. To investigate this, we adapt different gradient leakage attacks to the spike domain. Our experiments reveal a striking contrast with conventional ANNs: whereas ANN gradients reliably expose salient input content, SNN gradients yield noisy, temporally inconsistent reconstructions that fail to recover meaningful spatial or temporal structure. These results indicate that the combination of event-driven dynamics and surrogate-gradient training substantially reduces gradient informativeness. To the best of our knowledge, this work provides the first systematic benchmark of gradient inversion attacks for spiking architectures, highlighting the inherent privacy-preserving potential of neuromorphic computation.

[406] I-GLIDE: Input Groups for Latent Health Indicators in Degradation Estimation

Lucas Thil, Jesse Read, Rim Kaddah, Guillaume Doquet

Main category: cs.LG

TL;DR: Novel framework for health indicator construction with uncertainty quantification and mechanism-specific degradation modeling using indicator groups

Details

Motivation: Existing methods fail to disentangle complex degradation mechanisms in multi-sensor systems or quantify uncertainty in health indicator reliability

Method: Adapts RaPP as health indicator, augments with uncertainty quantification via Monte Carlo dropout and probabilistic latent spaces, and introduces indicator groups to isolate sensor subsets for mechanism-specific degradation modeling (I-GLIDE method)

Result: Outperforms traditional reconstruction error metrics, achieves marked improvements in accuracy and generalizability compared to state-of-the-art methods on aerospace and manufacturing data

Conclusion: Bridges gap between anomaly detection and prognostics, offering principled framework for uncertainty-aware degradation modeling in complex systems with actionable insights into failure pathways

Abstract: Accurate remaining useful life (RUL) prediction hinges on the quality of health indicators (HIs), yet existing methods often fail to disentangle complex degradation mechanisms in multi-sensor systems or quantify uncertainty in HI reliability. This paper introduces a novel framework for HI construction, advancing three key contributions. First, we adapt Reconstruction along Projected Pathways (RaPP) as a health indicator (HI) for RUL prediction for the first time, showing that it outperforms traditional reconstruction error metrics. Second, we show that augmenting RaPP-derived HIs with aleatoric and epistemic uncertainty quantification (UQ) via Monte Carlo dropout and probabilistic latent spaces- significantly improves RUL-prediction robustness. Third, and most critically, we propose indicator groups, a paradigm that isolates sensor subsets to model system-specific degradations, giving rise to our novel method, I-GLIDE which enables interpretable, mechanism-specific diagnostics. Evaluated on data sourced from aerospace and manufacturing systems, our approach achieves marked improvements in accuracy and generalizability compared to state-of-the-art HI methods while providing actionable insights into system failure pathways. This work bridges the gap between anomaly detection and prognostics, offering a principled framework for uncertainty-aware degradation modeling in complex systems.

[407] Robust Gene Prioritization via Fast-mRMR Feature Selection in high-dimensional omics data

Rubén Fernández-Farelo, Jorge Paz-Ruza, Bertha Guijarro-Berdiñas, Amparo Alonso-Betanzos, Alex A. Freitas

Main category: cs.LG

TL;DR: Proposes a robust gene prioritization pipeline using Fast-mRMR feature selection to handle high-dimensional biomedical data, showing significant improvements over existing methods.

Details

Motivation: Existing gene prioritization methods struggle with high dimensionality and incomplete labeling of biomedical data, limiting their effectiveness.

Method: Uses Fast-mRMR feature selection to retain only relevant, non-redundant features for classifiers, enabling simpler models and combination of different biological feature sets.

Result: Experiments on Dietary Restriction datasets show significant improvements over existing methods.

Conclusion: Feature selection is critical for reliable gene prioritization, enabling more robust and efficient AI-based approaches.

Abstract: Gene prioritization (identifying genes potentially associated with a biological process) is increasingly tackled with Artificial Intelligence. However, existing methods struggle with the high dimensionality and incomplete labelling of biomedical data. This work proposes a more robust and efficient pipeline that leverages Fast-mRMR feature selection to retain only relevant, non-redundant features for classifiers. This enables us to build simpler and more effective models, as well as to combine different biological feature sets. Experiments on Dietary Restriction datasets show significant improvements over existing methods, proving that feature selection can be critical for reliable gene prioritization.

[408] A Physics-Informed U-net-LSTM Network for Data-Driven Seismic Response Modeling of Structures

Sutirtha Biswas, Kshitij Kumar Yadav

Main category: cs.LG

TL;DR: A Physics-Informed U-Net LSTM framework that integrates physical laws with deep learning for efficient and accurate seismic response prediction of structures.

Details

Motivation: Traditional FEM has high computational costs limiting scalability, while pure data-driven deep learning models struggle with generalization and capturing underlying physics in seismic analysis.

Method: Proposed a hybrid Physics-Informed U-Net LSTM framework that embeds domain-specific physical constraints into the deep learning process.

Result: The model achieves improved predictive performance over conventional machine learning architectures by bridging data-driven methods with physics-based modeling.

Conclusion: The hybrid approach offers a robust and computationally efficient alternative for seismic response prediction, combining the benefits of both physics-based and data-driven methods.

Abstract: Accurate and efficient seismic response prediction is essential for the design of resilient structures. While the Finite Element Method (FEM) remains the standard for nonlinear seismic analysis, its high computational demands limit its scalability and real time applicability. Recent developments in deep learning, particularly Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short Term Memory (LSTM) models, have shown promise in reducing the computational cost of nonlinear seismic analysis of structures. However, these data driven models often struggle to generalize and capture the underlying physics, leading to reduced reliability. We propose a novel Physics Informed U Net LSTM framework that integrates physical laws with deep learning to enhance both accuracy and efficiency. By embedding domain specific constraints into the learning process, the proposed model achieves improved predictive performance over conventional Machine Learning architectures. This hybrid approach bridges the gap between purely data driven methods and physics based modeling, offering a robust and computationally efficient alternative for seismic response prediction of structures.

[409] Sawtooth Sampling for Time Series Denoising Diffusion Implicit Models

Heiko Oppel, Andreas Spilz, Michael Munz

Main category: cs.LG

TL;DR: Combining implicit diffusion models with a novel Sawtooth Sampler to accelerate DDPM sampling by 30x while improving generated sequence quality for classification tasks.

Details

Motivation: DDPMs can generate synthetic timeseries data to improve classifier performance, but their sampling process is computationally expensive.

Method: Implicit diffusion models combined with a novel Sawtooth Sampler that accelerates the reverse process and can be applied to any pretrained diffusion model.

Result: Achieves 30 times speed-up over standard baseline while enhancing quality of generated sequences for classification tasks.

Conclusion: The proposed approach successfully addresses computational inefficiency in DDPM sampling while maintaining or improving data quality for classification applications.

Abstract: Denoising Diffusion Probabilistic Models (DDPMs) can generate synthetic timeseries data to help improve the performance of a classifier, but their sampling process is computationally expensive. We address this by combining implicit diffusion models with a novel Sawtooth Sampler that accelerates the reverse process and can be applied to any pretrained diffusion model. Our approach achieves a 30 times speed-up over the standard baseline while also enhancing the quality of the generated sequences for classification tasks.

[410] TSGM: Regular and Irregular Time-series Generation using Score-based Generative Models

Haksoo Lim, Jaehoon Lee, Sewon Park, Minjung Kim, Noseong Park

Main category: cs.LG

TL;DR: Score-based generative models applied to time-series synthesis with flexible framework for both regular and irregular time-series, achieving state-of-the-art performance.

Details

Motivation: To leverage the success of score-based generative models (SGMs) in other domains for time-series synthesis, addressing the need for high-quality and diverse time-series generation.

Method: Developed a conditional score network for time-series synthesis with tailored denoising score matching loss, designed to handle both regular and irregular time-series with minimal model changes.

Result: Exceptional synthesis performance on various time-series datasets, achieving state-of-the-art sampling diversity and quality.

Conclusion: The proposed framework successfully extends SGMs to time-series synthesis, demonstrating superior performance and flexibility for handling different types of time-series data.

Abstract: Score-based generative models (SGMs) have demonstrated unparalleled sampling quality and diversity in numerous fields, such as image generation, voice synthesis, and tabular data synthesis, etc. Inspired by those outstanding results, we apply SGMs to synthesize time-series by learning its conditional score function. To this end, we present a conditional score network for time-series synthesis, deriving a denoising score matching loss tailored for our purposes. In particular, our presented denoising score matching loss is the conditional denoising score matching loss for time-series synthesis. In addition, our framework is such flexible that both regular and irregular time-series can be synthesized with minimal changes to our model design. Finally, we obtain exceptional synthesis performance on various time-series datasets, achieving state-of-the-art sampling diversity and quality.

[411] Masks Can Be Distracting: On Context Comprehension in Diffusion Language Models

Julianna Piskorz, Cristina Pinneri, Alvaro Correia, Motasem Alfarra, Risheek Garrepalli, Christos Louizos

Main category: cs.LG

TL;DR: MDLMs have locality bias favoring local context and are distracted by mask tokens during generation, but a mask-agnostic loss can improve their context comprehension.

Details

Motivation: To examine the context comprehension abilities of Masked Diffusion Language Models (MDLMs) as an alternative to Autoregressive Language Models (ARLMs), and identify their limitations.

Method: Systematic ablations to study locality bias and mask token effects, plus introducing a mask-agnostic loss function for fine-tuning to improve robustness.

Result: MDLMs exhibit strong locality bias favoring local context and performance degrades with appended mask tokens, but the mask-agnostic loss substantially mitigates these issues.

Conclusion: Current MDLM training has critical limitations in context comprehension, but the proposed mask-agnostic approach provides actionable improvements for building stronger diffusion-based language models.

Abstract: Masked Diffusion Language Models (MDLMs) have recently emerged as a promising alternative to Autoregressive Language Models (ARLMs), leveraging a denoising objective that, in principle, should enable more uniform context utilisation. In this work, we examine the context comprehension abilities of MDLMs and uncover two key limitations. First, despite their more global training objective and bidirectional attention mechanism, similarly to ARLMS, MDLMs exhibit a strong locality bias: performance is highly sensitive to the position of relevant information within the input, favouring local over distant context. Second, we show that appending a large number of mask tokens–required for generation–can significantly degrade context comprehension. Through systematic ablations, we find that these masks act as distractors, reducing the model’s ability to process relevant information. To address this, we introduce a mask-agnostic loss function that encourages predictions to remain invariant to the number of appended masks. Fine-tuning with this objective substantially mitigates the distracting effect of masks, improving robustness of MDLMs. Overall, our findings reveal critical limitations of the current MDLM training paradigm and provide actionable insights for building diffusion-based language models with stronger context comprehension.

[412] Best Practices for Machine Learning Experimentation in Scientific Applications

Umberto Michelucci, Francesca Venturini

Main category: cs.LG

TL;DR: A practical guide for conducting reproducible ML experiments in scientific research, focusing on fair comparisons and transparent reporting.

Details

Motivation: Poor experiment design and documentation in ML research can lead to unreliable results and misleading conclusions about model performance.

Method: Proposes a structured workflow from dataset preparation to model evaluation, introducing metrics like Logarithmic Overfitting Ratio (LOR) and Composite Overfitting Score (COS) to account for overfitting and instability.

Result: Provides recommended practices and example reporting formats to help researchers establish robust baselines and ensure reproducibility.

Conclusion: This guide supports researchers in conducting valid ML experiments and drawing evidence-based insights for scientific applications.

Abstract: Machine learning (ML) is increasingly adopted in scientific research, yet the quality and reliability of results often depend on how experiments are designed and documented. Poor baselines, inconsistent preprocessing, or insufficient validation can lead to misleading conclusions about model performance. This paper presents a practical and structured guide for conducting ML experiments in scientific applications, focussing on reproducibility, fair comparison, and transparent reporting. We outline a step-by-step workflow, from dataset preparation to model selection and evaluation, and propose metrics that account for overfitting and instability across validation folds, including the Logarithmic Overfitting Ratio (LOR) and the Composite Overfitting Score (COS). Through recommended practices and example reporting formats, this work aims to support researchers in establishing robust baselines and drawing valid evidence-based insights from ML models applied to scientific problems.

[413] Hybrid-AIRL: Enhancing Inverse Reinforcement Learning with Supervised Expert Guidance

Bram Silue, Santiago Amaya-Corredor, Patrick Mannion, Lander Willem, Pieter Libin

Main category: cs.LG

TL;DR: H-AIRL enhances AIRL by adding supervised loss and stochastic regularization, achieving better sample efficiency and stability in complex imperfect-information domains like poker.

Details

Motivation: AIRL struggles with sparse, delayed rewards in complex imperfect-information settings like poker, needing improved reward inference capabilities.

Method: Hybrid-AIRL (H-AIRL) extends AIRL by incorporating supervised loss from expert data and stochastic regularization mechanism.

Result: H-AIRL achieves higher sample efficiency and more stable learning compared to AIRL on Gymnasium benchmarks and HULHE poker.

Conclusion: Incorporating supervised signals into inverse RL is beneficial, making H-AIRL a promising framework for challenging real-world settings.

Abstract: Adversarial Inverse Reinforcement Learning (AIRL) has shown promise in addressing the sparse reward problem in reinforcement learning (RL) by inferring dense reward functions from expert demonstrations. However, its performance in highly complex, imperfect-information settings remains largely unexplored. To explore this gap, we evaluate AIRL in the context of Heads-Up Limit Hold’em (HULHE) poker, a domain characterized by sparse, delayed rewards and significant uncertainty. In this setting, we find that AIRL struggles to infer a sufficiently informative reward function. To overcome this limitation, we contribute Hybrid-AIRL (H-AIRL), an extension that enhances reward inference and policy learning by incorporating a supervised loss derived from expert data and a stochastic regularization mechanism. We evaluate H-AIRL on a carefully selected set of Gymnasium benchmarks and the HULHE poker setting. Additionally, we analyze the learned reward function through visualization to gain deeper insights into the learning process. Our experimental results show that H-AIRL achieves higher sample efficiency and more stable learning compared to AIRL. This highlights the benefits of incorporating supervised signals into inverse RL and establishes H-AIRL as a promising framework for tackling challenging, real-world settings.

[414] The Directed Prediction Change - Efficient and Trustworthy Fidelity Assessment for Local Feature Attribution Methods

Kevin Iselborn, David Dembinsky, Adriano Lucieri, Andreas Dengel

Main category: cs.LG

TL;DR: Proposed Directed Prediction Change (DPC) metric for evaluating explanation fidelity, achieving 10x speedup and eliminating randomness compared to existing Infidelity metric.

Details

Motivation: Existing fidelity metrics like Infidelity use Monte Carlo approximation which requires many model evaluations and introduces uncertainty due to random sampling, making them unreliable for high-stakes medical settings.

Method: Modified existing Prediction Change (PC) metric by incorporating direction of both perturbation and attribution to create Directed Prediction Change (DPC) metric within Guided Perturbation Experiment framework.

Result: DPC achieves almost tenfold speedup and eliminates randomness, providing deterministic evaluation. Evaluated on 4,744 explanations across medical images and financial data, showing DPC and PC together enable holistic evaluation of explanation methods.

Conclusion: DPC provides a deterministic, trustworthy, and computationally efficient evaluation procedure for explanation fidelity that measures the same property as local Infidelity but without randomness.

Abstract: The utility of an explanation method critically depends on its fidelity to the underlying machine learning model. Especially in high-stakes medical settings, clinicians and regulators require explanations that faithfully reflect the model’s decision process. Existing fidelity metrics such as Infidelity rely on Monte Carlo approximation, which demands numerous model evaluations and introduces uncertainty due to random sampling. This work proposes a novel metric for evaluating the fidelity of local feature attribution methods by modifying the existing Prediction Change (PC) metric within the Guided Perturbation Experiment. By incorporating the direction of both perturbation and attribution, the proposed Directed Prediction Change (DPC) metric achieves an almost tenfold speedup and eliminates randomness, resulting in a deterministic and trustworthy evaluation procedure that measures the same property as local Infidelity. DPC is evaluated on two datasets (skin lesion images and financial tabular data), two black-box models, seven explanation algorithms, and a wide range of hyperparameters. Across $4,744$ distinct explanations, the results demonstrate that DPC, together with PC, enables a holistic and computationally efficient evaluation of both baseline-oriented and local feature attribution methods, while providing deterministic and reproducible outcomes.

[415] BanglaMM-Disaster: A Multimodal Transformer-Based Deep Learning Framework for Multiclass Disaster Classification in Bangla

Ariful Islam, Md Rifat Hossen, Md. Mahmudul Arif, Abdullah Al Noman, Md Arifur Rahman

Main category: cs.LG

TL;DR: BanglaMM-Disaster is a multimodal deep learning framework for disaster classification in Bangla using both text and images from social media, achieving 83.76% accuracy and outperforming single-modality baselines.

Details

Motivation: Natural disasters are a major challenge for Bangladesh, requiring real-time monitoring and quick response systems. Current approaches often rely on single data modalities, limiting effectiveness.

Method: End-to-end deep learning framework combining transformer-based text encoders (BanglaBERT, mBERT, XLM-RoBERTa) with CNN backbones (ResNet50, DenseNet169, MobileNetV2) using early fusion on a new dataset of 5,037 Bangla social media posts.

Result: Best model achieves 83.76% accuracy, surpassing text-only baseline by 3.84% and image-only baseline by 16.91%. Shows reduced misclassification across all classes with notable improvements for ambiguous examples.

Conclusion: The work fills a key gap in Bangla multimodal disaster analysis and demonstrates the benefits of combining multiple data types for real-time disaster response in low-resource settings.

Abstract: Natural disasters remain a major challenge for Bangladesh, so real-time monitoring and quick response systems are essential. In this study, we present BanglaMM-Disaster, an end-to-end deep learning-based multimodal framework for disaster classification in Bangla, using both textual and visual data from social media. We constructed a new dataset of 5,037 Bangla social media posts, each consisting of a caption and a corresponding image, annotated into one of nine disaster-related categories. The proposed model integrates transformer-based text encoders, including BanglaBERT, mBERT, and XLM-RoBERTa, with CNN backbones such as ResNet50, DenseNet169, and MobileNetV2, to process the two modalities. Using early fusion, the best model achieves 83.76% accuracy. This surpasses the best text-only baseline by 3.84% and the image-only baseline by 16.91%. Our analysis also shows reduced misclassification across all classes, with noticeable improvements for ambiguous examples. This work fills a key gap in Bangla multimodal disaster analysis and demonstrates the benefits of combining multiple data types for real-time disaster response in low-resource settings.

[416] Controlling changes to attention logits

Ben Anson, Laurence Aitchison

Main category: cs.LG

TL;DR: Applying parameter-dependent learning rates to query and key weights stabilizes transformer training by controlling logit changes, enabling higher base learning rates and competitive performance with QK normalization.

Details

Motivation: QK normalization fixes stability issues but is incompatible with Multi Latent Attention (MLA) due to requiring full materialization of queries and keys during inference. There's a need for alternative stabilization methods that work with MLA.

Method: Assign parameter-dependent learning rates to query and key weights to control changes to logits, which addresses stability issues without requiring full materialization of queries and keys.

Result: The intervention allows increasing the base learning rate, outperforms other methods in MLA setting, and achieves performance competitive with QK norm when using Multi-head Attention.

Conclusion: Controlling logit changes through parameter-dependent learning rates for query/key weights provides an effective and cheap stabilization method that works with MLA and achieves competitive performance.

Abstract: Stability of neural network weights is critical when training transformer models. The query and key weights are particularly problematic, as they tend to grow large without any intervention. Applying normalization to queries and keys, known as `QK norm’, fixes stability issues in practice, but is not always applicable. For example, QK norm is not compatible with Multi Latent Attention (MLA) because QK norm requires full materialization of queries and keys during inference, which is not done in MLA. In this paper we suggest that controlling the changes to logits is important for stability. We show that these changes are controllable by assigning parameter-dependent learning rates to the query and key weights. We find that our cheap intervention allows us to increase the base learning rate of the network, outperform other methods in the MLA setting, and achieve performance competitive with QK norm when using Multi-head Attention.

[417] Anomaly Detection with Adaptive and Aggressive Rejection for Contaminated Training Data

Jungi Lee, Jungkwon Kim, Chi Zhang, Kwangsun Yoo, Seok-Joo Byun

Main category: cs.LG

TL;DR: AAR is a novel anomaly detection method that dynamically excludes contaminated data using modified z-scores and Gaussian mixture models, outperforming state-of-the-art methods by 0.041 AUROC.

Details

Motivation: Traditional anomaly detection models assume clean training data, but real-world datasets often contain contamination. Fixed contamination ratio assumptions fail in noisy environments where normal and abnormal distributions overlap, severely degrading performance.

Method: Proposes Adaptive and Aggressive Rejection (AAR) using modified z-score and Gaussian mixture model-based thresholds to dynamically exclude anomalies. Integrates hard and soft rejection strategies to balance preserving normal data while excluding anomalies.

Result: Extensive experiments on two image datasets and thirty tabular datasets show AAR outperforms state-of-the-art methods by 0.041 AUROC, demonstrating enhanced robustness against contaminated datasets.

Conclusion: AAR provides a scalable and reliable solution for handling contaminated data in anomaly detection, enabling broader real-world applications in security and healthcare domains.

Abstract: Handling contaminated data poses a critical challenge in anomaly detection, as traditional models assume training on purely normal data. Conventional methods mitigate contamination by relying on fixed contamination ratios, but discrepancies between assumed and actual ratios can severely degrade performance, especially in noisy environments where normal and abnormal data distributions overlap. To address these limitations, we propose Adaptive and Aggressive Rejection (AAR), a novel method that dynamically excludes anomalies using a modified z-score and Gaussian mixture model-based thresholds. AAR effectively balances the trade-off between preserving normal data and excluding anomalies by integrating hard and soft rejection strategies. Extensive experiments on two image datasets and thirty tabular datasets demonstrate that AAR outperforms the state-of-the-art method by 0.041 AUROC. By providing a scalable and reliable solution, AAR enhances robustness against contaminated datasets, paving the way for broader real-world applications in domains such as security and healthcare.

[418] SUPN: Shallow Universal Polynomial Networks

Zachary Morrow, Michael Penwarden, Brian Chen, Aurya Javeed, Akil Narayan, John D. Jakeman

Main category: cs.LG

TL;DR: SUPNs (shallow universal polynomial networks) use polynomial layers to achieve better function approximation with fewer parameters than DNNs and KANs, reducing approximation error and variability.

Details

Motivation: DNNs and KANs require many parameters, leading to optimization challenges, local minima, and sensitivity to initialization. SUPNs aim to provide sufficient expressivity with fewer parameters.

Method: Replace all but the last hidden layer with a single polynomial layer with learnable coefficients, combining strengths of DNNs and polynomials. Derive explicit formulas for quasi-optimal parameters.

Result: SUPNs converge at same rate as best polynomial approximation. In extensive experiments (13,000+ models), SUPNs achieve lower approximation error and variability than DNNs and KANs by an order of magnitude, even outperforming polynomial projection on non-smooth functions.

Conclusion: SUPNs offer a more efficient alternative to DNNs and KANs for function approximation, providing better accuracy with fewer parameters and reduced sensitivity to initialization.

Abstract: Deep neural networks (DNNs) and Kolmogorov-Arnold networks (KANs) are popular methods for function approximation due to their flexibility and expressivity. However, they typically require a large number of trainable parameters to produce a suitable approximation. Beyond making the resulting network less transparent, overparameterization creates a large optimization space, likely producing local minima in training that have quite different generalization errors. In this case, network initialization can have an outsize impact on the model’s out-of-sample accuracy. For these reasons, we propose shallow universal polynomial networks (SUPNs). These networks replace all but the last hidden layer with a single layer of polynomials with learnable coefficients, leveraging the strengths of DNNs and polynomials to achieve sufficient expressivity with far fewer parameters. We prove that SUPNs converge at the same rate as the best polynomial approximation of the same degree, and we derive explicit formulas for quasi-optimal SUPN parameters. We complement theory with an extensive suite of numerical experiments involving SUPNs, DNNs, KANs, and polynomial projection in one, two, and ten dimensions, consisting of over 13,000 trained models. On the target functions we numerically studied, for a given number of trainable parameters, the approximation error and variability are often lower for SUPNs than for DNNs and KANs by an order of magnitude. In our examples, SUPNs even outperform polynomial projection on non-smooth functions.

[419] Ensemble Performance Through the Lens of Linear Independence of Classifier Votes in Data Streams

Enes Bektas, Fazli Can

Main category: cs.LG

TL;DR: This paper analyzes ensemble size optimization through linear independence of classifier votes, deriving theoretical estimates for optimal ensemble sizes and validating them experimentally.

Details

Motivation: To address computational inefficiency and diminishing returns in large ensembles by understanding the relationship between ensemble size and performance through linear independence of classifier votes.

Method: Proposed modeling linear independence among classifier outputs, derived theoretical framework for ensemble size optimization, and validated through experiments with OzaBagging and GOOWE on real-world and synthetic datasets.

Result: Theoretical estimate effectively identifies performance saturation point for robust ensembles like OzaBagging, but reveals algorithmic instability in complex weighting schemes like GOOWE when high theoretical diversity is achieved.

Conclusion: Linear independence provides a theoretical foundation for determining optimal ensemble size, with practical implications varying by ensemble method complexity.

Abstract: Ensemble learning improves classification performance by combining multiple base classifiers. While increasing the number of classifiers generally enhances accuracy, excessively large ensembles can lead to computational inefficiency and diminishing returns. This paper investigates the relationship between ensemble size and performance through the lens of linear independence among classifier votes in data streams. We propose that ensembles composed of linearly independent classifiers maximize representational capacity, particularly under a geometric model. We then generalize the importance of linear independence to the weighted majority voting problem. By modeling the probability of achieving linear independence among classifier outputs, we derive a theoretical framework that explains the trade-off between ensemble size and accuracy. Our analysis leads to a theoretical estimate of the ensemble size required to achieve a user-specified probability of linear independence. We validate our theory through experiments on both real-world and synthetic datasets using two ensemble methods, OzaBagging and GOOWE. Our results confirm that this theoretical estimate effectively identifies the point of performance saturation for robust ensembles like OzaBagging. Conversely, for complex weighting schemes like GOOWE, our framework reveals that high theoretical diversity can trigger algorithmic instability. Our implementation is publicly available to support reproducibility and future research.

[420] Mean-Field Limits for Two-Layer Neural Networks Trained with Consensus-Based Optimization

William De Deyn, Michael Herty, Giovanni Samaey

Main category: cs.LG

TL;DR: The paper studies two-layer neural networks trained with consensus-based optimization (CBO), compares it with Adam, proposes a hybrid approach, extends CBO to multi-task learning with reduced memory overhead, and analyzes the mean-field limit in optimal transport framework.

Details

Motivation: To investigate particle-based optimization methods for neural networks and develop more efficient training approaches by combining CBO with Adam, while also addressing memory efficiency in multi-task learning scenarios.

Method: Uses consensus-based optimization (CBO) for training two-layer neural networks, compares with Adam optimizer, develops hybrid CBO-Adam approach, extends CBO to multi-task learning with reduced memory overhead, and analyzes mean-field limit using optimal transport framework.

Result: Hybrid CBO-Adam approach provides faster convergence than CBO alone, multi-task CBO formulation reduces memory overhead, and mean-field analysis shows monotonic variance decrease in the infinite particle limit.

Conclusion: CBO is a viable alternative to Adam for neural network training, with hybrid approaches offering improved convergence, and the mean-field analysis provides theoretical foundation for the method’s behavior in the infinite particle regime.

Abstract: We study two-layer neural networks and train these with a particle-based method called consensus-based optimization (CBO). We compare the performance of CBO against Adam on two test cases and demonstrate how a hybrid approach, combining CBO with Adam, provides faster convergence than CBO. In the context of multi-task learning, we recast CBO into a formulation that offers less memory overhead. The CBO method allows for a mean-field limit formulation, which we couple with the mean-field limit of the neural network. To this end, we first reformulate CBO within the optimal transport framework. Finally, in the limit of infinitely many particles, we define the corresponding dynamics on the Wasserstein-over-Wasserstein space and show that the variance decreases monotonically.

[421] Lost in Time? A Meta-Learning Framework for Time-Shift-Tolerant Physiological Signal Transformation

Qian Hong, Cheng Bian, Xiao Zhou, Xiaoyu Li, Yelei Li, Zijing Zeng

Main category: cs.LG

TL;DR: ShiftSyncNet is a meta-learning framework that automatically corrects temporal misalignment in multimodal physiological signal transformation (e.g., PPG/BCG to ABP), improving transformation accuracy by learning time offsets and applying Fourier phase shifts.

Details

Motivation: Temporal misalignment in multimodal signal transformation impairs accuracy in capturing critical features like ABP peaks, and existing synchronization methods rely on strong similarity assumptions or manual tuning while LNL approaches fail under time-shifted supervision.

Method: Meta-learning-based bi-level optimization framework with transformation network (TransNet) and time-shift correction network (SyncNet), where SyncNet learns time offsets between training pairs and applies Fourier phase shifts to align supervision signals.

Result: Outperforms strong baselines by 9.4%, 6.0%, and 12.8% on one real-world industrial dataset and two public datasets, demonstrating effectiveness in correcting time shifts and improving transformation accuracy.

Conclusion: ShiftSyncNet provides an effective solution for addressing temporal inconsistencies in multimodal physiological transformation, pointing toward a unified direction for handling temporal misalignment challenges.

Abstract: Translating non-invasive signals such as photoplethysmography (PPG) and ballistocardiography (BCG) into clinically meaningful signals like arterial blood pressure (ABP) is vital for continuous, low-cost healthcare monitoring. However, temporal misalignment in multimodal signal transformation impairs transformation accuracy, especially in capturing critical features like ABP peaks. Conventional synchronization methods often rely on strong similarity assumptions or manual tuning, while existing Learning with Noisy Labels (LNL) approaches are ineffective under time-shifted supervision, either discarding excessive data or failing to correct label shifts. To address this challenge, we propose ShiftSyncNet, a meta-learning-based bi-level optimization framework that automatically mitigates performance degradation due to time misalignment. It comprises a transformation network (TransNet) and a time-shift correction network (SyncNet), where SyncNet learns time offsets between training pairs and applies Fourier phase shifts to align supervision signals. Experiments on one real-world industrial dataset and two public datasets show that ShiftSyncNet outperforms strong baselines by 9.4%, 6.0%, and 12.8%, respectively. The results highlight its effectiveness in correcting time shifts, improving label quality, and enhancing transformation accuracy across diverse misalignment scenarios, pointing toward a unified direction for addressing temporal inconsistencies in multimodal physiological transformation.

[422] IntAttention: A Fully Integer Attention Pipeline for Efficient Edge Inference

Wanli Zhong, Haibo Feng, Zirui Zhou, Hanyang Peng, Shiqi Yu

Main category: cs.LG

TL;DR: IntAttention is a fully integer attention pipeline that eliminates softmax bottlenecks in Transformer inference on edge devices, achieving up to 3.7x speedup and 61% energy reduction without retraining.

Details

Motivation: Transformer deployment on edge devices is limited by softmax bottlenecks, which cause costly dequantize-softmax-requantize operations that can account for 65% of attention latency and disrupt integer dataflow efficiency.

Method: IntAttention uses IndexSoftmax, a hardware-friendly operator that replaces floating-point exponentials with integer operations, integrating sparsity-aware clipping, 32-entry lookup-table approximation, and direct integer normalization to eliminate datatype conversions.

Result: Achieves up to 3.7x speedup and 61% energy reduction over FP16 baselines, and 2.0x faster than conventional INT8 attention pipelines on Armv8 CPUs, with comparable accuracy across language and vision models.

Conclusion: IntAttention enables practical and efficient Transformer inference on commodity edge devices through fully integer attention pipelines without retraining requirements.

Abstract: Deploying Transformer models on edge devices is limited by latency and energy budgets. While INT8 quantization effectively accelerates the primary matrix multiplications, it exposes the softmax as the dominant bottleneck. This stage incurs a costly dequantize-softmax-requantize detour, which can account for up to 65% of total attention latency and disrupts the end-to-end integer dataflow critical for edge hardware efficiency. To address this limitation, we present IntAttention, the first fully integer, plug-and-play attention pipeline without retraining. At the core of our approach lies IndexSoftmax, a hardware-friendly operator that replaces floating-point exponentials entirely within the integer domain. IntAttention integrates sparsity-aware clipping, a 32-entry lookup-table approximation, and direct integer normalization, thereby eliminating all datatype conversion overhead. We evaluate IntAttention and demonstrate consistent and substantial gains. Our method achieves up to 3.7x speedup and 61% energy reduction over FP16 baselines and 2.0x faster than conventional INT8 attention pipelines on Armv8 CPUs. These gains are achieved with high-fidelity accuracy comparable to baselines across diverse language and vision models, enabling practical and efficient Transformer inference on commodity edge devices. Code will be released in later version of this work.

[423] Mechanistic Interpretability for Transformer-based Time Series Classification

Matīss Kalnāre, Sofoklis Kitharidis, Thomas Bäck, Niki van Stein

Main category: cs.LG

TL;DR: The paper adapts Mechanistic Interpretability techniques from NLP to transformers for time series classification to reveal internal causal structures and decision-making processes.

Details

Motivation: Transformer models excel in time series classification but their complex internal decision-making remains poorly understood, with existing explainability methods focusing mainly on input-output attributions rather than internal mechanisms.

Method: Adapted activation patching, attention saliency, and sparse autoencoders from NLP to time series transformers; systematically probed causal roles of attention heads and timesteps; constructed causal graphs to trace information flow.

Result: Revealed causal structures within transformer models, identified key attention heads and temporal positions driving correct classifications, and demonstrated sparse autoencoders’ potential for discovering interpretable latent features.

Conclusion: Provides methodological contributions to transformer interpretability and novel insights into the functional mechanics underlying transformer performance in time series classification.

Abstract: Transformer-based models have become state-of-the-art tools in various machine learning tasks, including time series classification, yet their complexity makes understanding their internal decision-making challenging. Existing explainability methods often focus on input-output attributions, leaving the internal mechanisms largely opaque. This paper addresses this gap by adapting various Mechanistic Interpretability techniques; activation patching, attention saliency, and sparse autoencoders, from NLP to transformer architectures designed explicitly for time series classification. We systematically probe the internal causal roles of individual attention heads and timesteps, revealing causal structures within these models. Through experimentation on a benchmark time series dataset, we construct causal graphs illustrating how information propagates internally, highlighting key attention heads and temporal positions driving correct classifications. Additionally, we demonstrate the potential of sparse autoencoders for uncovering interpretable latent features. Our findings provide both methodological contributions to transformer interpretability and novel insights into the functional mechanics underlying transformer performance in time series classification tasks.

[424] Predictive Safety Shield for Dyna-Q Reinforcement Learning

Jin Pin, Krasowski Hanna, Vanneaux Elena

Main category: cs.LG

TL;DR: A predictive safety shield for model-based RL that uses safe predictions from environment models to update Q-functions locally, improving performance while maintaining hard safety guarantees.

Details

Motivation: Existing safety shields use random sampling or fixed fallback controllers, disregarding future performance implications of different safe actions.

Method: Proposes a predictive safety shield that updates Q-function locally based on safe predictions from safe simulation of environment model in discrete space.

Result: Experiments on gridworld show short prediction horizons suffice to identify optimal paths; approach is robust to distribution shifts without additional training.

Conclusion: The predictive safety shield improves RL performance while maintaining hard safety guarantees and demonstrates robustness to distribution shifts.

Abstract: Obtaining safety guarantees for reinforcement learning is a major challenge to achieve applicability for real-world tasks. Safety shields extend standard reinforcement learning and achieve hard safety guarantees. However, existing safety shields commonly use random sampling of safe actions or a fixed fallback controller, therefore disregarding future performance implications of different safe actions. In this work, we propose a predictive safety shield for model-based reinforcement learning agents in discrete space. Our safety shield updates the Q-function locally based on safe predictions, which originate from a safe simulation of the environment model. This shielding approach improves performance while maintaining hard safety guarantees. Our experiments on gridworld environments demonstrate that even short prediction horizons can be sufficient to identify the optimal path. We observe that our approach is robust to distribution shifts, e.g., between simulation and reality, without requiring additional training.

[425] Context-Specific Causal Graph Discovery with Unobserved Contexts: Non-Stationarity, Regimes and Spatio-Temporal Patterns

Martin Rabel, Jakob Runge

Main category: cs.LG

TL;DR: A framework for analyzing causal graph changes in spatially gridded time series data, addressing non-stationarity issues by modifying constraint-based causal discovery methods while maintaining modularity and extensibility.

Details

Motivation: Real-world data like climate applications often show spatial and temporal variations that encode important information and can negatively affect algorithms assuming stationarity. These variations in causal graphs need to be studied for stability and reliability.

Method: Modifies constraint-based causal discovery approaches at the independence testing level, creating a modular framework that works with existing methods (PC, PC-stable, FCI, PCMCI, PCMCI+, LPCMCI) with minimal changes.

Result: Developed an extremely modular, easily extensible framework that can leverage existing causal discovery methods and systematically address subproblems related to change-point detection, clustering, and independence testing.

Conclusion: The framework provides a principled approach to handle non-stationarity in causal discovery, simplifies understanding of limitations and trade-offs, and will be available as open-source implementation.

Abstract: Real-world data, for example in climate applications, often consists of spatially gridded time series data or data with comparable structure. While the underlying system is often believed to behave similar at different points in space and time, those variations that do exist are twofold relevant: They often encode important information in and of themselves. And they may negatively affect the stability / convergence and reliability\Slash{}validity of results of algorithms assuming stationarity or space-translation invariance. We study the information encoded in changes of the causal graph, with stability in mind. An analysis of this general task identifies two core challenges. We develop guiding principles to overcome these challenges, and provide a framework realizing these principles by modifying constraint-based causal discovery approaches on the level of independence testing. This leads to an extremely modular, easily extensible and widely applicable framework. It can leverage existing constraint-based causal discovery methods (demonstrated on IID-algorithms PC, PC-stable, FCI and time series algorithms PCMCI, PCMCI+, LPCMCI) with little to no modification. The built-in modularity allows to systematically understand and improve upon an entire array of subproblems. By design, it can be extended by leveraging insights from change-point-detection, clustering, independence-testing and other well-studied related problems. The division into more accessible sub-problems also simplifies the understanding of fundamental limitations, hyperparameters controlling trade-offs and the statistical interpretation of results. An open-source implementation will be available soon.

[426] Computing Strategic Responses to Non-Linear Classifiers

Jack Geary, Boyan Gao, Henry Gouk

Main category: cs.LG

TL;DR: A new method for computing best responses in strategic classification using Lagrangian dual optimization, enabling non-linear classifier deployment in strategic settings.

Details

Motivation: Current strategic classification methods are limited to linear classifiers, but non-linear classifiers are often more suitable. The main limitation is the inability to compute best responses in non-linear settings.

Method: Proposed a novel method for computing best responses by optimizing the Lagrangian dual of the Agents’ objective function.

Result: The method reproduces best responses in linear settings and identifies weaknesses in existing approaches. It can be straightforwardly applied to non-linear classifier settings for both evaluation and training.

Conclusion: The Lagrangian dual optimization approach enables effective computation of best responses in strategic classification, facilitating the use of non-linear classifiers in strategic environments.

Abstract: We consider the problem of strategic classification, where the act of deploying a classifier leads to strategic behaviour that induces a distribution shift on subsequent observations. Current approaches to learning classifiers in strategic settings are focused primarily on the linear setting, but in many cases non-linear classifiers are more suitable. A central limitation to progress for non-linear classifiers arises from the inability to compute best responses in these settings. We present a novel method for computing the best response by optimising the Lagrangian dual of the Agents’ objective. We demonstrate that our method reproduces best responses in linear settings, identifying key weaknesses in existing approaches. We present further results demonstrating our method can be straight-forwardly applied to non-linear classifier settings, where it is useful for both evaluation and training.

[427] Machine Learning Approaches to Clinical Risk Prediction: Multi-Scale Temporal Alignment in Electronic Health Records

Wei-Chen Chang, Lu Dai, Ting Xu

Main category: cs.LG

TL;DR: Proposes MSTAN for EHR risk prediction, addressing temporal irregularity and multi-scale dependencies through learnable temporal alignment and multi-scale feature extraction.

Details

Motivation: Address challenges of temporal irregularity, sampling interval differences, and multi-scale dynamic dependencies in Electronic Health Records (EHR) for accurate risk prediction.

Method: Uses learnable temporal alignment mechanism and multi-scale convolutional feature extraction to model long-term trends and short-term fluctuations, with attention-based aggregation for risk representation.

Result: Outperforms mainstream baselines on public EHR datasets in accuracy, recall, precision, and F1-Score, demonstrating effectiveness and robustness.

Conclusion: Provides effective solution for intelligent representation of high-dimensional asynchronous medical sequences and supports EHR-driven clinical risk prediction.

Abstract: This study proposes a risk prediction method based on a Multi-Scale Temporal Alignment Network (MSTAN) to address the challenges of temporal irregularity, sampling interval differences, and multi-scale dynamic dependencies in Electronic Health Records (EHR). The method focuses on temporal feature modeling by introducing a learnable temporal alignment mechanism and a multi-scale convolutional feature extraction structure to jointly model long-term trends and short-term fluctuations in EHR sequences. At the input level, the model maps multi-source clinical features into a unified high-dimensional semantic space and employs temporal embedding and alignment modules to dynamically weight irregularly sampled data, reducing the impact of temporal distribution differences on model performance. The multi-scale feature extraction module then captures key patterns across different temporal granularities through multi-layer convolution and hierarchical fusion, achieving a fine-grained representation of patient states. Finally, an attention-based aggregation mechanism integrates global temporal dependencies to generate individual-level risk representations for disease risk prediction and health status assessment. Experiments conducted on publicly available EHR datasets show that the proposed model outperforms mainstream baselines in accuracy, recall, precision, and F1-Score, demonstrating the effectiveness and robustness of multi-scale temporal alignment in complex medical time-series analysis. This study provides a new solution for intelligent representation of high-dimensional asynchronous medical sequences and offers important technical support for EHR-driven clinical risk prediction.

[428] A decoupled alignment kernel for peptide membrane permeability predictions

Ali Amirahmadi, Gökçe Geylan, Leonardo De Maria, Farzaneh Etminani, Mattias Ohlsson, Alessandro Tibo

Main category: cs.LG

TL;DR: Proposed MD-GAK and PMD-GAK kernels for predicting cyclic peptide cell permeability, focusing on uncertainty estimation using Gaussian Processes and outperforming state-of-the-art models.

Details

Motivation: Cell-membrane permeability is a key bottleneck for cyclic peptides targeting intracellular sites, with limited public data and need for well-calibrated uncertainty estimation.

Method: Developed monomer-aware decoupled global alignment kernel (MD-GAK) that couples residue-residue similarity with sequence alignment while decoupling local matches from gap penalties, plus a variant PMD-GAK with triangular positional prior. Used Gaussian Processes for uncertainty estimation.

Result: The methods outperform state-of-the-art models across all metrics, with PMD-GAK showing additional advantages in reducing calibration errors.

Conclusion: The proposed kernels provide effective and robust approaches for predicting cyclic peptide permeability with well-calibrated uncertainty, offering advantages over complex deep learning architectures.

Abstract: Cyclic peptides are promising modalities for targeting intracellular sites; however, cell-membrane permeability remains a key bottleneck, exacerbated by limited public data and the need for well-calibrated uncertainty. Instead of relying on data-eager complex deep learning architecture, we propose a monomer-aware decoupled global alignment kernel (MD-GAK), which couples chemically meaningful residue-residue similarity with sequence alignment while decoupling local matches from gap penalties. MD-GAK is a relatively simple kernel. To further demonstrate the robustness of our framework, we also introduce a variant, PMD-GAK, which incorporates a triangular positional prior. As we will show in the experimental section, PMD-GAK can offer additional advantages over MD-GAK, particularly in reducing calibration errors. Since our focus is on uncertainty estimation, we use Gaussian Processes as the predictive model, as both MD-GAK and PMD-GAK can be directly applied within this framework. We demonstrate the effectiveness of our methods through an extensive set of experiments, comparing our fully reproducible approach against state-of-the-art models, and show that it outperforms them across all metrics.

[429] Learning When to Stop: Adaptive Latent Reasoning via Reinforcement Learning

Alex Ning, Yen-Ling Kuo, Gabe Gomes

Main category: cs.LG

TL;DR: Latent reasoning reduces reasoning length by 52% without accuracy loss using adaptive-length models and RL optimization.

Details

Motivation: To compress reasoning lengths compared to chain-of-thought reasoning by removing the restriction to human language tokens as reasoning medium.

Method: Developed adaptive-length latent reasoning models with post-SFT reinforcement learning to optimize reasoning length while maintaining accuracy.

Result: Experiments on Llama 3.2 1B model and GSM8K-Aug dataset show 52% reduction in total reasoning length with no accuracy penalty.

Conclusion: Latent reasoning effectively reduces compute usage and demonstrates strong compressive capabilities, with plans to extend to more models and datasets.

Abstract: Latent reasoning represents a new development in Transformer language models that has shown potential in compressing reasoning lengths compared to chain-of-thought reasoning. By directly passing the information-rich previous final latent state into the next sequence, latent reasoning removes the restriction to human language tokens as the medium for reasoning. We develop adaptive-length latent reasoning models and introduce a post-SFT reinforcement-learning methodology to optimize latent reasoning length by minimizing reasoning length while maintaining accuracy. This, in turn, further reduces compute usage and raises the bar on the compressive capabilities of latent reasoning models. Experiments on the Llama 3.2 1B model and the GSM8K-Aug dataset show a $52%$ drop in total reasoning length with no penalty to accuracy. In future work, we plan to extend to additional models and datasets, analyze relationships between training coefficients, experiment with architecture variations, and continue our knowledge distillation for latent reasoning SFT efforts. We make our code and pretrained weights available at https://github.com/apning/adaptive-latent-reasoning.

[430] An AI-Enabled Hybrid Cyber-Physical Framework for Adaptive Control in Smart Grids

Muhammad Siddique, Sohaib Zafar

Main category: cs.LG

TL;DR: A machine learning-based digital forensic framework for smart grids deployed on cloud infrastructure, combining data acquisition, secure communication, cloud storage, and automated forensic analytics to detect and mitigate cyber-attacks in real-time.

Details

Motivation: Smart grids integrate power infrastructure with communication networks, creating vulnerabilities that can undermine grid stability and reliability. Digital forensics is needed to identify, detect, and mitigate security incidents in these cyber-physical systems.

Method: Developed an all-in-one framework using supervised and unsupervised ML algorithms (Random Forest, SVM, Gradient Boosted Trees, deep neural networks) for anomaly detection, event reconstruction, and intrusion analysis. The framework includes sensor-level data acquisition, authenticated communication, scalable cloud storage, and automated forensic analytics.

Result: The framework demonstrated high accuracy, scalability, and resilience against various cyber-attacks including data tampering, false-data injection, and coordinated control-loop manipulation in simulation and experimental studies using real-time smart-meter data streams.

Conclusion: Cloud services provide the best backbone for big-data-driven forensic workflows, enabling energy utilities to achieve fast situational awareness and intelligent incident response in smart grid systems.

Abstract: Smart grids are a fusion of classical power infrastructure and advanced communication networks and smart control, to create a cyber-physical environment that is more efficient and flexible than ever before. This integration causes vulnerabilities that can undermine grid stability as well as reliability. Digital forensics is a fundamental concept of learning and identifying, detecting, and mitigating such security incidents. This paper presents an all-in-one machine learning-based digital forensic framework of smart grid systems deployed on the Cloud. The framework combines the data acquisition at the sensor-level, authenticated communication, scalable cloud storage and automated forensic analytics. The model uses supervised and unsupervised learning algorithms - such as Random Forest, Support Vector Machine, Gradient Boosted Trees and deep neural architectures for anomaly detection, event reconstruction and intrusion analysis in real time. After several simulation and experimental studies on real-time smart-meter data streams, the proposed framework is shown to be very accurate, scalable and resilient to cyber-attacks including data tampering, false-data injection and coordinated control-loop manipulation. The results indicate that cloud services are the best backbone for big-data-driven forensic workflows, which allows energy utilities to achieve a fast situational awareness and intelligent incident response.

[431] Visualizing LLM Latent Space Geometry Through Dimensionality Reduction

Alex Ning, Vainateya Rangaraju

Main category: cs.LG

TL;DR: This paper introduces a method to visualize and analyze latent state geometries in Transformer-based language models using dimensionality reduction techniques like PCA and UMAP.

Details

Motivation: While LLMs achieve state-of-the-art performance, their internal mechanisms remain difficult to interpret, motivating the need for better visualization and analysis tools.

Method: Extract layerwise activations from Transformer blocks and apply dimensionality reduction (PCA and UMAP) to visualize latent state geometries in models like GPT-2 and LLaMa.

Result: Identified clear separation between attention and MLP component outputs, characterized high norm of initial position states, visualized layerwise evolution, and revealed high-dimensional helical structure of positional embeddings.

Conclusion: The approach enables systematic analysis of Transformer internals and supports reproducible interpretability research, with code made publicly available.

Abstract: Large language models (LLMs) achieve state-of-the-art results across many natural language tasks, but their internal mechanisms remain difficult to interpret. In this work, we extract, process, and visualize latent state geometries in Transformer-based language models through dimensionality reduction. We capture layerwise activations at multiple points within Transformer blocks and enable systematic analysis through Principal Component Analysis (PCA) and Uniform Manifold Approximation (UMAP). We demonstrate experiments on GPT-2 and LLaMa models, where we uncover interesting geometric patterns in latent space. Notably, we identify a clear separation between attention and MLP component outputs across intermediate layers, a pattern not documented in prior work to our knowledge. We also characterize the high norm of latent states at the initial sequence position and visualize the layerwise evolution of latent states. Additionally, we demonstrate the high-dimensional helical structure of GPT-2’s positional embeddings, the sequence-wise geometric patterns in LLaMa, and experiment with repeating token sequences. We aim to support systematic analysis of Transformer internals with the goal of enabling further reproducible interpretability research. We make our code available at https://github.com/Vainateya/Feature_Geometry_Visualization.

[432] On the Origin of Algorithmic Progress in AI

Hans Gundlach, Alex Fogelson, Jayson Lynch, Ana Trisovic, Jonathan Rosenfeld, Anmol Sandhu, Neil Thompson

Main category: cs.LG

TL;DR: Algorithms have shown 22,000x efficiency gains in AI training from 2012-2023, but small-scale experiments account for less than 100x. Scaling experiments reveal that most gains come from scale-dependent efficiency improvements, particularly the LSTM-to-Transformer transition.

Details

Motivation: To understand the true sources of algorithmic efficiency gains in AI training, as standard small-scale experiments fail to account for the massive 22,000x improvement observed between 2012-2023.

Method: Conducted small-scale ablation experiments on key innovations, scaling experiments between LSTMs and Transformers, and literature surveys to analyze compute-optimal scaling laws and scale-dependent efficiency improvements.

Result: Accounted for 6,930x efficiency gains (vs. 22,000x total), with the LSTM-to-Transformer transition accounting for most gains. Found that algorithmic efficiency gains are strongly scale-dependent and reference-dependent.

Conclusion: Algorithmic progress for small models has been much slower than assumed, and efficiency measures depend heavily on the scale at which they’re evaluated.

Abstract: Algorithms have been estimated to increase AI training FLOP efficiency by a factor of 22,000 between 2012 and 2023 [Ho et al., 2024]. Running small-scale ablation experiments on key innovations from this time period, we are able to account for less than 10x of these gains. Surveying the broader literature, we estimate that additional innovations not included in our ablations account for less than 10x, yielding a total under 100x. This leads us to conduct scaling experiments, which reveal that much of this efficiency gap can be explained by algorithms with scale-dependent efficiency improvements. In particular, we conduct scaling experiments between LSTMs and Transformers, finding exponent differences in their compute-optimal scaling law while finding little scaling difference for many other innovations. These experiments demonstrate that - contrary to standard assumptions - an algorithm’s efficiency gains are tied to compute scale. Using experimental extrapolation and literature estimates, we account for 6,930x efficiency gains over the same time period, with the scale-dependent LSTM-to-Transformer transition accounting for the majority of gains. Our results indicate that algorithmic progress for small models has been far slower than previously assumed, and that measures of algorithmic efficiency are strongly reference-dependent.

[433] Scale-Agnostic Kolmogorov-Arnold Geometry in Neural Networks

Mathew Vanherreweghe, Michael H. Freedman, Keith M. Adams

Main category: cs.LG

TL;DR: Kolmogorov-Arnold geometric structure emerges spontaneously in 2-layer MLPs during MNIST digit classification training, exhibiting scale-invariant properties across different spatial scales and training procedures.

Details

Motivation: To determine if the previously observed KAG structure in shallow networks on synthetic 3D tasks persists in realistic high-dimensional settings like MNIST, and to characterize its spatial properties.

Method: Extended KAG analysis to 784-dimensional MNIST using 2-layer MLPs with systematic spatial analysis at multiple scales (from 7-pixel neighborhoods to full 28x28 images), comparing standard training and spatial augmentation.

Result: KAG emerges during training and appears consistently across all spatial scales, with the same qualitative pattern observed in both standard training and training with spatial augmentation.

Conclusion: Neural networks spontaneously develop organized, scale-invariant geometric structure during learning on realistic high-dimensional data.

Abstract: Recent work by Freedman and Mulligan demonstrated that shallow multilayer perceptrons spontaneously develop Kolmogorov-Arnold geometric (KAG) structure during training on synthetic three-dimensional tasks. However, it remained unclear whether this phenomenon persists in realistic high-dimensional settings and what spatial properties this geometry exhibits. We extend KAG analysis to MNIST digit classification (784 dimensions) using 2-layer MLPs with systematic spatial analysis at multiple scales. We find that KAG emerges during training and appears consistently across spatial scales, from local 7-pixel neighborhoods to the full 28x28 image. This scale-agnostic property holds across different training procedures: both standard training and training with spatial augmentation produce the same qualitative pattern. These findings reveal that neural networks spontaneously develop organized, scale-invariant geometric structure during learning on realistic high-dimensional data.

[434] Mechanisms of Non-Monotonic Scaling in Vision Transformers

Anantha Padmanaban Krishna Kumar

Main category: cs.LG

TL;DR: Deeper Vision Transformers show a Cliff-Plateau-Climb performance pattern where better performance comes from marginalizing the [CLS] token in favor of distributed patch token consensus, with optimal depth being more important than simply adding parameters.

Details

Motivation: To understand why deeper Vision Transformers often perform worse than shallower ones, challenging common scaling assumptions in transformer architectures.

Method: Systematic empirical analysis of ViT-S, ViT-B, and ViT-L on ImageNet, using an Information Scrambling Index to quantify information mixing patterns and observe representation evolution with depth.

Result: Identified consistent three-phase Cliff-Plateau-Climb pattern; better performance correlates with progressive marginalization of [CLS] token and distributed consensus among patch tokens; ViT-L shows information-task tradeoff emerging 10 layers later than ViT-B.

Conclusion: Transformer architectures benefit more from carefully calibrated depth that enables clean phase transitions than from simply increasing parameter count; Information Scrambling Index serves as useful diagnostic tool for model design.

Abstract: Deeper Vision Transformers often perform worse than shallower ones, which challenges common scaling assumptions. Through a systematic empirical analysis of ViT-S, ViT-B, and ViT-L on ImageNet, we identify a consistent three-phase Cliff-Plateau-Climb pattern that governs how representations evolve with depth. We observe that better performance is associated with progressive marginalization of the [CLS] token, originally designed as a global aggregation hub, in favor of distributed consensus among patch tokens. We quantify patterns of information mixing with an Information Scrambling Index, and show that in ViT-L the information-task tradeoff emerges roughly 10 layers later than in ViT-B, and that these additional layers correlate with increased information diffusion rather than improved task performance. Taken together, these results suggest that transformer architectures in this regime may benefit more from carefully calibrated depth that executes clean phase transitions than from simply increasing parameter count. The Information Scrambling Index provides a useful diagnostic for existing models and suggests a potential design target for future architectures. All code is available at: https://github.com/AnanthaPadmanaban-KrishnaKumar/Cliff-Plateau-Climb.

[435] Aligning LLMs Toward Multi-Turn Conversational Outcomes Using Iterative PPO

Daniel R. Jiang, Jalaj Bhandari, Yukai Yang, Rémi Munos, Tyler Lu

Main category: cs.LG

TL;DR: Proposes Iterative PPO, a method that reduces multi-turn conversational RL to single-turn RLHF problems using learned Q-functions as rewards, enabling stable policy improvement.

Details

Motivation: Optimizing LLMs for multi-turn conversations is challenging due to sparse rewards and the mismatch between response-level planning and token-level generation in goal-oriented settings like AI marketing.

Method: Formal reduction of multi-turn RL to single-turn RLHF problems by using learned multi-turn Q-functions as reward models, then applying standard token-level PPO which is equivalent to policy improvement.

Result: Developed Iterative PPO - a batch online policy iteration algorithm that alternates between fitting Q-functions from logged conversations and improving policy using off-the-shelf RLHF tools.

Conclusion: The method provides a practical middle ground between online and offline approaches, combining adaptability with stability while leveraging existing single-turn RLHF infrastructure.

Abstract: Optimizing large language models (LLMs) for multi-turn conversational outcomes remains a significant challenge, especially in goal-oriented settings like AI marketing or sales agents who facilitate transactions via messaging platforms. The difficulty stems from sparse, long-horizon rewards and the discrepancy between response-level planning and token-level generation. In this technical note, we propose a formal reduction of the multi-turn RL problem into a sequence of single-turn RLHF-style problems. This is achieved by setting a learned multi-turn Q-function as the reward model for the single-turn problem. We demonstrate and prove a key insight: solving this single-turn RL problem with standard token-level PPO is equivalent to a policy improvement step within the multi-turn problem. This insight naturally leads to Iterative PPO, a batch online policy iteration algorithm that alternates between fitting Q-functions from logged conversation trajectories and improving the policy. A major practical advantage is that Iterative PPO directly leverages stable, off-the-shelf single-turn RLHF tools, making it straightforward to implement. Our method occupies a middle ground between fully online and fully offline approaches, retaining the adaptability of online updates while gaining the stability benefits of offline training.

[436] EvilGenie: A Reward Hacking Benchmark

Jonathan Gabor, Jayson Lynch, Jonathan Rosenfeld

Main category: cs.LG

TL;DR: EvilGenie is a benchmark for detecting reward hacking in programming agents, where models exploit loopholes like hardcoding test cases or editing test files to achieve high scores without solving problems correctly.

Details

Motivation: To create a systematic way to measure and detect reward hacking behavior in programming agents, as current benchmarks may not adequately capture such misaligned behaviors.

Method: Created benchmark using LiveCodeBench problems, implemented three detection methods: held-out unit tests, LLM judges, and test file edit detection, and tested multiple models including proprietary coding agents.

Result: Found LLM judges highly effective at detecting unambiguous reward hacking, minimal improvement from held-out tests, and observed explicit reward hacking by Codex and Claude Code with misaligned behavior across all three major agents.

Conclusion: EvilGenie successfully identifies reward hacking in programming agents, with LLM judges being particularly effective, revealing concerning misalignment in current coding agents that needs addressing.

Abstract: We introduce EvilGenie, a benchmark for reward hacking in programming settings. We source problems from LiveCodeBench and create an environment in which agents can easily reward hack, such as by hardcoding test cases or editing the testing files. We measure reward hacking in three ways: held out unit tests, LLM judges, and test file edit detection. We verify these methods against human review and each other. We find the LLM judge to be highly effective at detecting reward hacking in unambiguous cases, and observe only minimal improvement from the use of held out test cases. In addition to testing many models using Inspect’s basic_agent scaffold, we also measure reward hacking rates for three popular proprietary coding agents: OpenAI’s Codex, Anthropic’s Claude Code, and Google’s Gemini CLI Using GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro, respectively. We observe explicit reward hacking by both Codex and Claude Code, and misaligned behavior by all three agents. Our codebase can be found at https://github.com/JonathanGabor/EvilGenie.

[437] Escaping the Verifier: Learning to Reason via Demonstrations

Locke Cai, Ivan Provilkov

Main category: cs.LG

TL;DR: RARO enables LLMs to learn reasoning from expert demonstrations without task-specific verifiers using adversarial inverse reinforcement learning.

Details

Motivation: Many reasoning tasks lack verifiers but have expert demonstrations, creating a need for methods that can learn reasoning capabilities without explicit reward signals.

Method: Uses adversarial interaction between policy (generator) and relativistic critic (discriminator) that jointly learn via RL, with key stabilization techniques for robust training.

Result: Significantly outperforms verifier-free baselines on Countdown, DeepMath, and Poetry Writing tasks, showing robust scaling trends similar to RL with verifiers.

Conclusion: RARO effectively elicits strong reasoning performance from expert demonstrations alone, enabling robust reasoning learning when task-specific verifiers are unavailable.

Abstract: Training Large Language Models (LLMs) to reason often relies on Reinforcement Learning (RL) with task-specific verifiers. However, many real-world reasoning-intensive tasks lack verifiers, despite offering abundant expert demonstrations that remain under-utilized for reasoning-focused training. We introduce RARO (Relativistic Adversarial Reasoning Optimization) that learns strong reasoning capabilities from only expert demonstrations via Inverse Reinforcement Learning. Our method sets up an adversarial interaction between a policy (generator) and a relativistic critic (discriminator): the policy learns to mimic expert answers, while the critic learns to compare and distinguish between policy and expert answers. Our method trains both the policy and the critic jointly and continuously via RL, and we identify the key stabilization techniques required for robust learning. Empirically, RARO significantly outperforms strong verifier-free baselines on all of our evaluation tasks – Countdown, DeepMath, and Poetry Writing – and enjoys the same robust scaling trends as RL on verifiable tasks. These results demonstrate that our method effectively elicits strong reasoning performance from expert demonstrations alone, enabling robust reasoning learning even when task-specific verifiers are unavailable.

[438] Federated Large Language Models: Current Progress and Future Directions

Yuhang Yao, Jianyi Zhang, Junda Wu, Chengkai Huang, Yu Xia, Tong Yu, Ruiyi Zhang, Sungchul Kim, Ryan Rossi, Ang Li, Lina Yao, Julian McAuley, Yiran Chen, Carlee Joe-Wong

Main category: cs.LG

TL;DR: Survey paper on Federated Learning for Large Language Models (FedLLM), covering recent advances in fine-tuning and prompt learning, addressing challenges like data heterogeneity and communication costs, and proposing future directions.

Details

Motivation: Privacy concerns in LLM training data collection and the need for collaborative training without sharing local data, while addressing FL-specific challenges like model convergence and communication efficiency.

Method: Comprehensive survey approach analyzing existing work on federated fine-tuning and prompt learning for LLMs, identifying research challenges and proposing future directions.

Result: Systematic overview of FedLLM landscape, highlighting current approaches, limitations, and research gaps in federated settings for LLMs.

Conclusion: FedLLM is a promising approach for privacy-preserving LLM training, but requires further research in areas like pre-training, federated agents, and using LLMs to enhance federated learning processes.

Abstract: Large language models are rapidly gaining popularity and have been widely adopted in real-world applications. While the quality of training data is essential, privacy concerns arise during data collection. Federated learning offers a solution by allowing multiple clients to collaboratively train LLMs without sharing local data. However, FL introduces new challenges, such as model convergence issues due to heterogeneous data and high communication costs. A comprehensive study is required to address these challenges and guide future research. This paper surveys Federated learning for LLMs (FedLLM), highlighting recent advances and future directions. We focus on two key aspects: fine-tuning and prompt learning in a federated setting, discussing existing work and associated research challenges. We finally propose potential directions for federated LLMs, including pre-training, federated agents, and LLMs for federated learning.

[439] Through the telecom lens: Are all training samples important?

Shruti Bothe, Illyyne Saffar, Aurelie Boisbunon, Hasan Farooq, Julien Forgeat, Md Moin Uddin Chowdhury

Main category: cs.LG

TL;DR: The paper questions equal importance of training samples in telecom AI, proposing a sample importance framework that identifies impactful data to reduce computation and energy use while maintaining accuracy.

Details

Motivation: Telecom AI faces challenges with noisy, high-dimensional data and high training demands. Current workflows assume all samples contribute equally, but next-gen systems require accurate, efficient, and sustainable AI models.

Method: Perform sample-level gradient analysis across epochs to identify patterns of influence and redundancy. Propose a sample importance framework that selectively prioritizes impactful data and reduces computation.

Result: Experiments on three real-world telecom datasets show the method maintains performance while reducing data needs and computational overhead.

Conclusion: The proposed framework advances sustainable AI in telecommunications by optimizing computation and energy use without compromising accuracy.

Abstract: The rise of AI in telecommunications, from optimizing Radio Access Networks to managing user experience, has sharply increased data volumes and training demands. Telecom data is often noisy, high-dimensional, costly to store, process, and label. Despite Ai’s critical role, standard workflows still assume all training samples contribute equally. On the other hand, next generation systems require AI models that are accurate, efficient, and sustainable.The paper questions the assumptions of equal importance by focusing on applying and analyzing the roles of individual samples in telecom training and assessing whether the proposed model optimizes computation and energy use. we perform sample-level gradient analysis across epochs to identify patterns of influence and redundancy in model learning. Based on this, we propose a sample importance framework thats electively prioritizes impactful data and reduces computation without compromising accuracy. Experiments on three real-world telecom datasets show that our method [reserves performance while reducing data needs and computational overhead while advancing the goals of sustainable AI in telecommunications.

[440] DSD: A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving

Fengze Yu, Leshu Li, Brad McDanel, Saiqian Zhang

Main category: cs.LG

TL;DR: DSD is a distributed speculative decoding framework that extends speculative decoding to multi-device environments, achieving up to 1.1x speedup and 9.7% higher throughput over existing baselines.

Details

Motivation: LLM inference suffers from high decoding latency and limited scalability across heterogeneous edge-cloud environments, with existing speculative decoding techniques confined to single-node execution.

Method: Proposed DSD framework with coordinated draft-target execution across multiple devices, introduced DSD-Sim simulator for network/batching/scheduling analysis, and designed Adaptive Window Control policy for dynamic window size optimization.

Result: DSD achieves up to 1.1x speedup and 9.7% higher throughput over existing SD baselines across diverse workloads.

Conclusion: DSD enables agile and scalable LLM serving across edge and cloud environments through distributed speculative decoding.

Abstract: Large language model (LLM) inference often suffers from high decoding latency and limited scalability across heterogeneous edge-cloud environments. Existing speculative decoding (SD) techniques accelerate token generation but remain confined to single-node execution. We propose DSD, a distributed speculative decoding framework that extends SD to multi-device deployments through coordinated draft-target execution. Given the lack of prior work on simulating this paradigm, we first introduce DSD-Sim, a discrete-event simulator that captures network, batching, and scheduling dynamics. Building on insights from DSD-Sim, we further design an Adaptive Window Control (AWC) policy that dynamically adjusts speculation window size to optimize throughput. Experiments across diverse workloads show that DSD achieves up to 1.1x speedup and 9.7% higher throughput over existing SD baselines, enabling agile and scalable LLM serving across edge and cloud.

[441] Beyond Introspection: Reinforcing Thinking via Externalist Behavioral Feedback

Diji Yang, Linda Zeng, Kezhen Chen, Yi Zhang

Main category: cs.LG

TL;DR: DRR framework uses external discriminative models to critique LLM reasoning instead of self-critique, improving reliability by evaluating observable behaviors rather than introspection.

Details

Motivation: Self-critique methods inherit biases from original outputs (introspection illusion), making reasoning unreliable near knowledge boundaries. Need external evaluation to overcome this limitation.

Method: Three-step DRR framework: 1) Distill behavioral traces from reasoner, 2) Train lightweight external Discriminative Model (DM), 3) Use DM at inference to identify and reject suspicious reasoning steps, forcing LLM to explore alternatives.

Result: Significantly outperforms prominent self-critique methods on multiple reasoning benchmarks, enhancing reasoning quality without altering base model.

Conclusion: DRR provides scalable, annotation-free solution for improving LLM reasoning reliability through external behavioral evaluation rather than introspection.

Abstract: While inference-time thinking allows Large Language Models (LLMs) to address complex problems, the extended thinking process can be unreliable or inconsistent because of the model’s probabilistic nature, especially near its knowledge boundaries. Existing approaches attempt to mitigate this by having the model critique its own reasoning to make corrections. However, such self-critique inherits the same biases of the original output, known as the introspection illusion. Moving beyond such introspection and inspired by core methodologies in ethology, we propose an externalist three-step framework Distillation-Reinforcement-Reasoning (DRR). Rather than relying on a model’s introspection, DRR evaluates its observable behaviors to provide corrective feedback. DRR first distills the reasoner’s behavioral traces, then trains a lightweight, external Discriminative Model (DM). At inference time, this DM acts as a critic, identifying and rejecting suspicious reasoning steps. This external feedback compels the LLM to discard flawed pathways and explore alternatives, thereby enhancing reasoning quality without altering the base model. Experiments on multiple reasoning benchmarks show that our framework significantly outperforms prominent self-critique methods. Benefiting from a lightweight and annotation-free design, DRR offers a scalable and adaptable solution for improving the reliability of reasoning in a wide range of LLMs.

[442] Characterizing Pattern Matching and Its Limits on Compositional Task Structures

Hoyeon Chang, Jinho Park, Hanseul Cho, Sohee Yang, Miyoung Ko, Hyeonbin Hwang, Seungpil Won, Dohaeng Lee, Youbin Ahn, Minjoon Seo

Main category: cs.LG

TL;DR: LLMs rely on pattern matching for compositional tasks, but this leads to OOD generalization failures. The paper formalizes pattern matching as functional equivalence and studies Transformers and Mamba in controlled compositional tasks to understand their limitations.

Details

Motivation: To address ambiguity in behavioral studies that allow multiple generalization sources, and to provide a precise account of how LLMs perform generalization through pattern matching and their limitations.

Method: Formalize pattern matching as functional equivalence, then systematically study decoder-only Transformer and Mamba behavior in controlled compositional tasks that isolate this mechanism.

Result: (1) Pattern matching success predicted by number of witnessing contexts; (2) Proven tight sample complexity bound for learning two-hop structures; (3) Path ambiguity identified as structural barrier; (4) Chain-of-Thought reduces data requirements but doesn’t resolve path ambiguity.

Conclusion: Provides a predictive, falsifiable boundary for pattern matching and a foundational diagnostic for disentangling mixed generalization mechanisms in LLMs.

Abstract: Despite impressive capabilities, LLMs’ successes often rely on pattern-matching behaviors, yet these are also linked to OOD generalization failures in compositional tasks. However, behavioral studies commonly employ task setups that allow multiple generalization sources (e.g., algebraic invariances, structural repetition), obscuring a precise and testable account of how well LLMs perform generalization through pattern matching and their limitations. To address this ambiguity, we first formalize pattern matching as functional equivalence, i.e., identifying pairs of subsequences of inputs that consistently lead to identical results when the rest of the input is held constant. Then, we systematically study how decoder-only Transformer and Mamba behave in controlled tasks with compositional structures that isolate this mechanism. Our formalism yields predictive and quantitative insights: (1) Instance-wise success of pattern matching is well predicted by the number of contexts witnessing the relevant functional equivalence. (2) We prove a tight sample complexity bound of learning a two-hop structure by identifying the exponent of the data scaling law for perfect in-domain generalization. Our empirical results align with the theoretical prediction, under 20x parameter scaling and across architectures. (3) Path ambiguity is a structural barrier: when a variable influences the output via multiple paths, models fail to form unified intermediate state representations, impairing accuracy and interpretability. (4) Chain-of-Thought reduces data requirements yet does not resolve path ambiguity. Hence, we provide a predictive, falsifiable boundary for pattern matching and a foundational diagnostic for disentangling mixed generalization mechanisms.

[443] AutoDiscovery: Open-ended Scientific Discovery via Bayesian Surprise

Dhruv Agarwal, Bodhisattwa Prasad Majumder, Reece Adamson, Megha Chakravorty, Satvika Reddy Gavireddy, Aditya Parashar, Harshit Surana, Bhavana Dalvi Mishra, Andrew McCallum, Ashish Sabharwal, Peter Clark

Main category: cs.LG

TL;DR: AutoDiscovery enables open-ended autonomous scientific discovery using Bayesian surprise to drive exploration, outperforming competitors by producing 5-29% more surprising discoveries across 21 real-world datasets.

Details

Motivation: Current ASD approaches rely on human-specified questions or use diversity heuristics/subjective proxies that struggle with vast hypothesis spaces and imprecise definitions. Scientific discovery could be accelerated by allowing AI systems to drive exploration autonomously.

Method: Uses Bayesian surprise (epistemic shift from prior to posterior beliefs) as exploration driver, with Monte Carlo tree search (MCTS) strategy and progressive widening using surprisal as reward function to efficiently explore nested hypotheses.

Result: Under fixed budget, AutoDiscovery substantially outperforms competitors by producing 5-29% more discoveries deemed surprising by the LLM. Human evaluation shows two-thirds of discoveries are surprising to domain experts.

Conclusion: AutoDiscovery represents an important step towards building effective open-ended autonomous scientific discovery systems that can generate meaningful scientific insights without human guidance.

Abstract: The promise of autonomous scientific discovery (ASD) hinges not only on answering questions, but also on knowing which questions to ask. Most recent works in ASD explore the use of large language models (LLMs) in goal-driven settings, relying on human-specified research questions to guide hypothesis generation. However, scientific discovery may be accelerated further by allowing the AI system to drive exploration by its own criteria. The few existing approaches in open-ended ASD select hypotheses based on diversity heuristics or subjective proxies for human interestingness, but the former struggles to meaningfully navigate the typically vast hypothesis space, and the latter suffers from imprecise definitions. This paper presents AutoDiscovery – a method for open-ended ASD that instead drives scientific exploration using Bayesian surprise. Here, we quantify the epistemic shift from the LLM’s prior beliefs about a hypothesis to its posterior beliefs after gathering experimental results. To efficiently explore the space of nested hypotheses, our method employs a Monte Carlo tree search (MCTS) strategy with progressive widening using surprisal as the reward function. We evaluate AutoDiscovery in the setting of data-driven discovery across 21 real-world datasets spanning domains such as biology, economics, finance, and behavioral science. Our results demonstrate that under a fixed budget, AutoDiscovery substantially outperforms competitors by producing 5-29% more discoveries deemed surprising by the LLM. Our human evaluation further reveals that two-thirds of discoveries made by our system are surprising to domain experts as well, suggesting this is an important step towards building open-ended ASD systems.

[444] Mechanism of Task-oriented Information Removal in In-context Learning

Hakaze Cho, Haolin Yang, Gouki Minegishi, Naoya Inoue

Main category: cs.LG

TL;DR: In-context learning works through selective information removal from language model hidden states, where demonstrations help remove redundant information and focus on the intended task.

Details

Motivation: To understand the inner mechanism of in-context learning (ICL) in language models, which remains unclear despite its effectiveness in few-shot learning.

Method: Investigated ICL through information removal perspective, using low-rank filters to selectively remove specific information from hidden states, and identified denoising attention heads that enable this process.

Result: Found that zero-shot LMs encode non-selective representations containing all possible task information, while ICL selectively removes redundant information through denoising heads, significantly improving accuracy. Blocking these heads degrades ICL performance.

Conclusion: The key mechanism underlying ICL is selective information removal from entangled representations, enabled by specific denoising attention heads that steer models toward intended tasks.

Abstract: In-context Learning (ICL) is an emerging few-shot learning paradigm based on modern Language Models (LMs), yet its inner mechanism remains unclear. In this paper, we investigate the mechanism through a novel perspective of information removal. Specifically, we demonstrate that in the zero-shot scenario, LMs encode queries into non-selective representations in hidden states containing information for all possible tasks, leading to arbitrary outputs without focusing on the intended task, resulting in near-zero accuracy. Meanwhile, we find that selectively removing specific information from hidden states by a low-rank filter effectively steers LMs toward the intended task. Building on these findings, by measuring the hidden states on carefully designed metrics, we observe that few-shot ICL effectively simulates such task-oriented information removal processes, selectively removing the redundant information from entangled non-selective representations, and improving the output based on the demonstrations, which constitutes a key mechanism underlying ICL. Moreover, we identify essential attention heads inducing the removal operation, termed Denoising Heads, which enables the ablation experiments blocking the information removal operation from the inference, where the ICL accuracy significantly degrades, especially when the correct label is absent from the few-shot demonstrations, confirming both the critical role of the information removal mechanism and denoising heads.

[445] Dual-Balancing for Multi-Task Learning

Baijiong Lin, Weisen Jiang, Feiyang Ye, Yu Zhang, Pengguang Chen, Ying-Cong Chen, Shu Liu, Ivor W. Tsang, James T. Kwok

Main category: cs.LG

TL;DR: DB-MTL addresses multi-task learning imbalance by balancing both loss scales and gradient magnitudes through logarithmic transformation and gradient normalization.

Details

Motivation: Multi-task learning faces performance compromises due to disparities in loss and gradient scales among tasks, making task balancing a significant challenge.

Method: DB-MTL performs logarithm transformation on task losses for loss-scale balancing and normalizes all task gradients using maximum gradient norm to achieve comparable gradient magnitudes.

Result: Extensive experiments on benchmark datasets show DB-MTL consistently outperforms current state-of-the-art methods.

Conclusion: DB-MTL effectively addresses multi-task learning imbalance through dual balancing of loss scales and gradient magnitudes, achieving superior performance.

Abstract: Multi-task learning aims to learn multiple related tasks simultaneously and has achieved great success in various fields. However, the disparity in loss and gradient scales among tasks often leads to performance compromises, and the balancing of tasks remains a significant challenge. In this paper, we propose Dual-Balancing Multi-Task Learning (DB-MTL) to achieve task balancing from both the loss and gradient perspectives. Specifically, DB-MTL achieves loss-scale balancing by performing logarithm transformation on each task loss, and rescales gradient magnitudes by normalizing all task gradients to comparable magnitudes using the maximum gradient norm. Extensive experiments on a number of benchmark datasets demonstrate that DB-MTL consistently performs better than the current state-of-the-art.

[446] Data Valuation by Fusing Global and Local Statistical Information

Xiaoling Zhou, Ou Wu, Michael K. Ng, Hao Jiang

Main category: cs.LG

TL;DR: This paper proposes enhanced data valuation methods that incorporate global and local statistical properties of value distributions to improve Shapley value estimation and introduces dynamic data valuation for computational efficiency.

Details

Motivation: Existing Shapley value-based data valuation methods neglect value distribution information and fail to handle dynamic data conditions, limiting their performance and practical applications.

Method: The authors analyze value distributions across datasets, propose regularization terms incorporating distribution characteristics for Shapley value refinement, and develop a dynamic valuation approach that updates data values without recomputing Shapley values.

Result: Extensive experiments across multiple tasks show consistent effectiveness and efficiency improvements in Shapley value estimation, data management tasks, and dynamic valuation scenarios.

Conclusion: Global and local value distributions play a significant role in data valuation, and the proposed methodologies demonstrate substantial potential for improving data valuation accuracy and computational efficiency.

Abstract: Data valuation has garnered increasing attention in recent years, given the critical role of high-quality data in various applications. Among diverse data valuation approaches, Shapley value-based methods are predominant due to their strong theoretical grounding. However, the exact computation of Shapley values is often computationally prohibitive, prompting the development of numerous approximation techniques. Despite notable advancements, existing methods generally neglect the incorporation of value distribution information and fail to account for dynamic data conditions, thereby compromising their performance and application potential. In this paper, we highlight the crucial role of both global and local statistical properties of value distributions in the context of data valuation for machine learning. First, we conduct a comprehensive analysis of these distributions across various simulated and real-world datasets, uncovering valuable insights and key patterns. Second, we propose an enhanced data valuation method that fuses the explored distribution characteristics into two regularization terms to refine Shapley value estimation. The proposed regularizers can be seamlessly incorporated into various existing data valuation methods. Third, we introduce a novel approach for dynamic data valuation that infers updated data values without recomputing Shapley values, thereby significantly improving computational efficiency. Extensive experiments have been conducted across a range of tasks, including Shapley value estimation, value-based data addition and removal, mislabeled data detection, and dynamic data valuation. The results showcase the consistent effectiveness and efficiency of our proposed methodologies, affirming the significant potential of global and local value distributions in data valuation.

[447] Data-Driven Lipschitz Continuity: A Cost-Effective Approach to Improve Adversarial Robustness

Erh-Chung Chen, Pin-Yu Chen, I-Hsin Chung, Che-Rung Lee

Main category: cs.LG

TL;DR: Proposes a cost-efficient adversarial defense method using Lipschitz continuity that achieves comparable robustness to data-intensive methods without requiring external data or gradient estimation.

Details

Motivation: Existing adversarial training methods incur high computational costs by using external datasets or generative models, limiting practical deployment of robust DNNs.

Method: Lipschitz continuity-based approach that requires only a single dataset pass without gradient estimation, and can integrate with existing adversarial training frameworks.

Result: Experimental results show reduced computational overhead while maintaining or improving defensive capabilities compared to conventional methods.

Conclusion: Opens a promising direction for practical, scalable defenses against adversarial attacks by providing efficient robustness without external data requirements.

Abstract: As deep neural networks (DNNs) are increasingly deployed in sensitive applications, ensuring their security and robustness has become critical. A major threat to DNNs arises from adversarial attacks, where small input perturbations can lead to incorrect predictions. Recent advances in adversarial training improve robustness by incorporating additional examples from external datasets or generative models. However, these methods often incur high computational costs, limiting their practicality and hindering real-world deployment. In this paper, we propose a cost-efficient alternative based on Lipschitz continuity that achieves robustness comparable to models trained with extensive supplementary data. Unlike conventional adversarial training, our method requires only a single pass over the dataset without gradient estimation, making it highly efficient. Furthermore, our method can integrate seamlessly with existing adversarial training frameworks and enhances the robustness of models without requiring extra generative data. Experimental results show that our approach not only reduces computational overhead but also maintains or improves the defensive capabilities of robust neural networks. This work opens a promising direction for developing practical, scalable defenses against adversarial attacks.

[448] CoxKAN: Kolmogorov-Arnold Networks for Interpretable, High-Performance Survival Analysis

William Knottenbelt, William McGough, Rebecca Wray, Woody Zhidong Zhang, Jiashuai Liu, Ines Prata Machado, Zeyu Gao, Mireia Crispin-Ortuzar

Main category: cs.LG

TL;DR: CoxKAN is an interpretable survival analysis model combining Cox proportional hazards with Kolmogorov-Arnold Networks, achieving high performance while maintaining transparency for medical applications.

Details

Motivation: Address the trade-off between performance and interpretability in survival analysis for medicine, where practitioners need transparent models for critical patient decisions but deep learning models are black-boxes.

Method: Cox proportional hazards Kolmogorov-Arnold Network (CoxKAN) that uses KANs as an interpretable alternative to multi-layer perceptrons, enabling automatic feature selection and discovery of complex variable interactions.

Result: CoxKAN outperformed traditional Cox models by up to 4% in C-index, matched or surpassed deep learning models, accurately recovered hazard functions in synthetic data, and revealed interpretable symbolic formulae and biomarker interactions in real clinical and genomics datasets.

Conclusion: CoxKAN provides an effective solution for interpretable, high-performance survival analysis that bridges the gap between traditional statistical models and deep learning approaches, offering clear insights into biomarker impacts on patient risk.

Abstract: Motivation: Survival analysis is a branch of statistics that is crucial in medicine for modeling the time to critical events such as death or relapse, in order to improve treatment strategies and patient outcomes. Selecting survival models often involves a trade-off between performance and interpretability; deep learning models offer high performance but lack the transparency of more traditional approaches. This poses a significant issue in medicine, where practitioners are reluctant to use black-box models for critical patient decisions. Results: We introduce CoxKAN, a Cox proportional hazards Kolmogorov-Arnold Network for interpretable, high-performance survival analysis. Kolmogorov-Arnold Networks (KANs) were recently proposed as an interpretable and accurate alternative to multi-layer perceptrons. We evaluated CoxKAN on four synthetic and nine real datasets, including five cohorts with clinical data and four with genomics biomarkers. In synthetic experiments, CoxKAN accurately recovered interpretable hazard function formulae and excelled in automatic feature selection. Evaluations on real datasets showed that CoxKAN consistently outperformed the traditional Cox proportional hazards model (by up to 4% in C-index) and matched or surpassed the performance of deep learning-based models. Importantly, CoxKAN revealed complex interactions between predictor variables and uncovered symbolic formulae, which are key capabilities that other survival analysis methods lack, to provide clear insights into the impact of key biomarkers on patient risk. Availability and implementation: CoxKAN is available at GitHub and Zenodo

Eunjee Choi, Junhyun Ahn, XinYu Piao, Jong-Kook Kim

Main category: cs.LG

TL;DR: CroMe: A multimodal fake news detection method using BLIP2 encoders, metric learning for intra-modality relationships, and Cross-Modal Tri-Transformer for feature fusion.

Details

Motivation: Existing methods overlook intra-modality relationships and inter-modal integration, relying on independently encoded unimodal data.

Method: Uses BLIP2 encoders for text, image, and image-text representations; metric learning with proxy anchor for intra-modality relationships; Cross-Modal Tri-Transformer for feature fusion; classifier for final detection.

Result: Experiments show CroMe excels in multimodal fake news detection on benchmark datasets.

Conclusion: CroMe effectively addresses limitations of existing methods by capturing detailed multimodal representations and integrating intra/inter-modal relationships for superior fake news detection.

Abstract: Multimodal Fake News Detection has received increasing attention recently. Existing methods rely on independently encoded unimodal data and overlook the advantages of capturing intra-modality relationships and integrating inter-modal similarities using advanced techniques. To address these issues, Cross-Modal Tri-Transformer and Metric Learning for Multimodal Fake News Detection (CroMe) is proposed. CroMe utilizes Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (BLIP2) as encoders to capture detailed text, image and combined image-text representations. The metric learning module employs a proxy anchor method to capture intra-modality relationships while the feature fusion module uses a Cross-Modal and Tri-Transformer for effective integration. The final fake news detector processes the fused features through a classifier to predict the authenticity of the content. Experiments on datasets show that CroMe excels in multimodal fake news detection.

[450] Enabling Differentially Private Federated Learning for Speech Recognition: Benchmarks, Adaptive Optimizers and Gradient Clipping

Martin Pelikan, Sheikh Shams Azam, Vitaly Feldman, Jan “Honza” Silovsky, Kunal Talwar, Christopher G. Brinton, Tatiana Likhomanenko

Main category: cs.LG

TL;DR: First benchmark for federated learning with differential privacy in automatic speech recognition, achieving strong privacy guarantees with minimal performance degradation using per-layer clipping and layer-wise gradient normalization.

Details

Motivation: Federated learning and differential privacy have not been well-explored for ASR due to challenges in training large transformer models, particularly gradient heterogeneity across layers in deep models.

Method: Per-layer clipping and layer-wise gradient normalization to mitigate clipping bias and gradient heterogeneity across layers in deeper transformer models.

Result: Achieved user-level (7.2, 10^-9)-DP with only 1.3% absolute drop in word error rate at high population scales, and (4.5, 10^-9)-DP with 4.6% drop at low population scales.

Conclusion: FL with DP is viable for ASR under strong privacy guarantees with sufficient user population, and the principles discovered offer broader guidance for privacy-preserving FL algorithms for large models across domains.

Abstract: While federated learning (FL) and differential privacy (DP) have been extensively studied, their application to automatic speech recognition (ASR) remains largely unexplored due to the challenges in training large transformer models. Specifically, large models further exacerbate issues in FL as they are particularly susceptible to gradient heterogeneity across layers, unlike the relatively uniform gradient behavior observed in shallow models. As a result, prior works struggle to converge with standard optimization techniques, even in the absence of DP mechanisms. To the best of our knowledge, no existing work establishes a competitive, practical recipe for FL with DP in the context of ASR. To address this gap, we establish \textbf{the first benchmark for FL with DP in end-to-end ASR}. Our approach centers on per-layer clipping and layer-wise gradient normalization: theoretical analysis reveals that these techniques together mitigate clipping bias and gradient heterogeneity across layers in deeper models. Consistent with these theoretical insights, our empirical results show that FL with DP is viable under strong privacy guarantees, provided a population of at least several million users. Specifically, we achieve user-level (7.2, $10^{-9}$)-DP (resp. (4.5, $10^{-9}$)-DP) with only a 1.3% (resp. 4.6%) absolute drop in word error rate when extrapolating to high (resp. low) population scales for FL with DP in ASR. Although our experiments focus on ASR, the underlying principles we uncover - particularly those concerning gradient heterogeneity and layer-wise gradient normalization - offer broader guidance for designing scalable, privacy-preserving FL algorithms for large models across domains. Code of all experiments and benchmarks is available at https://github.com/apple/ml-pfl4asr.

[451] TS-RAG: Retrieval-Augmented Generation based Time Series Foundation Models are Stronger Zero-Shot Forecaster

Kanghui Ning, Zijie Pan, Yu Liu, Yushan Jiang, James Yiming Zhang, Kashif Rasul, Anderson Schneider, Lintao Ma, Yuriy Nevmyvaka, Dongjin Song

Main category: cs.LG

TL;DR: TS-RAG is a retrieval-augmented generation framework for time series forecasting that enhances generalization and interpretability of Time Series Foundation Models by retrieving relevant patterns from a knowledge base and dynamically fusing them with model representations.

Details

Motivation: Existing Time Series Foundation Models struggle with non-stationary dynamics, distribution shifts, and generalization across diverse datasets due to lack of effective adaptation mechanisms.

Method: Leverages pre-trained time series encoders to retrieve semantically relevant segments from a knowledge base, and uses an Adaptive Retrieval Mixer (ARM) module to dynamically fuse retrieved patterns with TSFM’s internal representations without task-specific fine-tuning.

Result: Achieves state-of-the-art zero-shot forecasting performance, outperforming existing TSFMs by up to 6.84% across seven public benchmark datasets from diverse domains.

Conclusion: TS-RAG provides an effective framework for improving time series forecasting accuracy and interpretability while maintaining strong generalization capabilities across different domains.

Abstract: Large Language Models (LLMs) and Foundation Models (FMs) have recently become prevalent for time series forecasting tasks. While fine-tuning LLMs enables domain adaptation, they often struggle to generalize across diverse and unseen datasets. Moreover, existing Time Series Foundation Models (TSFMs) still face challenges in handling non-stationary dynamics and distribution shifts, largely due to the lack of effective mechanisms for adaptation. To this end, we present TS-RAG, a retrieval-augmented generation framework for time series forecasting that enhances the generalization and interpretability of TSFMs. Specifically, TS-RAG leverages pre-trained time series encoders to retrieve semantically relevant segments from a dedicated knowledge base, enriching the contextual representation of the input query. Furthermore, we propose an Adaptive Retrieval Mixer (ARM) module that dynamically fuses the retrieved patterns with the TSFM’s internal representation, improving forecasting accuracy without requiring task-specific fine-tuning. Thorough empirical studies on seven public benchmark datasets demonstrate that TS-RAG achieves state-of-the-art zero-shot forecasting performance, outperforming the existing TSFMs by up to 6.84% across diverse domains while also providing desirable interpretability. Our code and data are available at: https://github.com/UConn-DSIS/TS-RAG

[452] TinyFormer: Efficient Transformer Design and Deployment on Tiny Devices

Jianlei Yang, Jiacheng Liao, Fanding Lei, Meichen Liu, Lingkun Long, Junyi Chen, Han Wan, Bei Yu, Weisheng Zhao

Main category: cs.LG

TL;DR: TinyFormer is a framework for developing and deploying resource-efficient transformer models on Microcontroller Units (MCUs) through neural architecture search and sparse inference optimization.

Details

Motivation: There's a need to deploy advanced models like transformers on tiny devices with severe hardware constraints (1MB storage, 320KB memory), which is challenging due to resource limitations.

Method: TinyFormer uses three components: SuperNAS for supernet search, SparseNAS for finding optimal sparse models, and SparseEngine for efficient MCU deployment of sparse transformers.

Result: Achieves 96.1% accuracy on CIFAR-10 while meeting hardware constraints, with up to 12.2x speedup in sparse inference compared to CMSIS-NN library.

Conclusion: TinyFormer successfully enables transformer deployment on MCUs, expanding deep learning applications in TinyML scenarios.

Abstract: Developing deep learning models on tiny devices (e.g. Microcontroller units, MCUs) has attracted much attention in various embedded IoT applications. However, it is challenging to efficiently design and deploy recent advanced models (e.g. transformers) on tiny devices due to their severe hardware resource constraints. In this work, we propose TinyFormer, a framework specifically designed to develop and deploy resource-efficient transformer models on MCUs. TinyFormer consists of SuperNAS, SparseNAS, and SparseEngine. Separately, SuperNAS aims to search for an appropriate supernet from a vast search space. SparseNAS evaluates the best sparse single-path transformer model from the identified supernet. Finally, SparseEngine efficiently deploys the searched sparse models onto MCUs. To the best of our knowledge, SparseEngine is the first deployment framework capable of performing inference of sparse transformer models on MCUs. Evaluation results on the CIFAR-10 dataset demonstrate that TinyFormer can design efficient transformers with an accuracy of 96.1% while adhering to hardware constraints of 1MB storage and 320KB memory. Additionally, TinyFormer achieves significant speedups in sparse inference, up to 12.2x comparing to the CMSIS-NN library. TinyFormer is believed to bring powerful transformers into TinyML scenarios and to greatly expand the scope of deep learning applications

[453] Single- vs. Dual-Policy Reinforcement Learning for Dynamic Bike Rebalancing

Jiaqi Liang, Defeng Liu, Sanjay Dominik Jena, Andrea Lodi, Thibaut Vidal

Main category: cs.LG

TL;DR: This paper proposes two reinforcement learning approaches for dynamic bike-sharing rebalancing: single-policy RL using DQN for joint inventory and routing decisions, and dual-policy RL that decouples these decisions for better efficiency.

Details

Motivation: Bike-sharing systems need effective rebalancing strategies to handle stochastic demand and prevent station imbalances, ensuring system reliability and sustainable urban mobility.

Method: Formulated as Markov Decision Process in continuous-time framework; developed single-policy DQN for joint decisions and dual-policy RL that separates inventory and routing decisions; used high-fidelity simulator with first-arrive-first-serve rule.

Result: Both RL models outperform benchmarks, with dual-policy model significantly reducing lost demand compared to single-policy model.

Conclusion: Reinforcement learning shows strong potential for real-time bike-sharing rebalancing, with dual-policy approach providing more adaptive and intelligent urban mobility solutions.

Abstract: Bike-sharing systems (BSS) provide a sustainable urban mobility solution, but ensuring their reliability requires effective rebalancing strategies to address stochastic demand and prevent station imbalances. This paper proposes reinforcement learning (RL) algorithms for dynamic rebalancing problem with multiple vehicles, introducing and comparing two RL approaches: Single-policy RL and Dual-policy RL. We formulate this network optimization problem as a Markov Decision Process within a continuous-time framework, allowing vehicles to make independent and cooperative rebalancing decisions without synchronization constraints. In the first approach, a single deep Q-network (DQN) is trained to jointly learn inventory and routing decisions. The second approach decouples node-level inventory decisions from arc-level vehicle routing, enhancing learning efficiency and adaptability. A high-fidelity simulator under the first-arrive-first-serve rule is developed to estimate rewards across diverse demand scenarios influenced by temporal and weather variations. Extensive experiments demonstrate that while the single-policy model is competitive against several benchmarks, the dual-policy model significantly reduces lost demand. These findings provide valuable insights for bike-sharing operators, reinforcing the potential of RL for real-time rebalancing and paving the way for more adaptive and intelligent urban mobility solutions.

[454] Federated Learning: A Stochastic Approximation Approach

Srihari P, Anik Kumar Paul, Bharath Bhikkaji

Main category: cs.LG

TL;DR: This paper analyzes federated learning using client-specific tapering step sizes instead of constant step sizes, achieving almost sure convergence and allowing clients with rare data to exert greater influence on the global model.

Details

Motivation: Prior federated learning approaches used constant step sizes across clients, leading to convergence only in expectation. This work aims to achieve stronger convergence guarantees (with probability one) and enable differential influence of clients based on their data characteristics.

Method: The authors use client-specific tapering step sizes a^{(i)}_n in a stochastic approximation framework. The global model tracks an ODE with forcing function as weighted sum of client gradients, where weights are determined by limiting ratios p^{(i)} of step sizes.

Result: The proposed method achieves convergence with probability one (stronger than prior expectation convergence). Clients with larger p^{(i)} exert greater influence on the global model, enabling preferential treatment for clients with rare/uncommon data. Numerical experiments validate convergence and demonstrate step-size regulation effects.

Conclusion: Client-specific tapering step sizes in federated learning provide stronger convergence guarantees and allow controlled influence of different clients, particularly beneficial for prioritizing clients with rare data distributions.

Abstract: This paper considers the Federated learning (FL) in a stochastic approximation (SA) framework. Here, each client $i$ trains a local model using its dataset $\mathcal{D}^{(i)}$ and periodically transmits the model parameters $w^{(i)}_n$ to a central server, where they are aggregated into a global model parameter $\bar{w}_n$ and sent back. The clients continue their training by re-initializing their local models with the global model parameters. Prior works typically assumed constant (and often identical) step sizes (learning rates) across clients for model training. As a consequence the aggregated model converges only in expectation. In this work, client-specific tapering step sizes $a^{(i)}n$ are used. The global model is shown to track an ODE with a forcing function equal to the weighted sum of the negative gradients of the individual clients. The weights being the limiting ratios $p^{(i)}=\lim{n \to \infty} \frac{a^{(i)}_n}{a^{(1)}_n}$ of the step sizes, where $a^{(1)}_n \geq a^{(i)}_n, \forall n$. Unlike the constant step sizes, the convergence here is with probability one. In this framework, the clients with the larger $p^{(i)}$ exert a greater influence on the global model than those with smaller $p^{(i)}$, which can be used to favor clients that have rare and uncommon data. Numerical experiments were conducted to validate the convergence and demonstrate the choice of step-sizes for regulating the influence of the clients.

[455] CTSyn: A Foundation Model for Cross Tabular Data Generation

Xiaofeng Lin, Chenheng Xu, Matthew Yang, Guang Cheng

Main category: cs.LG

TL;DR: CTSyn is a diffusion-based generative foundation model for tabular data that uses an autoencoder to unify diverse tables into a latent space and a conditional diffusion model to generate synthetic data, outperforming existing methods.

Details

Motivation: Current cross-table learning frameworks lack generative model backbones and effective mechanisms to handle heterogeneous tabular features, limiting their ability to generate high-quality synthetic tabular data.

Method: CTSyn uses an autoencoder network to consolidate diverse tables into a unified latent space with dynamic value reconstruction using table schema embeddings, combined with a conditional latent diffusion model that generates samples conditioned on table schema.

Result: CTSyn outperforms existing table synthesizers on standard benchmarks in both utility and diversity through large-scale pre-training.

Conclusion: CTSyn is a promising framework for synthetic table generation and lays the groundwork for developing large-scale tabular foundation models.

Abstract: Generative Foundation Models (GFMs) have achieved remarkable success in producing high-quality synthetic data for images and text. However, their application to tabular data presents significant challenges due to the heterogeneous nature of table features. Current cross-table learning frameworks struggle because they lack a generative model backbone and an effective mechanism to decode heterogeneous feature values. To address these challenges, we propose the Cross-Table Synthesizer (CTSyn), a diffusion-based generative foundation model for tabular data generation. CTSyn comprises two key components. The first is an autoencoder network that consolidates diverse tables into a unified latent space. It dynamically reconstructs table values using a table schema embedding, allowing adaptation to heterogeneous datasets. The second is a conditional latent diffusion model that generates samples from the learned latent space, conditioned on the table schema. Through large-scale pre-training, CTSyn outperforms existing table synthesizers on standard benchmarks in both utility and diversity. These results position CTSyn as a promising framework for synthetic table generation and lay the groundwork for developing large-scale tabular foundation models.

[456] HO-FMN: Hyperparameter Optimization for Fast Minimum-Norm Attacks

Raffaele Mura, Giuseppe Floris, Luca Scionis, Giorgio Piras, Maura Pintor, Ambra Demontis, Giorgio Giacinto, Battista Biggio, Fabio Roli

Main category: cs.LG

TL;DR: Proposes a parametric fast minimum-norm attack (HO-FMN) with dynamic loss functions, optimizers, step-size schedulers, and hyperparameters to find smaller adversarial perturbations without manual tuning.

Details

Motivation: Existing gradient-based attacks provide overly-optimistic robustness evaluations due to fixed loss functions, optimizers, step-size schedulers, and default hyperparameters.

Method: Developed a parametric variation of the fast minimum-norm attack algorithm where loss, optimizer, step-size scheduler, and hyperparameters can be dynamically adjusted.

Result: The attack finds smaller adversarial perturbations than existing methods without requiring additional tuning, and enables reporting robustness as a function of perturbation budget.

Conclusion: HO-FMN provides more complete and accurate adversarial robustness evaluation than fixed-budget attacks while remaining computationally efficient.

Abstract: Gradient-based attacks are a primary tool to evaluate robustness of machine-learning models. However, many attacks tend to provide overly-optimistic evaluations as they use fixed loss functions, optimizers, step-size schedulers, and default hyperparameters. In this work, we tackle these limitations by proposing a parametric variation of the well-known fast minimum-norm attack algorithm, whose loss, optimizer, step-size scheduler, and hyperparameters can be dynamically adjusted. We re-evaluate 12 robust models, showing that our attack finds smaller adversarial perturbations without requiring any additional tuning. This also enables reporting adversarial robustness as a function of the perturbation budget, providing a more complete evaluation than that offered by fixed-budget attacks, while remaining efficient. We release our open-source code at https://github.com/pralab/HO-FMN.

[457] No Request Left Behind: Tackling Heterogeneity in Long-Context LLM Inference with Medha

Amey Agrawal, Haoran Qiu, Junda Chen, Íñigo Goiri, Chaojie Zhang, Rayyan Shahid, Ramachandran Ramjee, Alexey Tumanov, Esha Choukse

Main category: cs.LG

TL;DR: Medha is a serving system that eliminates convoy effects in LLM inference through fine-grained preemptive scheduling, improving throughput by 5.7x and reducing latency significantly.

Details

Motivation: Production LLM workloads are highly heterogeneous with mixed short queries and long documents, creating convoy effects where long requests stall short ones, degrading system responsiveness due to attention's quadratic complexity.

Method: Uses fine-grained preemptive scheduling with Adaptive Chunking, Stream Pipeline Parallel, and KV-Cache Parallelism mechanisms, orchestrated by a Length-Aware Relative Slack (LARS) scheduler that prevents convoy effects and starvation.

Result: Under heterogeneous workloads, Medha improves throughput by 5.7x while reducing median and 99th percentile latency by 30x and 174x respectively compared to state-of-the-art non-preemptive systems.

Conclusion: Medha successfully eliminates convoy effects in LLM serving through practical preemptive scheduling mechanisms, significantly improving both throughput and latency for heterogeneous workloads.

Abstract: Deploying million-token Large Language Models (LLMs) is challenging because production workloads are highly heterogeneous, mixing short queries and long documents. This heterogeneity, combined with the quadratic complexity of attention, creates severe convoy effects where long-running requests stall short, interactive ones, degrading system responsiveness. We present Medha, a serving system that eliminates these convoys by introducing fine-grained, preemptive scheduling to LLM inference. Medha makes preemption practical with a co-designed set of mechanisms – including Adaptive Chunking and Stream Pipeline Parallel that overcome the perceived inefficiencies and scaling challenges of chunking. Additionally, we present a new parallelism strategy KV-Cache Parallelism to reduce the decode latency and afford interactivity despite very long context. These mechanisms are orchestrated by a Length-Aware Relative Slack (LARS) scheduler, a deadline and heterogeneity-aware scheduling policy that prevents both the convoy effect and the starvation that plagues simpler policies. Under a heterogeneous workload, Medha improves throughput by 5.7x while reducing median and 99th percentile latency by 30x and 174x, respectively, compared to state-of-the-art non-preemptive systems.

[458] HoGA: Higher-Order Graph Attention via Diversity-Aware k-Hop Sampling

Thomas Bailie, Yun Sing Koh, Karthik Mukkavilli

Main category: cs.LG

TL;DR: HoGA introduces a higher-order graph attention module that samples diverse subgraphs to capture varied higher-order relationships, improving accuracy in node classification tasks over existing methods.

Details

Motivation: Edge-based MPNNs have limited expressive power for discovering higher-order relationships in graphs, and existing higher-order attention methods often resample similar structures, leading to redundancy.

Method: The HoGA module constructs a k-order attention matrix by sampling subgraphs to maximize diversity among feature vectors, targeting diverse modalities in higher-order topology.

Result: HoGA achieves at least 5% accuracy gain on all benchmark node classification datasets and outperforms recent baselines on six of eight datasets.

Conclusion: HoGA effectively captures diverse higher-order relationships in graphs, reducing redundancy and expanding the range of captured substructures for improved performance.

Abstract: Graphs model latent variable relationships in many real-world systems, and Message Passing Neural Networks (MPNNs) are widely used to learn such structures for downstream tasks. While edge-based MPNNs effectively capture local interactions, their expressive power is theoretically bounded, limiting the discovery of higher-order relationships. We introduce the Higher-Order Graph Attention (HoGA) module, which constructs a k-order attention matrix by sampling subgraphs to maximize diversity among feature vectors. Unlike existing higher-order attention methods that greedily resample similar k-order relationships, HoGA targets diverse modalities in higher-order topology, reducing redundancy and expanding the range of captured substructures. Applied to two single-hop attention models, HoGA achieves at least a 5% accuracy gain on all benchmark node classification datasets and outperforms recent baselines on six of eight datasets. Code is available at https://github.com/TB862/Higher_Order.

[459] Physics-Constrained Flow Matching: Sampling Generative Models with Hard Constraints

Utkarsh Utkarsh, Pengfei Cai, Alan Edelman, Rafael Gomez-Bombarelli, Christopher Vincent Rackauckas

Main category: cs.LG

TL;DR: PCFM is a zero-shot inference framework that enforces hard physical constraints in pretrained flow-based generative models for PDE simulations, outperforming existing methods while guaranteeing exact constraint satisfaction.

Details

Motivation: Existing deep generative models for PDEs struggle to enforce hard physical constraints like conservation laws, often relying on soft penalties that don't guarantee constraint satisfaction.

Method: Physics-Constrained Flow Matching (PCFM) continuously guides sampling through physics-based corrections applied to intermediate states, while remaining aligned with learned flows and satisfying constraints.

Result: PCFM outperforms both unconstrained and constrained baselines on various PDEs with shocks, discontinuities, and sharp features, ensuring exact constraint satisfaction at final solutions.

Conclusion: PCFM provides a flexible framework for enforcing hard constraints in scientific and general-purpose generative models, especially crucial for applications where constraint satisfaction is essential.

Abstract: Deep generative models have recently been applied to physical systems governed by partial differential equations (PDEs), offering scalable simulation and uncertainty-aware inference. However, enforcing physical constraints, such as conservation laws (linear and nonlinear) and physical consistencies, remains challenging. Existing methods often rely on soft penalties or architectural biases that fail to guarantee hard constraints. In this work, we propose Physics-Constrained Flow Matching (PCFM), a zero-shot inference framework that enforces arbitrary nonlinear constraints in pretrained flow-based generative models. PCFM continuously guides the sampling process through physics-based corrections applied to intermediate solution states, while remaining aligned with the learned flow and satisfying physical constraints. Empirically, PCFM outperforms both unconstrained and constrained baselines on a range of PDEs, including those with shocks, discontinuities, and sharp features, while ensuring exact constraint satisfaction at the final solution. Our method provides a flexible framework for enforcing hard constraints in both scientific and general-purpose generative models, especially in applications where constraint satisfaction is essential.

[460] On the Effectiveness of Adversarial Training on Malware Classifiers

Hamid Bostani, Jacopo Cortellazzi, Daniel Arp, Fabio Pierazzi, Veelasha Moonsamy, Lorenzo Cavallaro

Main category: cs.LG

TL;DR: Rubik is a framework for systematically evaluating Adversarial Training (AT) in malware detection, addressing prior research gaps by examining multiple dimensions like data, features, classifiers, and optimization settings to understand AT’s real-world effectiveness.

Details

Motivation: Prior research on AT for malware detection is fragmented and overlooks malware's inherent nature, using weak evaluations that yield non-generalizable insights. There's a need for systematic evaluation to understand AT's true effectiveness in real-world scenarios.

Method: Introduces Rubik framework that defines key factors across multiple dimensions (data, feature representations, classifiers, robust optimization settings) and uses reliable evaluation practices like realistic evasion attacks. Applied to Android malware for empirical analysis.

Result: Findings challenge prior beliefs - realizable adversarial examples offer only conditional robustness benefits. Reveals critical role of model architecture and feature-space structure in AT’s success. Identifies four key insights and exposes four common evaluation misconceptions.

Conclusion: Provides practical recommendations to guide development of truly robust malware classifiers based on systematic analysis of AT’s effectiveness in malware domain.

Abstract: Adversarial Training (AT) is a key defense against Machine Learning evasion attacks, but its effectiveness for real-world malware detection remains poorly understood. This uncertainty stems from a critical disconnect in prior research: studies often overlook the inherent nature of malware and are fragmented, examining diverse variables like realism or confidence of adversarial examples in isolation, or relying on weak evaluations that yield non-generalizable insights. To address this, we introduce Rubik, a framework for the systematic, multi-dimensional evaluation of AT in the malware domain. This framework defines diverse key factors across essential dimensions, including data, feature representations, classifiers, and robust optimization settings, for a comprehensive exploration of the interplay of influential AT’s variables through reliable evaluation practices, such as realistic evasion attacks. We instantiate Rubik on Android malware, empirically analyzing how this interplay shapes robustness. Our findings challenge prior beliefs–showing, for instance, that realizable adversarial examples offer only conditional robustness benefits–and reveal new insights, such as the critical role of model architecture and feature-space structure in determining AT’s success. From this analysis, we distill four key insights, expose four common evaluation misconceptions, and offer practical recommendations to guide the development of truly robust malware classifiers.

[461] Flow Matching Meets PDEs: A Unified Framework for Physics-Constrained Generation

Giacomo Baldan, Qiang Liu, Alberto Guardone, Nils Thuerey

Main category: cs.LG

TL;DR: PBFM is a physics-constrained generative framework that embeds PDE residuals and algebraic relations into flow matching, achieving up to 8× better physical accuracy than standard flow matching without hyperparameter tuning.

Details

Motivation: Standard generative methods like diffusion models learn physics implicitly from data, lacking explicit physical constraints which limits their accuracy and reliability for physics applications.

Method: Physics-Based Flow Matching (PBFM) explicitly embeds physical constraints into flow matching objective, uses temporal unrolling during training, and jointly minimizes flow matching and physics-based residual losses without weight tuning.

Result: PBFM achieves up to 8× more accurate physical residuals compared to standard flow matching, outperforms existing methods in distributional accuracy across three PDE benchmark problems.

Conclusion: PBFM provides a principled framework for physics-constrained surrogate modeling that enables improved accuracy for uncertainty quantification and accelerated simulation in engineering applications.

Abstract: Generative machine learning methods, such as diffusion models and flow matching, have shown great potential in modeling complex system behaviors and building efficient surrogate models. However, these methods typically learn the underlying physics implicitly from data. We propose Physics-Based Flow Matching (PBFM), a novel generative framework that explicitly embeds physical constraints, both PDE residuals and algebraic relations, into the flow matching objective. We also introduce temporal unrolling at training time that improves the accuracy of the final, noise-free sample prediction. Our method jointly minimizes the flow matching loss and the physics-based residual loss without requiring hyperparameter tuning of their relative weights. Additionally, we analyze the role of the minimum noise level, $σ_{\min}$, in the context of physical constraints and evaluate a stochastic sampling strategy that helps to reduce physical residuals. Through extensive benchmarks on three representative PDE problems, we show that our approach yields up to an $8\times$ more accurate physical residuals compared to FM, while clearly outperforming existing algorithms in terms of distributional accuracy. PBFM thus provides a principled and efficient framework for surrogate modeling, uncertainty quantification, and accelerated simulation in physics and engineering applications.

[462] A Unifying View of Linear Function Approximation in Off-Policy RL Through Matrix Splitting and Preconditioning

Zechen Wu, Amy Greenwald, Ronald Parr

Main category: cs.LG

TL;DR: This paper provides a unified mathematical framework showing that TD, PFQI, and FQI are all solving the same linear system but with different matrix splitting schemes and preconditioners, explaining why TD convergence doesn’t guarantee FQI convergence.

Details

Motivation: To resolve the traditional view that TD, PFQI, and FQI differ only in the number of updates to the target value function, and to provide a unified theoretical framework that explains their convergence relationships.

Method: Developed a mathematical framework using linear value function approximation that unifies TD, PFQI, and FQI as iterative methods solving the same linear system with different matrix splitting schemes and preconditioners.

Result: Showed that target network technique transitions from constant to data-feature adaptive preconditioning, established tight convergence connections, provided sharper theoretical results without feature independence assumptions, and discovered new convergence conditions.

Conclusion: The unified framework reveals fundamental connections between TD, PFQI, and FQI, explains their convergence differences, enables dropping traditional feature assumptions, and provides new insights into learning rate effects and convergence conditions.

Abstract: In off-policy policy evaluation (OPE) tasks within reinforcement learning, Temporal Difference Learning(TD) and Fitted Q-Iteration (FQI) have traditionally been viewed as differing in the number of updates toward the target value function: TD makes one update, FQI makes an infinite number, and Partial Fitted Q-Iteration (PFQI) performs a finite number. We show that this view is not accurate, and provide a new mathematical perspective under linear value function approximation that unifies these methods as a single iterative method solving the same linear system, but using different matrix splitting schemes and preconditioners. We show that increasing the number of updates under the same target value function, i.e., the target network technique, is a transition from using a constant preconditioner to using a data-feature adaptive preconditioner. This elucidates, for the first time, why TD convergence does not necessarily imply FQI convergence, and establishes tight convergence connections among TD, PFQI, and FQI. Our framework enables sharper theoretical results than previous work and characterization of the convergence conditions for each algorithm, without relying on assumptions about the features (e.g., linear independence). We also provide an encoder-decoder perspective to better understand the convergence conditions of TD, and prove, for the first time, that when a large learning rate doesn’t work, trying a smaller one may help. Our framework also leads to the discovery of new crucial conditions on features for convergence, and shows how common assumptions about features influence convergence, e.g., the assumption of linearly independent features can be dropped without compromising the convergence guarantees of stochastic TD in the on-policy setting. This paper is also the first to introduce matrix splitting into the convergence analysis of these algorithms.

[463] Evolutionary Prediction Games

Eden Saig, Nir Rosenfeld

Main category: cs.LG

TL;DR: The paper introduces evolutionary prediction games to model feedback loops between prediction algorithms and user populations, showing that while idealized settings lead to competitive exclusion, realistic constraints enable stable coexistence and mutualistic symbiosis between user groups.

Details

Motivation: To understand how disparities in prediction quality create feedback loops that shape both machine learning models and user populations, particularly when users respond to accurate predictions by increasing engagement.

Method: Developed evolutionary prediction games framework based on evolutionary game theory, analyzing behavioral dynamics under both idealized settings (unlimited data/compute) and realistic constraints (finite data, limited compute, overfitting risk).

Result: Found that idealized settings promote competitive exclusion, while realistic constraints enable stable coexistence and mutualistic symbiosis between user groups. Analyzed stability and feasibility of these outcomes and presented sustaining mechanisms.

Conclusion: Real-world learning constraints fundamentally change the evolutionary dynamics of prediction systems, making stable coexistence between user groups possible, in contrast to competitive exclusion in idealized settings.

Abstract: When a prediction algorithm serves a collection of users, disparities in prediction quality are likely to emerge. If users respond to accurate predictions by increasing engagement, inviting friends, or adopting trends, repeated learning creates a feedback loop that shapes both the model and the population of its users. In this work, we introduce evolutionary prediction games, a framework grounded in evolutionary game theory which models such feedback loops as natural-selection processes among groups of users. Our theoretical analysis reveals a gap between idealized and real-world learning settings: In idealized settings with unlimited data and computational power, repeated learning creates competition and promotes competitive exclusion across a broad class of behavioral dynamics. However, under realistic constraints such as finite data, limited compute, or risk of overfitting, we show that stable coexistence and mutualistic symbiosis between groups becomes possible. We analyze these possibilities in terms of their stability and feasibility, present mechanisms that can sustain their existence, and empirically demonstrate our findings.

[464] Fair Algorithms with Probing for Multi-Agent Multi-Armed Bandits

Tianyi Xu, Jiaxin Liu, Nicholas Mattei, Zizhan Zheng

Main category: cs.LG

TL;DR: Proposes a multi-agent multi-armed bandit framework with strategic probing to ensure fairness while maximizing system performance, with algorithms for both offline and online settings.

Details

Motivation: To address the challenge of ensuring fair outcomes across agents in multi-agent systems while maximizing overall performance under limited information about arm rewards.

Method: Introduces a novel probing framework that strategically gathers information before allocation. Uses submodular properties for greedy probing in offline setting, and develops an online algorithm with sublinear regret.

Result: Extensive experiments on synthetic and real-world datasets show the approach outperforms baseline methods, achieving better fairness and efficiency.

Conclusion: The proposed MA-MAB framework with strategic probing effectively balances fairness and performance in multi-agent decision-making under uncertainty.

Abstract: We propose a multi-agent multi-armed bandit (MA-MAB) framework aimed at ensuring fair outcomes across agents while maximizing overall system performance. A key challenge in this setting is decision-making under limited information about arm rewards. To address this, we introduce a novel probing framework that strategically gathers information about selected arms before allocation. In the offline setting, where reward distributions are known, we leverage submodular properties to design a greedy probing algorithm with a provable performance bound. For the more complex online setting, we develop an algorithm that achieves sublinear regret while maintaining fairness. Extensive experiments on synthetic and real-world datasets show that our approach outperforms baseline methods, achieving better fairness and efficiency.

[465] F-INR: Functional Tensor Decomposition for Implicit Neural Representations

Sai Karthikeya Vemuri, Tim Büchner, Joachim Denzler

Main category: cs.LG

TL;DR: F-INR is a framework that factorizes high-dimensional Implicit Neural Representations (INRs) into compact axis-specific sub-networks using functional tensor decomposition, enabling faster training and better performance than state-of-the-art INRs.

Details

Motivation: Monolithic INRs scale poorly with data dimensionality, leading to excessive training costs and limited representational capacity for high-dimensional signals.

Method: Factorizes high-dimensional INR into compact axis-specific sub-networks using functional tensor decomposition, integrating with various INR backbones (SIREN, WIRE, FINER, Factor Fields) and tensor formats (CP, TT, Tucker).

Result: Accelerates training by up to 20× and improves fidelity by over 6.0 dB PSNR compared to state-of-the-art INRs across image representation, 3D geometry reconstruction, neural radiance fields, and physics simulations.

Conclusion: F-INR provides a scalable, flexible, and efficient framework for high-dimensional signal modeling with fine-grained control over speed-accuracy trade-offs.

Abstract: Implicit Neural Representations (INRs) model signals as continuous, differentiable functions. However, monolithic INRs scale poorly with data dimensionality, leading to excessive training costs. We propose F-INR, a framework that addresses this limitation by factorizing a high-dimensional INR into a set of compact, axis-specific sub-networks based on functional tensor decomposition. These sub-networks learn low-dimensional functional components that are then combined via tensor operations. This factorization reduces computational complexity while additionally improving representational capacity. F-INR is both architecture- and decomposition-agnostic. It integrates with various existing INR backbones (e.g., SIREN, WIRE, FINER, Factor Fields) and tensor formats (e.g., CP, TT, Tucker), offering fine-grained control over the speed-accuracy trade-off via the tensor rank and mode. Our experiments show F-INR accelerates training by up to $20\times$ and improves fidelity by over \num{6.0} dB PSNR compared to state-of-the-art INRs. We validate these gains on diverse tasks, including image representation, 3D geometry reconstruction, and neural radiance fields. We further show F-INR’s applicability to scientific computing by modeling complex physics simulations. Thus, F-INR provides a scalable, flexible, and efficient framework for high-dimensional signal modeling. Project page: https://f-inr.github.io

[466] Active Learning Methods for Efficient Data Utilization and Model Performance Enhancement

Chiung-Yi Tseng, Junhao Song, Ziqian Bi, Tianyang Wang, Chia Xin Liang, Xinyuan Song, Ming Liu

Main category: cs.LG

TL;DR: This paper provides a comprehensive overview of Active Learning (AL), a machine learning strategy that achieves better performance with fewer labeled examples, covering concepts, applications, key research topics, and current challenges.

Details

Motivation: The paradox of data abundance and annotation scarcity has emerged as a critical bottleneck in machine learning advancement, motivating the need for more data-efficient learning approaches.

Method: The paper introduces basic concepts of Active Learning and discusses its applications across computer vision, natural language processing, transfer learning, and real-world scenarios, focusing on uncertainty estimation, class imbalance handling, domain adaptation, fairness, and evaluation metrics.

Result: The paper shows that Active Learning often outperforms passive learning, especially when good evaluation measures are used, and that human-inspired learning methods and question-guided approaches can improve data efficiency and learning effectiveness.

Conclusion: This work provides key insights for researchers and practitioners and proposes future directions for progress in Active Learning, addressing challenges such as rebuilding trust, ensuring reproducibility, and dealing with inconsistent methodologies.

Abstract: In the era of data-driven intelligence, the paradox of data abundance and annotation scarcity has emerged as a critical bottleneck in the advancement of machine learning. This paper gives a detailed overview of Active Learning (AL), which is a strategy in machine learning that helps models achieve better performance using fewer labeled examples. It introduces the basic concepts of AL and discusses how it is used in various fields such as computer vision, natural language processing, transfer learning, and real-world applications. The paper focuses on important research topics such as uncertainty estimation, handling of class imbalance, domain adaptation, fairness, and the creation of strong evaluation metrics and benchmarks. It also shows that learning methods inspired by humans and guided by questions can improve data efficiency and help models learn more effectively. In addition, this paper talks about current challenges in the field, including the need to rebuild trust, ensure reproducibility, and deal with inconsistent methodologies. It points out that AL often gives better results than passive learning, especially when good evaluation measures are used. This work aims to be useful for both researchers and practitioners by providing key insights and proposing directions for future progress in active learning.

[467] Empowering Time Series Forecasting with LLM-Agents

Chin-Chia Michael Yeh, Vivian Lai, Uday Singh Saini, Xiran Fan, Yujie Fan, Junpeng Wang, Xin Dai, Yan Zheng

Main category: cs.LG

TL;DR: DCATS is a data-centric agent for time series forecasting that improves data quality using metadata, achieving 6% error reduction across models.

Details

Motivation: Recent studies show lightweight models can achieve SOTA in time series forecasting, suggesting data quality improvement may be more impactful than model architecture search in AutoML.

Method: Leverages LLM-powered agents to clean time series data using accompanying metadata while optimizing forecasting performance.

Result: DCATS achieves average 6% error reduction across four time series forecasting models on large-scale traffic volume dataset.

Conclusion: Data-centric approaches show significant potential for AutoML in time series forecasting, outperforming traditional model-centric methods.

Abstract: Large Language Model (LLM) powered agents have emerged as effective planners for Automated Machine Learning (AutoML) systems. While most existing AutoML approaches focus on automating feature engineering and model architecture search, recent studies in time series forecasting suggest that lightweight models can often achieve state-of-the-art performance. This observation led us to explore improving data quality, rather than model architecture, as a potentially fruitful direction for AutoML on time series data. We propose DCATS, a Data-Centric Agent for Time Series. DCATS leverages metadata accompanying time series to clean data while optimizing forecasting performance. We evaluated DCATS using four time series forecasting models on a large-scale traffic volume forecasting dataset. Results demonstrate that DCATS achieves an average 6% error reduction across all tested models and time horizons, highlighting the potential of data-centric approaches in AutoML for time series forecasting.

[468] Establishing Linear Surrogate Regret Bounds for Convex Smooth Losses via Convolutional Fenchel-Young Losses

Yuzhou Cao, Han Bao, Lei Feng, Bo An

Main category: cs.LG

TL;DR: The paper overcomes the trade-off between smoothness and linear regret bounds by constructing convex smooth surrogate losses using Fenchel-Young losses based on convolutional negentropy, enabling efficient optimization while maintaining linear regret transfer.

Details

Motivation: There has been a belief in the community about a trade-off between loss smoothness and linear regret bounds, where smooth convex surrogates may deteriorate after regret transfer to target losses. This work aims to overcome this dilemma.

Method: Construct convex smooth surrogate losses using Fenchel-Young losses generated by convolutional negentropy, which are equivalent to the infimal convolution of generalized negentropy and target Bayes risk. This enables smoothness while maintaining linear regret bounds.

Result: Successfully constructed convex smooth surrogate losses that achieve linear surrogate regret bounds for arbitrary discrete target losses, overcoming the previously believed trade-off.

Conclusion: The infimal convolution approach demonstrates how convex analysis enables both optimization efficiency (through smoothness) and statistical efficiency (through linear regret bounds) in risk minimization, providing a novel solution to the smoothness-regret trade-off.

Abstract: Surrogate regret bounds, also known as excess risk bounds, bridge the gap between the convergence rates of surrogate and target losses. The regret transfer is lossless if the surrogate regret bound is linear. While convex smooth surrogate losses are appealing in particular due to the efficient estimation and optimization, the existence of a trade-off between the loss smoothness and linear regret bound has been believed in the community. Under this scenario, the better optimization and estimation properties of convex smooth surrogate losses may inevitably deteriorate after undergoing the regret transfer onto a target loss. We overcome this dilemma for arbitrary discrete target losses by constructing a convex smooth surrogate loss, which entails a linear surrogate regret bound composed with a tailored prediction link. The construction is based on Fenchel–Young losses generated by the convolutional negentropy, which are equivalent to the infimal convolution of a generalized negentropy and the target Bayes risk. Consequently, the infimal convolution enables us to derive a smooth loss while maintaining the surrogate regret bound linear. We additionally benefit from the infimal convolution to have a consistent estimator of the underlying class probability. Our results are overall a novel demonstration of how convex analysis penetrates into optimization and statistical efficiency in risk minimization.

[469] Enhancing Training Data Attribution with Representational Optimization

Weiwei Sun, Haokun Liu, Nikhil Kandpal, Colin Raffel, Yiming Yang

Main category: cs.LG

TL;DR: AirRep is a scalable training data attribution method that learns task-specific representations optimized for attribution, matching gradient-based methods’ performance while being much more efficient.

Details

Motivation: Current TDA methods face a trade-off: gradient-based methods are accurate but computationally expensive, while representation-based methods are scalable but use heuristic embeddings not optimized for attribution, limiting their fidelity.

Method: AirRep learns task-specific and model-aligned representations through a trainable encoder optimized for attribution quality, plus an attention-based pooling mechanism for group-wise influence estimation. It’s trained using a ranking objective over automatically constructed training subsets.

Result: Experiments on instruction-tuned LLMs show AirRep achieves performance comparable to state-of-the-art gradient-based approaches while being nearly two orders of magnitude more efficient at inference time. It also demonstrates robustness and generalization across tasks and models.

Conclusion: AirRep successfully bridges the gap between accuracy and efficiency in training data attribution, providing a scalable solution that maintains high fidelity while dramatically reducing computational costs.

Abstract: Training data attribution (TDA) methods aim to measure how training data impacts a model’s predictions. While gradient-based attribution methods, such as influence functions, offer theoretical grounding, their computational costs make them impractical for large-scale applications. Representation-based approaches are far more scalable, but typically rely on heuristic embeddings that are not optimized for attribution, limiting their fidelity. To address these challenges, we propose AirRep, a scalable, representation-based approach that closes this gap by learning task-specific and model-aligned representations optimized explicitly for TDA. AirRep introduces two key innovations: a trainable encoder tuned for attribution quality, and an attention-based pooling mechanism that enables accurate estimation of group-wise influence. We train AirRep using a ranking objective over automatically constructed training subsets labeled by their empirical effect on target predictions. Experiments on instruction-tuned LLMs demonstrate that AirRep achieves performance on par with state-of-the-art gradient-based approaches while being nearly two orders of magnitude more efficient at inference time. Further analysis highlights its robustness and generalization across tasks and models. Our code is available at https://github.com/sunnweiwei/AirRep

[470] Asymmetric Duos: Sidekicks Improve Uncertainty

Tim G. Zhou, Evan Shelhamer, Geoff Pleiss

Main category: cs.LG

TL;DR: Asymmetric Duos: Pairing large models with smaller sidekick models to improve uncertainty quantification and performance with minimal computational overhead.

Details

Motivation: Traditional ensembling methods are too computationally expensive for today's large-scale models, creating a need for cost-effective uncertainty quantification strategies.

Method: Coupling a large model with a smaller ‘sidekick’ model and aggregating their predictions through learned weighted averaging.

Result: Across five image classification benchmarks, Asymmetric Duos significantly improved accuracy, uncertainty quantification, and selective classification metrics with only ~10-20% more computation.

Conclusion: Small sidekick models can effectively enhance large models’ performance and uncertainty quantification without harming performance, offering a practical alternative to expensive ensembling.

Abstract: The go-to strategy to apply deep networks in settings where uncertainty informs decisions–ensembling multiple training runs with random initializations–is ill-suited for the extremely large-scale models and practical fine-tuning workflows of today. We introduce a new cost-effective strategy for improving the uncertainty quantification and downstream decisions of a large model (e.g. a fine-tuned ViT-B): coupling it with a less accurate but much smaller “sidekick” (e.g. a fine-tuned ResNet-34) with a fraction of the computational cost. We propose aggregating the predictions of this Asymmetric Duo by simple learned weighted averaging. Surprisingly, despite their inherent asymmetry, the sidekick model almost never harms the performance of the larger model. In fact, across five image classification benchmarks and a variety of model architectures and training schemes (including soups), Asymmetric Duos significantly improve accuracy, uncertainty quantification, and selective classification metrics with only ${\sim}10-20%$ more computation.

[471] Equivariant Flow Matching for Symmetry-Breaking Bifurcation Problems

Fleur Hendriks, Ondřej Rokoš, Martin Doškář, Marc G. D. Geers, Vlado Menkovski

Main category: cs.LG

TL;DR: A generative framework using flow matching to model probability distributions over bifurcation outcomes, enabling direct sampling of multiple solutions while preserving system symmetries through equivariant modeling.

Details

Motivation: Deterministic machine learning models fail to capture multiple coexisting stable solutions in nonlinear dynamical systems with symmetry breaking, averaging over solutions and missing lower-symmetry outcomes.

Method: Proposes flow matching with symmetric matching strategy that aligns predicted and target outputs under group actions for equivariant modeling, enabling direct sampling of multiple valid solutions.

Result: Flow matching significantly outperforms non-probabilistic and variational methods in capturing multimodal distributions and symmetry-breaking bifurcations across toy models, buckling beams, and Allen-Cahn equation.

Conclusion: The approach offers a principled and scalable solution for modeling multistability in high-dimensional systems by accurately capturing the full probability distribution over bifurcation outcomes.

Abstract: Bifurcation phenomena in nonlinear dynamical systems often lead to multiple coexisting stable solutions, particularly in the presence of symmetry breaking. Deterministic machine learning models struggle to capture this multiplicity, averaging over solutions and failing to represent lower-symmetry outcomes. In this work, we propose a generative framework based on flow matching to model the full probability distribution over bifurcation outcomes. Our method enables direct sampling of multiple valid solutions while preserving system symmetries through equivariant modeling. We introduce a symmetric matching strategy that aligns predicted and target outputs under group actions, allowing accurate learning in equivariant settings. We validate our approach on a range of systems, from toy models to complex physical problems such as buckling beams and the Allen-Cahn equation. Our results demonstrate that flow matching significantly outperforms non-probabilistic and variational methods in capturing multimodal distributions and symmetry-breaking bifurcations, offering a principled and scalable solution for modeling multistability in high-dimensional systems.

[472] Alignment of large language models with constrained learning

Botong Zhang, Shuo Li, Ignacio Hounie, Osbert Bastani, Dongsheng Ding, Alejandro Ribeiro

Main category: cs.LG

TL;DR: The paper develops a dual-based alignment method for LLMs that maximizes primary reward while satisfying secondary utility constraints, addressing convergence issues in existing methods.

Details

Motivation: Existing Lagrangian-based LLM policy search methods often fail to converge, and non-iterative dual methods don't achieve optimality in LLM parameter space, creating a need for better constrained alignment approaches.

Method: Iterative dual-based alignment that alternates between updating LLM policy via Lagrangian maximization and updating dual variable via dual descent, using Lagrangian duality theory.

Result: The method achieves optimal constrained LLM policies up to parametrization gap, with theoretical guarantees on primal-dual gap and optimality gap, validated on PKU-SafeRLHF and Anthropic HH-RLHF datasets.

Conclusion: Dual-based alignment methods can effectively find optimal constrained LLM policies, providing theoretical foundations and practical solutions for constrained alignment problems.

Abstract: We study the problem of computing an optimal large language model (LLM) policy for the constrained alignment problem, where the goal is to maximize a primary reward objective while satisfying constraints on secondary utilities. Despite the popularity of Lagrangian-based LLM policy search in constrained alignment, iterative primal-dual methods often fail to converge, and non-iterative dual-based methods do not achieve optimality in the LLM parameter space. To address these challenges, we employ Lagrangian duality to develop an iterative dual-based alignment method that alternates between updating the LLM policy via Lagrangian maximization and updating the dual variable via dual descent. In theory, we characterize the primal-dual gap between the primal value in the distribution space and the dual value in the LLM parameter space. We further quantify the optimality gap of the learned LLM policies at near-optimal dual variables with respect to both the objective and the constraint functions. These results prove that dual-based alignment methods can find an optimal constrained LLM policy, up to an LLM parametrization gap. We demonstrate the effectiveness and merits of our approach through extensive experiments conducted on the PKU-SafeRLHF and Anthropic HH-RLHF datasets.

[473] Deep Actor-Critics with Tight Risk Certificates

Bahareh Tasdighi, Manuel Haussmann, Yi-Shan Wu, Andres R. Masegosa, Melih Kandemir

Main category: cs.LG

TL;DR: Develops tight risk certificates for deep actor-critic algorithms using minimal validation data and PAC-Bayes theory to predict generalization performance.

Details

Motivation: Deep actor-critic algorithms are widely used but lack validation schemes that quantify their risk of malfunction, limiting deployment in physical systems.

Method: Uses minimal evaluation data (small set of roll-outs) combined with recursive PAC-Bayes approach that builds bounds on excess loss using data-informed priors from previous validation portions.

Result: Empirical results across locomotion tasks, actor-critic methods, and policy expertise levels show risk certificates tight enough for practical use.

Conclusion: It’s possible to develop accurate risk certificates for deep actor-critic algorithms using minimal validation data and PAC-Bayes theory, enabling safer deployment in physical systems.

Abstract: Deep actor-critic algorithms have reached a level where they influence everyday life. They are a driving force behind continual improvement of large language models through user feedback. However, their deployment in physical systems is not yet widely adopted, mainly because no validation scheme fully quantifies their risk of malfunction. We demonstrate that it is possible to develop tight risk certificates for deep actor-critic algorithms that predict generalization performance from validation-time observations. Our key insight centers on the effectiveness of minimal evaluation data. A small feasible set of evaluation roll-outs collected from a pretrained policy suffices to produce accurate risk certificates when combined with a simple adaptation of PAC-Bayes theory. Specifically, we adopt a recently introduced recursive PAC-Bayes approach, which splits validation data into portions and recursively builds PAC-Bayes bounds on the excess loss of each portion’s predictor, using the predictor from the previous portion as a data-informed prior. Our empirical results across multiple locomotion tasks, actor-critic methods, and policy expertise levels demonstrate risk certificates tight enough to be considered for practical use.

[474] Inference-Time Alignment of Diffusion Models via Evolutionary Algorithms

Purvish Jajal, Nick John Eliopoulos, Benjamin Shiue-Hal Chou, George K. Thiruvathukal, James C. Davis, Yung-Hsiang Lu

Main category: cs.LG

TL;DR: Evolutionary algorithm-based inference-time alignment for diffusion models that treats models as black boxes and searches latent space to maximize alignment objectives without requiring gradients or internal model access.

Details

Motivation: Diffusion models often fail to satisfy application objectives like safety constraints or domain-specific validity, and existing alignment techniques require gradients, internal model access, or large computational budgets.

Method: Uses evolutionary algorithms to search the latent space of diffusion models as black boxes to maximize alignment objectives during inference time.

Result: Achieves 3-35% higher ImageReward scores than gradient-free and gradient-based methods with equal or less running time, competitive results on Open Image Preferences dataset across four alignment objectives, 55-76% less GPU memory usage, and 72-80% faster than gradient-based methods.

Conclusion: The evolutionary algorithm-based approach provides an efficient and effective inference-time alignment method for diffusion models that outperforms existing techniques in both performance and computational efficiency.

Abstract: Diffusion models are state-of-the-art generative models, yet their samples often fail to satisfy application objectives such as safety constraints or domain-specific validity. Existing techniques for alignment require gradients, internal model access, or large computational budgets resulting in high compute demands, or lack of support for certain objectives. In response, we introduce an inference-time alignment framework based on evolutionary algorithms. We treat diffusion models as black boxes and search their latent space to maximize alignment objectives. Given equal or less running time, our method achieves 3-35% higher ImageReward scores than gradient-free and gradient-based methods. On the Open Image Preferences dataset, our method achieves competitive results across four popular alignment objectives. In terms of computational efficiency, we require 55% to 76% less GPU memory and are 72% to 80% faster than gradient-based methods.

[475] A Unified Noise-Curvature View of Loss of Trainability

Gunbir Singh Baveja, Alex Lewandowski, Mark Schmidt

Main category: cs.LG

TL;DR: The paper analyzes loss of trainability in continual learning and proposes a step-size scheduler that prevents this phenomenon by using adaptive noise thresholds based on gradient noise and curvature volatility.

Details

Motivation: Loss of trainability occurs when parameter updates stop making progress on optimization objectives in continual learning, causing accuracy to stall or degrade. Existing individual indicators fail to reliably predict this phenomenon.

Method: Introduces two new indicators: batch-size-aware gradient-noise bound and curvature volatility-controlled bound. Combines these into a per-layer adaptive noise threshold on effective step-size. Proposes a step-size scheduler that keeps each layer’s parameter updates below this bound.

Result: The proposed scheduler improves accuracy maintained by existing approaches (CReLU, Wasserstein regularizer, L2 weight decay) and produces adaptive step-size trajectories that mirror manually engineered decay schedules without tuning.

Conclusion: Loss of trainability can be effectively prevented using adaptive step-size scheduling based on gradient noise and curvature volatility bounds, providing better performance than existing methods and automatically generating optimal step-size decay patterns.

Abstract: Loss of trainability refers to a phenomenon in continual learning where parameter updates no longer make progress on the optimization objective, so accuracy stalls or degrades as the learning problem changes over time. In this paper, we analyze loss of trainability through an optimization lens and find that the phenomenon is not reliably predicted by existing individual indicators such as Hessian rank, sharpness level, weight or gradient norms, gradient-to-parameter ratios, and unit-sign entropy. Motivated by our analysis, we introduce two complementary indicators: a batch-size-aware gradient-noise bound and a curvature volatility-controlled bound. We then combine these two indicators into a per-layer adaptive noise threshold on the effective step-size that anticipates trainability behavior. Using this insight, we propose a step-size scheduler that keeps each layer’s effective parameter update below this bound, thereby avoiding loss of trainability. We demonstrate that our scheduler can improve the accuracy maintained by previously proposed approaches, such as concatenated ReLU (CReLU), Wasserstein regularizer, and L2 weight decay. Surprisingly, our scheduler produces adaptive step-size trajectories that, without tuning, mirror the manually engineered step-size decay schedules.

[476] ENMA: Tokenwise Autoregression for Generative Neural PDE Operators

Armand Kassaï Koupaï, Lise Le Boudec, Louis Serrano, Patrick Gallinari

Main category: cs.LG

TL;DR: ENMA is a generative neural operator that uses masked autoregressive transformers with flow matching to predict spatio-temporal PDE dynamics in compressed latent space, enabling robust generalization across physical parameters.

Details

Motivation: Solving time-dependent parametric PDEs is challenging for neural solvers, especially with uncertain/incomplete data and need for generalization across physical parameters.

Method: Uses generative masked autoregressive transformer with flow matching loss for tokenwise generation in compressed latent space; encodes irregular spatial observations via attention mechanisms and spatio-temporal convolutional encoder; supports in-context learning.

Result: Creates a robust framework that generalizes to new PDE regimes and supports one-shot surrogate modeling of time-dependent parametric PDEs.

Conclusion: ENMA provides an adaptable generative approach for modeling spatio-temporal dynamics from physical phenomena with strong generalization capabilities.

Abstract: Solving time-dependent parametric partial differential equations (PDEs) remains a fundamental challenge for neural solvers, particularly when generalizing across a wide range of physical parameters and dynamics. When data is uncertain or incomplete-as is often the case-a natural approach is to turn to generative models. We introduce ENMA, a generative neural operator designed to model spatio-temporal dynamics arising from physical phenomena. ENMA predicts future dynamics in a compressed latent space using a generative masked autoregressive transformer trained with flow matching loss, enabling tokenwise generation. Irregularly sampled spatial observations are encoded into uniform latent representations via attention mechanisms and further compressed through a spatio-temporal convolutional encoder. This allows ENMA to perform in-context learning at inference time by conditioning on either past states of the target trajectory or auxiliary context trajectories with similar dynamics. The result is a robust and adaptable framework that generalizes to new PDE regimes and supports one-shot surrogate modeling of time-dependent parametric PDEs.

[477] The Impossibility of Inverse Permutation Learning in Transformer Models

Rohan Alur, Chris Hays, Manish Raghavan, Devavrat Shah

Main category: cs.LG

TL;DR: Decoder-only transformers cannot learn inverse permutation tasks, but adding scratch tokens makes it possible, suggesting a mechanism for how chain-of-thought reasoning works.

Details

Motivation: To study the robustness property of transformers across reasoning tasks like long-context retrieval and multiple choice QA, using inverse permutation learning as a model.

Method: Analyzed the expressive capacity of decoder-only transformers for inverse permutation learning, tested alternative constructions including causal attention masks and padding with scratch tokens.

Result: Proved an impossibility result for decoder-only transformers learning inverse permutations, but showed feasibility with causal attention masks or scratch token padding.

Conclusion: Scratch tokens may enable reasoning in LLMs through intermediate computations, even without semantic meaning, explaining chain-of-thought prompting effectiveness.

Abstract: In this technical note, we study the problem of inverse permutation learning in decoder-only transformers. Given a permutation and a string to which that permutation has been applied, the model is tasked with producing the original (canonical'') string. We argue that this task models a natural robustness property across a variety of reasoning tasks, including long-context retrieval, multiple choice QA and in-context learning. Our primary contribution is an impossibility result: we show that an arbitrary depth, decoder-only transformer cannot learn this task. This result concerns the expressive capacity of decoder-only transformer models and is agnostic to training dynamics or sample complexity. We give a pair of alternative constructions under which inverse permutation learning is feasible. The first of these highlights the fundamental role of the causal attention mask, and reveals a gap between the expressivity of encoder-decoder transformers and the more popular decoder-only architecture. The latter result is more surprising: we show that simply padding the input with scratch tokens" yields a construction under which inverse permutation learning is possible. We conjecture that this may suggest an alternative mechanism by which chain-of-thought prompting or, more generally, intermediate ``thinking’’ tokens can enable reasoning in large language models, even when these tokens encode no meaningful semantic information (e.g., the results of intermediate computations).

[478] ConStellaration: A dataset of QI-like stellarator plasma boundaries and optimization benchmarks

Santiago A. Cadena, Andrea Merlo, Emanuel Laude, Alexander Bauer, Atul Agrawal, Maria Pascu, Marija Savtchouk, Enrico Guiraud, Lukas Bonauer, Stuart Hudson, Markus Kaiser

Main category: cs.LG

TL;DR: The paper introduces an open dataset and optimization benchmarks for quasi-isodynamic stellarator design to accelerate fusion energy research by enabling data-driven approaches and lowering entry barriers.

Details

Motivation: Stellarator optimization is bottlenecked by lack of standardized problems, datasets, and baselines, particularly for quasi-isodynamic configurations which are promising for commercial fusion due to disruption resilience.

Method: Created dataset by sampling QI fields and optimizing plasma boundaries, introduced three benchmarks with increasing complexity, provided reference code and classical optimization baselines, and demonstrated learned models can generate novel configurations.

Result: Released open dataset of QI-like stellarator plasma boundaries with equilibria and performance metrics, established optimization benchmarks with strong baselines, and showed data-driven models can efficiently generate feasible configurations without expensive physics simulations.

Conclusion: The open dataset and benchmarks aim to lower entry barriers for optimization and ML researchers in stellarator design, accelerating cross-disciplinary progress toward fusion energy.

Abstract: Stellarators are magnetic confinement devices under active development to deliver steady-state carbon-free fusion energy. Their design involves a high-dimensional, constrained optimization problem that requires expensive physics simulations and significant domain expertise. Recent advances in plasma physics and open-source tools have made stellarator optimization more accessible. However, broader community progress is currently bottlenecked by the lack of standardized optimization problems with strong baselines and datasets that enable data-driven approaches, particularly for quasi-isodynamic (QI) stellarator configurations, considered as a promising path to commercial fusion due to their inherent resilience to current driven disruptions. Here, we release an open dataset of diverse QI-like stellarator plasma boundary shapes, paired with their ideal magnetohydrodynamic (MHD) equilibria and performance metrics. We generated this dataset by sampling a variety of QI fields and optimizing corresponding stellarator plasma boundaries. We introduce three optimization benchmarks of increasing complexity: (1) a single objective geometric optimization problem, (2) a “simple-to-build” QI stellarator, and (3) a multi-objective ideal-MHD stable QI stellarator that investigates trade-offs between compactness and coil simplicity. For every benchmark, we provide reference code, evaluation scripts, and strong baselines based on classical optimization techniques. Finally, we show how learned models trained on our dataset can efficiently generate novel, feasible configurations without querying expensive physics oracles. By openly releasing the dataset along with benchmark problems and baselines, we aim to lower the entry barrier for optimization and machine learning researchers to engage in stellarator design and to accelerate cross-disciplinary progress toward bringing fusion energy to the grid.

[479] Action Chunking and Exploratory Data Collection Yield Exponential Improvements in Behavior Cloning for Continuous Control

Thomas T. Zhang, Daniel Pfrommer, Chaoyi Pan, Nikolai Matni, Max Simchowitz

Main category: cs.LG

TL;DR: Theoretical analysis shows action-chunking and exploratory augmentation in imitation learning circumvent exponential compounding errors through control-theoretic stability mechanisms.

Details

Motivation: To understand why action-chunking and exploratory data collection interventions work effectively in imitation learning despite known issues with compounding errors in continuous control settings.

Method: Theoretical analysis using control-theoretic stability framework, combined with empirical validation on robot learning benchmarks.

Result: Action-chunking and exploratory augmentation avoid exponential compounding errors in different regimes, with control-theoretic stability identified as the key mechanism. Provides tighter statistical guarantees than previous information-theoretic approaches.

Conclusion: Control-theoretic analysis offers superior insights into imitation learning error dynamics compared to purely information-theoretic approaches, explaining the effectiveness of action-chunking and exploratory interventions.

Abstract: This paper presents a theoretical analysis of two of the most impactful interventions in modern learning from demonstration in robotics and continuous control: the practice of action-chunking (predicting sequences of actions in open-loop) and exploratory augmentation of expert demonstrations. Though recent results show that learning from demonstration, also known as imitation learning (IL), can suffer errors that compound exponentially with task horizon in continuous settings, we demonstrate that action chunking and exploratory data collection circumvent exponential compounding errors in different regimes. Our results identify control-theoretic stability as the key mechanism underlying the benefits of these interventions. On the empirical side, we validate our predictions and the role of control-theoretic stability through experimentation on popular robot learning benchmarks. On the theoretical side, we demonstrate that the control-theoretic lens provides fine-grained insights into how compounding error arises, leading to tighter statistical guarantees on imitation learning error when these interventions are applied than previous techniques based on information-theoretic considerations alone.

[480] Deep RL Dual Sourcing Inventory Management with Supply and Capacity Risk Awareness

Defeng Liu, Ying Liu, Carson Eisenach

Main category: cs.LG

TL;DR: Uses reinforcement learning with intervention models and pre-trained deep learning modules to solve large-scale stochastic optimization problems, demonstrated on multi-sourcing inventory management.

Details

Motivation: To efficiently solve large-scale stochastic optimization problems by better exploring solution space and handling complex constraints in a scalable way.

Method: Leverages intervention models with pre-trained DL models to simulate stochastic processes, uses deep RL for learning supply chain processes, and introduces constraint coordination mechanism for dual cost forecasting.

Result: Approach breaks down complex supply chain processes into scalable DL modules, leading to improved performance on large real-world datasets.

Conclusion: Methodology enables more efficient RL application for stochastic optimization by decomposing complex problems into composable modules, with identified open problems for future research.

Abstract: In this work, we study how to efficiently apply reinforcement learning (RL) for solving large-scale stochastic optimization problems by leveraging intervention models. The key of the proposed methodology is to better explore the solution space by simulating and composing the stochastic processes using pre-trained deep learning (DL) models. We demonstrate our approach on a challenging real-world application, the multi-sourcing multi-period inventory management problem in supply chain optimization. In particular, we employ deep RL models for learning and forecasting the stochastic supply chain processes under a range of assumptions. Moreover, we also introduce a constraint coordination mechanism, designed to forecast dual costs given the cross-products constraints in the inventory network. We highlight that instead of directly modeling the complex physical constraints into the RL optimization problem and solving the stochastic problem as a whole, our approach breaks down those supply chain processes into scalable and composable DL modules, leading to improved performance on large real-world datasets. We also outline open problems for future research to further investigate the efficacy of such models.

[481] Geometric Multi-color Message Passing Graph Neural Networks for Blood-brain Barrier Permeability Prediction

Trung Nguyen, Md Masud Rana, Farjana Tasnim Mukta, Chang-Guo Zhan, Duc Duy Nguyen

Main category: cs.LG

TL;DR: GMC-MPNN is a geometric graph neural network that incorporates 3D atomic geometry and long-range interactions to improve blood-brain barrier permeability prediction, outperforming existing methods on benchmark datasets.

Details

Motivation: Current graph neural networks for molecular property prediction rely mainly on molecular topology and neglect crucial 3D geometric information needed to model transport mechanisms like BBB permeability.

Method: The paper introduces GMC-MPNN, which enhances standard message-passing architectures by incorporating atomic-level geometric features and long-range interactions through weighted colored subgraphs based on atom types to capture spatial relationships.

Result: GMC-MPNN achieved state-of-the-art performance with AUC-ROC of 0.9704/0.9685 for classification and RMSE of 0.4609 with Pearson correlation of 0.7759 for regression on benchmark datasets using scaffold-based splitting.

Conclusion: By integrating spatial geometry into graph representations, GMC-MPNN sets a new performance benchmark and provides a more accurate, generalizable tool for drug discovery, with ablation studies confirming the importance of learning from both common and rare functional motifs.

Abstract: Accurate prediction of blood-brain barrier permeability (BBBP) is essential for central nervous system (CNS) drug development. While graph neural networks (GNNs) have advanced molecular property prediction, they often rely on molecular topology and neglect the three-dimensional geometric information crucial for modeling transport mechanisms. This paper introduces the geometric multi-color message-passing graph neural network (GMC-MPNN), a novel framework that enhances standard message-passing architectures by explicitly incorporating atomic-level geometric features and long-range interactions. Our model constructs weighted colored subgraphs based on atom types to capture the spatial relationships and chemical context that govern BBB permeability. We evaluated GMC-MPNN on three benchmark datasets for both classification and regression tasks, using rigorous scaffold-based splitting to ensure a robust assessment of generalization. The results demonstrate that GMC-MPNN consistently outperforms existing state-of-the-art models, achieving superior performance in both classifying compounds as permeable/non-permeable (AUC-ROC of 0.9704 and 0.9685) and in regressing continuous permeability values (RMSE of 0.4609, Pearson correlation of 0.7759). An ablation study further quantified the impact of specific atom-pair interactions, revealing that the model’s predictive power derives from its ability to learn from both common and rare, but chemically significant, functional motifs. By integrating spatial geometry into the graph representation, GMC-MPNN sets a new performance benchmark and offers a more accurate and generalizable tool for drug discovery pipelines.

[482] Weak-to-Strong Generalization under Distribution Shifts

Myeongho Jeon, Jan Sobotka, Suhwan Choi, Maria Brbić

Main category: cs.LG

TL;DR: RAVEN is a robust weak-to-strong generalization framework that addresses the failure of naive weak-to-strong supervision under distribution shifts by dynamically learning optimal combinations of weak models and strong model parameters.

Details

Motivation: As superhuman models become more complex, human supervision becomes insufficient. Weak-to-strong generalization works but fails under distribution shifts, often making strong models perform worse than their weak supervisors.

Method: RAVEN dynamically learns optimal combinations of weak models in addition to parameters of the strong model, enabling robust supervision across distribution shifts.

Result: RAVEN outperforms baselines by over 30% on out-of-distribution tasks while matching or surpassing existing methods on in-distribution tasks. It automatically assigns higher weights to more accurate weak models.

Conclusion: RAVEN provides an effective framework for robust weak-to-strong generalization that can handle distribution shifts and automatically identify trustworthy supervision sources.

Abstract: As future superhuman models become increasingly complex, accurately supervising their behavior may exceed human capabilities. Recent works have demonstrated that in such scenarios, weak models can effectively supervise strong models, a phenomenon known as weak-to-strong generalization. However, we find that naive weak-to-strong generalization fails under distribution shifts, often leading to worse performance of the strong model than its weak supervisors. To address this, we propose RAVEN, a robust weak-to-strong generalization framework that dynamically learns the optimal combinations of weak models in addition to parameters of the strong model. We demonstrate the effectiveness of RAVEN on image classification, text classification, and preference alignment tasks. RAVEN outperforms alternative baselines by over 30% on out-of-distribution tasks while matching or surpassing existing methods on in-distribution tasks. Moreover, our results show that RAVEN assigns higher weights to more accurate weak models, demonstrating its ability to automatically identify trustworthy supervision.

[483] Learning in Stackelberg Mean Field Games: A Non-Asymptotic Analysis

Sihan Zeng, Benjamin Patrick Evans, Sujay Bhatt, Leo Ardon, Sumitra Ganesh, Alec Koppel

Main category: cs.LG

TL;DR: AC-SMFG is a single-loop actor-critic algorithm for Stackelberg mean field games that provides finite-time convergence guarantees and efficient sample usage without restrictive independence assumptions.

Details

Motivation: Existing methods for Stackelberg MFGs rely on restrictive independence assumptions, use samples inefficiently due to nested-loop structures, and lack finite-time convergence guarantees.

Method: Proposed AC-SMFG algorithm uses single-loop structure with alternating (semi-)gradient updates for leader, representative follower, and mean field, operating on continuously generated Markovian samples.

Result: Established finite-time and finite-sample convergence to stationary point of Stackelberg objective, outperforming existing baselines in policy quality and convergence speed in economics environments.

Conclusion: AC-SMFG is the first Stackelberg MFG algorithm with non-asymptotic convergence guarantees, using a relaxed ‘gradient alignment’ condition instead of restrictive independence assumptions.

Abstract: We study policy optimization in Stackelberg mean field games (MFGs), a hierarchical framework for modeling the strategic interaction between a single leader and an infinitely large population of homogeneous followers. The objective can be formulated as a structured bi-level optimization problem, in which the leader needs to learn a policy maximizing its reward, anticipating the response of the followers. Existing methods for solving these (and related) problems often rely on restrictive independence assumptions between the leader’s and followers’ objectives, use samples inefficiently due to nested-loop algorithm structure, and lack finite-time convergence guarantees. To address these limitations, we propose AC-SMFG, a single-loop actor-critic algorithm that operates on continuously generated Markovian samples. The algorithm alternates between (semi-)gradient updates for the leader, a representative follower, and the mean field, and is simple to implement in practice. We establish the finite-time and finite-sample convergence of the algorithm to a stationary point of the Stackelberg objective. To our knowledge, this is the first Stackelberg MFG algorithm with non-asymptotic convergence guarantees. Our key assumption is a “gradient alignment” condition, which requires that the full policy gradient of the leader can be approximated by a partial component of it, relaxing the existing leader-follower independence assumption. Simulation results in a range of well-established economics environments demonstrate that AC-SMFG outperforms existing multi-agent and MFG learning baselines in policy quality and convergence speed.

[484] A Conditional Distribution Equality Testing Framework using Deep Generative Learning

Siming Zheng, Tong Wang, Meifang Lan, Yuanyuan Lin

Main category: cs.LG

TL;DR: A neural network-based framework for testing conditional distribution equality in two-sample problems, with applications to covariate shift and causal discovery.

Details

Motivation: Address the need for testing conditional distribution equality in two-sample problems, particularly relevant for covariate shift scenarios and causal discovery applications.

Method: Transform conditional testing into unconditional testing using neural network-based generative methods and sample splitting techniques, introducing GCA-CDET (Generative Classification Accuracy-based Conditional Distribution Equality Test).

Result: Established convergence rate for learned generators using offset Rademacher complexity and proved testing consistency under mild conditions. Empirical studies on synthetic and real-world datasets demonstrate effectiveness.

Conclusion: The proposed framework provides an effective approach for conditional distribution equality testing with theoretical guarantees and practical performance.

Abstract: In this paper, we propose a general framework for testing the conditional distribution equality in a two-sample problem, which is most relevant to covariate shift and causal discovery. Our framework is built on neural network-based generative methods and sample splitting techniques by transforming the conditional testing problem into an unconditional one. We introduce the generative classification accuracy-based conditional distribution equality test (GCA-CDET) to illustrate the proposed framework. We establish the convergence rate for the learned generator by deriving new results related to the recently-developed offset Rademacher complexity and prove the testing consistency of GCA-CDET under mild conditions.Empirically, we conduct numerical studies including synthetic datasets and two real-world datasets, demonstrating the effectiveness of our approach. Additional discussions on the optimality of the proposed framework are provided in the online supplementary material.

[485] Diffusion Models at the Drug Discovery Frontier: A Review on Generating Small Molecules versus Therapeutic Peptides

Yiquan Wang, Yahui Ma, Yuhan Chang, Jiayao Yan, Jialin Zhang, Minnuo Cai, Kai Wei

Main category: cs.LG

TL;DR: Diffusion models are transforming drug discovery by enabling generative modeling for small molecules and therapeutic peptides, though each modality faces distinct challenges that require bridging gaps and integration into automated DBTL platforms.

Details

Motivation: To leverage diffusion models' potential in revolutionizing the traditionally slow and costly drug discovery process, particularly for designing small molecules and therapeutic peptides.

Method: Systematic comparison of diffusion model applications using iterative denoising framework adapted to different molecular representations, chemical spaces, and design objectives for small molecules and therapeutic peptides.

Result: Small molecule models excel at structure-based design but struggle with chemical synthesizability; peptide models focus on functional sequence generation but face challenges with biological stability, proper folding, and immunogenicity minimization.

Conclusion: Full potential of diffusion models requires bridging modality-specific gaps and integrating them into automated Design-Build-Test-Learn platforms to shift from chemical exploration to on-demand therapeutic engineering.

Abstract: Diffusion models have emerged as a leading framework in generative modeling, poised to transform the traditionally slow and costly process of drug discovery. This review provides a systematic comparison of their application in designing two principal therapeutic modalities: small molecules and therapeutic peptides. We dissect how the unified framework of iterative denoising is adapted to the distinct molecular representations, chemical spaces, and design objectives of each modality. For small molecules, these models excel at structure-based design, generating novel, pocket-fitting ligands with desired physicochemical properties, yet face the critical hurdle of ensuring chemical synthesizability. Conversely, for therapeutic peptides, the focus shifts to generating functional sequences and designing de novo structures, where the primary challenges are achieving biological stability against proteolysis, ensuring proper folding, and minimizing immunogenicity. Despite these distinct challenges, both domains face shared hurdles: the scarcity of high-quality experimental data, the reliance on inaccurate scoring functions for validation, and the crucial need for experimental validation. We conclude that the full potential of diffusion models will be unlocked by bridging these modality-specific gaps and integrating them into automated, closed-loop Design-Build-Test-Learn (DBTL) platforms, thereby shifting the paradigm from mere chemical exploration to the on-demand engineering of novel~therapeutics.

[486] CRPS-LAM: Regional ensemble weather forecasting from matching marginals

Erik Larsson, Joel Oskarsson, Tomas Landelius, Fredrik Lindsten

Main category: cs.LG

TL;DR: CRPS-LAM is a probabilistic regional weather forecasting model that uses CRPS-based training to generate ensemble members in a single forward pass, achieving 39x faster sampling than diffusion models while maintaining accuracy.

Details

Motivation: Diffusion-based models for weather prediction show strong performance but are computationally expensive at sampling time. The authors aim to develop a more efficient probabilistic forecasting method for Limited-Area Modeling (LAM).

Method: The model is trained with a Continuous Ranked Probability Score (CRPS)-based objective and generates ensemble members by sampling and injecting a single latent noise vector into the model in a single forward pass.

Result: CRPS-LAM achieves sampling speeds up to 39 times faster than diffusion-based models while matching their low errors on the MEPS regional dataset. It also retains fine-scale forecast details.

Conclusion: CRPS-LAM stands out as an effective approach for probabilistic regional weather forecasting, offering computational efficiency without sacrificing forecast quality.

Abstract: Machine learning for weather prediction increasingly relies on ensemble methods to provide probabilistic forecasts. Diffusion-based models have shown strong performance in Limited-Area Modeling (LAM) but remain computationally expensive at sampling time. Building on the success of global weather forecasting models trained based on Continuous Ranked Probability Score (CRPS), we introduce CRPS-LAM, a probabilistic LAM forecasting model trained with a CRPS-based objective. By sampling and injecting a single latent noise vector into the model, CRPS-LAM generates ensemble members in a single forward pass, achieving sampling speeds up to 39 times faster than a diffusion-based model. We evaluate the model on the MEPS regional dataset, where CRPS-LAM matches the low errors of diffusion models. By retaining also fine-scale forecast details, the method stands out as an effective approach for probabilistic regional weather forecasting

[487] A Connection Between Score Matching and Local Intrinsic Dimension

Eric Yeats, Aaron Jacobson, Darryl Hannan, Yiran Jia, Timothy Doster, Henry Kvinge, Scott Mahan

Main category: cs.LG

TL;DR: The paper proposes using denoising score matching loss as a scalable local intrinsic dimension (LID) estimator that outperforms existing methods in accuracy and memory efficiency.

Details

Motivation: Existing LID estimation methods using diffusion models require many forward passes or gradient computations, limiting their applicability in compute- and memory-constrained scenarios.

Method: The authors show that LID is a lower bound on the denoising score matching loss and demonstrate that the equivalent implicit score matching loss also approximates LID via the normal dimension, relating it to the FLIPD estimator.

Result: Experiments on manifold benchmarks and Stable Diffusion 3.5 show the denoising score matching loss achieves superior accuracy and memory footprint under increasing problem size and quantization levels.

Conclusion: The denoising score matching loss serves as a highly competitive and scalable LID estimator that addresses computational limitations of previous methods.

Abstract: The local intrinsic dimension (LID) of data is a fundamental quantity in signal processing and learning theory, but quantifying the LID of high-dimensional, complex data has been a historically challenging task. Recent works have discovered that diffusion models capture the LID of data through the spectra of their score estimates and through the rate of change of their density estimates under various noise perturbations. While these methods can accurately quantify LID, they require either many forward passes of the diffusion model or use of gradient computation, limiting their applicability in compute- and memory-constrained scenarios. We show that the LID is a lower bound on the denoising score matching loss, motivating use of the denoising score matching loss as a LID estimator. Moreover, we show that the equivalent implicit score matching loss also approximates LID via the normal dimension and is closely related to a recent LID estimator, FLIPD. Our experiments on a manifold benchmark and with Stable Diffusion 3.5 indicate that the denoising score matching loss is a highly competitive and scalable LID estimator, achieving superior accuracy and memory footprint under increasing problem size and quantization level.

[488] Lost in Serialization: Invariance and Generalization of LLM Graph Reasoners

Daniel Herbst, Lea Karbevska, Divyanshu Kumar, Akanksha Ahuja, Fatemeh Gholamzadeh Nasrabadi, Fabrizio Frasca

Main category: cs.LG

TL;DR: LLM-based graph reasoners lack invariance to graph symmetries, causing robustness issues. Fine-tuning reduces sensitivity to node relabeling but may increase sensitivity to structural and formatting variations, without consistently improving generalization to unseen tasks.

Details

Motivation: Graph reasoners using LLMs are sensitive to graph representation symmetries (node reindexing, edge reordering, formatting changes), raising robustness concerns that need systematic analysis.

Method: Proposed decomposition of graph serializations into node labeling, edge encoding, and syntax; evaluated LLM robustness to variations in these factors using comprehensive benchmarking and novel spectral tasks.

Result: Larger non-fine-tuned models are more robust. Fine-tuning reduces sensitivity to node relabeling but increases sensitivity to structural and format variations, and doesn’t consistently improve performance on unseen tasks.

Conclusion: Fine-tuning provides mixed robustness benefits - improving invariance to node relabeling but potentially harming robustness to other graph representation variations, while failing to enhance generalization to new tasks.

Abstract: While promising, graph reasoners based on Large Language Models (LLMs) lack built-in invariance to symmetries in graph representations. Operating on sequential graph serializations, LLMs can produce different outputs under node reindexing, edge reordering, or formatting changes, raising robustness concerns. We systematically analyze these effects, studying how fine-tuning impacts encoding sensitivity as well generalization on unseen tasks. We propose a principled decomposition of graph serializations into node labeling, edge encoding, and syntax, and evaluate LLM robustness to variations of each of these factors on a comprehensive benchmarking suite. We also contribute a novel set of spectral tasks to further assess generalization abilities of fine-tuned reasoners. Results show that larger (non-fine-tuned) models are more robust. Fine-tuning reduces sensitivity to node relabeling but may increase it to variations in structure and format, while it does not consistently improve performance on unseen tasks.

[489] Category learning in deep neural networks: Information content and geometry of internal representations

Laurent Bonnasse-Gahot, Jean-Pierre Nadal

Main category: cs.LG

TL;DR: Category learning enhances discrimination near category boundaries through neural space expansion, which is theoretically shown to be an optimal outcome of minimizing Bayes cost and maximizing mutual information in artificial neural networks.

Details

Motivation: To extend the theoretical framework of categorical perception from neuroscience to artificial neural networks, explaining why neural representations expand near category boundaries during efficient learning.

Method: Developed a theoretical framework showing that minimizing Bayes cost (cross-entropy loss) maximizes mutual information between categories and neural activities, leading to optimal neural representations with Fisher information matrices that align with category boundaries.

Result: Found that optimal learning induces neural space expansion near decision boundaries, with Fisher information maxima located near (not exactly at) class boundaries. Demonstrated this numerically on toy models and MNIST dataset.

Conclusion: Category learning naturally induces categorical perception through optimal information processing, with neural representations adapting their metric to maximize discrimination near category boundaries, providing a theoretical foundation for this phenomenon in both biological and artificial systems.

Abstract: In humans and other animals, category learning enhances discrimination between stimuli close to the category boundary. This phenomenon, called categorical perception, was also empirically observed in artificial neural networks trained on classification tasks. In previous modeling works based on neuroscience data, we show that this expansion/compression is a necessary outcome of efficient learning. Here we extend our theoretical framework to artificial networks. We show that minimizing the Bayes cost (mean of the cross-entropy loss) implies maximizing the mutual information between the set of categories and the neural activities prior to the decision layer. Considering structured data with an underlying feature space of small dimension, we show that maximizing the mutual information implies (i) finding an appropriate projection space, and, (ii) building a neural representation with the appropriate metric. The latter is based on a Fisher information matrix measuring the sensitivity of the neural activity to changes in the projection space. Optimal learning makes this neural Fisher information follow a category-specific Fisher information, measuring the sensitivity of the category membership. Category learning thus induces an expansion of neural space near decision boundaries. We characterize the properties of the categorical Fisher information, showing that its eigenvectors give the most discriminant directions at each point of the projection space. We find that, unexpectedly, its maxima are in general not exactly at, but near, the class boundaries. Considering toy models and the MNIST dataset, we numerically illustrate how after learning the two Fisher information matrices match, and essentially align with the category boundaries. Finally, we relate our approach to the Information Bottleneck one, and we exhibit a bias-variance decomposition of the Bayes cost, of interest on its own.

[490] QiMeng-SALV: Signal-Aware Learning for Verilog Code Generation

Yang Zhang, Rui Zhang, Jiaming Guo, Lei Huang, Di Huang, Yunpu Zhao, Shuyao Cheng, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Qi Guo, Yunji Chen

Main category: cs.LG

TL;DR: QiMeng-SALV introduces signal-aware learning for Verilog code generation by extracting verified signal-level implementations from partially incorrect modules to provide meaningful functional rewards for RL optimization.

Details

Motivation: The lack of meaningful functional rewards hinders reinforcement learning optimization for producing functionally correct Verilog code, as current approaches operate at the module level rather than fine-grained signal level.

Method: Extracts verified signal-aware implementations from partially incorrect modules by comparing output signals with reference modules, uses AST to identify correct signal-level code segments, and introduces signal-aware DPO optimization on correct signal-level segments.

Result: Achieves state-of-the-art performance on VerilogEval and RTLLM benchmarks, with a 7B parameter model matching DeepSeek v3 671B model performance and significantly outperforming CodeV.

Conclusion: The method enables a paradigm shift from module-level to fine-grained signal-level optimization in Verilog code generation, effectively addressing the issue of insufficient functional rewards.

Abstract: The remarkable progress of Large Language Models (LLMs) presents promising opportunities for Verilog code generation which is significantly important for automated circuit design. The lacking of meaningful functional rewards hinders the preference optimization based on Reinforcement Learning (RL) for producing functionally correct Verilog code. In this paper, we propose Signal-Aware Learning for Verilog code generation (QiMeng-SALV) by leveraging code segments of functionally correct output signal to optimize RL training. Considering Verilog code specifies the structural interconnection of hardware gates and wires so that different output signals are independent, the key insight of QiMeng-SALV is to extract verified signal-aware implementations in partially incorrect modules, so as to enhance the extraction of meaningful functional rewards. Roughly, we verify the functional correctness of signals in generated module by comparing with that of reference module in the training data. Then abstract syntax tree (AST) is employed to identify signal-aware code segments which can provide meaningful functional rewards from erroneous modules. Finally, we introduce signal-aware DPO which is optimized on the correct signal-level code segments, thereby preventing noise and interference from incorrect signals. The proposed QiMeng-SALV underscores the paradigm shift from conventional module-level to fine-grained signal-level optimization in Verilog code generation, addressing the issue of insufficient functional rewards. Experiments demonstrate that our method achieves state-of-the-art performance on VerilogEval and RTLLM, with a 7B parameter model matching the performance of the DeepSeek v3 671B model and significantly outperforming the leading open-source model CodeV trained on the same dataset. Our code is available at https://github.com/zy1xxx/SALV.

[491] g-DPO: Scalable Preference Optimization for Protein Language Models

Constance Ferragu, Jonathan D. Ziegler, Nicolas Deutschmann, Arthur Lindoulsi, Eli Bixby, Cradle ML Team

Main category: cs.LG

TL;DR: g-DPO addresses DPO’s scalability bottleneck by clustering sequences to prune redundant pairs and using group-based approximations, achieving similar performance with 1.7x-5.4x faster convergence.

Details

Motivation: DPO faces scalability issues as training pairs grow quadratically with labeled sequences, leading to prohibitive training times for protein language model alignment.

Method: Uses sequence space clustering to prune redundant pairs while preserving training signal, and amortizes likelihood computations with group-based approximations.

Result: Maintains in silico and in vitro performance statistically indistinguishable from standard DPO while converging 1.7x to 5.4x faster across three protein engineering tasks.

Conclusion: g-DPO provides an efficient framework that scales with dataset size and mutational landscape structure, overcoming DPO’s computational limitations without sacrificing performance.

Abstract: Direct Preference Optimization (DPO) is an effective approach for aligning protein language models with experimental design goals. However, DPO faces a scalability bottleneck: the number of possible training pairs grows quadratically with the number of labeled sequences, leading to prohibitive training times even for modestly sized datasets. We introduce g-DPO, a framework that (i) uses sequence space clustering to prune redundant pairs while preserving training signal, and (ii) amortizes likelihood computations with group-based approximations. Across three protein engineering tasks, g-DPO maintains in silico and in vitro performance that is statistically indistinguishable from standard DPO, while converging 1.7x to 5.4x times faster, with speedups that scale with dataset size and the structure of the underlying mutational landscape.

[492] Empowering Targeted Neighborhood Search via Hyper Tour for Large-Scale TSP

Tongkai Lu, Shuai Ma, Chongyang Tao

Main category: cs.LG

TL;DR: HyperNS method uses hyper tour guidance for large-scale TSP, outperforming existing neural methods by reducing search space through clustering and sparse heatmaps.

Details

Motivation: Neural methods for TSP face scaling challenges with memory constraints on global heatmaps/edge weights, poor initial solutions, and insufficient global guidance for large search spaces.

Method: Hyper Tour Guided Neighborhood Search (HyperNS) divides TSP into clusters using sparse heatmap graphs, abstracts them as supernodes, generates hyper tour to guide initialization and optimization, focusing on relevant edges.

Result: Outperforms existing neural-based methods on synthetic and real-world datasets, especially for larger instances, with significant reduction in gap to optimal solution.

Conclusion: HyperNS provides efficient and effective optimization for large-scale TSP by reducing search space through hyper tour guidance and clustering strategy.

Abstract: Traveling Salesman Problem (TSP) is a classic NP-hard problem that has garnered significant attention from both academia and industry. While neural-based methods have shown promise for solving TSPs, they still face challenges in scaling to larger instances, particularly in memory constraints associated with global heatmaps, edge weights, or access matrices, as well as in generating high-quality initial solutions and insufficient global guidance for efficiently navigating vast search spaces. To address these challenges, we propose a Hyper Tour Guided Neighborhood Search (HyperNS) method for large-scale TSP instances. Inspired by the ``clustering first, route second" strategy, our approach initially divides the TSP instance into clusters using a sparse heatmap graph and abstracts them as supernodes, followed by the generation of a hyper tour to guide both the initialization and optimization processes. This method reduces the search space by focusing on edges relevant to the hyper tour, leading to more efficient and effective optimization. Experimental results on both synthetic and real-world datasets demonstrate that our approach outperforms existing neural-based methods, particularly in handling larger-scale instances, offering a significant reduction in the gap to the optimal solution.

[493] Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding

Jungyeon Koh, Hyun Jong Yang

Main category: cs.LG

TL;DR: A unified framework that jointly optimizes user association and resource allocation to support efficient parallel speculative decoding in mobile edge computing systems, reducing latency by 23.7% on average without accuracy loss.

Details

Motivation: The growing demand for on-device LLM inference requires efficient mobile edge computing solutions, especially in resource-constrained settings. Speculative decoding helps but suffers from communication overhead and asynchronous delays.

Method: Proposed a unified framework for joint optimization of user association and resource allocation (UARA) using multi-agent deep reinforcement learning, evaluated with Sionna simulator.

Result: Achieves up to 28.0% and average 23.7% reduction in end-to-end latency without compromising inference accuracy.

Conclusion: Enables scalable and low-latency LLM services in MEC systems through optimized speculative decoding.

Abstract: The growing demand for on-device large language model (LLM) inference highlights the need for efficient mobile edge computing (MEC) solutions, especially in resource-constrained settings. Speculative decoding offers a promising solution by partitioning token generation between a lightweight draft model on mobile devices and a powerful target model on edge servers, but suffers from communication overhead and asynchronous delays. This paper is the first to propose a unified framework that jointly optimizes user association and resource allocation (UARA) to support efficient parallel speculative decoding. We solve the UARA problem using a multi-agent deep reinforcement learning algorithm. To evaluate our approach under realistic conditions, we conduct experiments using the Sionna simulator. Results show that our method achieves up to 28.0% and an average of 23.7% reduction in end-to-end latency without compromising inference accuracy, enabling scalable and low-latency LLM services in MEC systems.

[494] Uncertainty-Aware Deep Learning Framework for Remaining Useful Life Prediction in Turbofan Engines with Learned Aleatoric Uncertainty

Krishang Sharma

Main category: cs.LG

TL;DR: Novel uncertainty-aware deep learning framework for RUL prediction with Bayesian output layer that learns aleatoric uncertainty, achieving breakthrough critical zone performance on NASA CMAPSS benchmarks.

Details

Motivation: Accurate RUL prediction with uncertainty quantification is critical for aerospace prognostics, but existing CMAPSS-based literature lacks probabilistic modeling for uncertainty learning.

Method: Hierarchical architecture with multi-scale Inception blocks, bidirectional LSTMs, dual-level attention mechanism, and Bayesian output layer that predicts both mean RUL and variance. Comprehensive preprocessing includes condition-aware clustering, wavelet denoising, and feature selection.

Result: Competitive overall performance on CMAPSS FD001-FD004 with RMSE of 16.22, 19.29, 16.84, 19.98. Breakthrough critical zone performance (RUL <= 30 cycles) with RMSE of 5.14, 6.89, 5.27, 7.16 (25-40% improvements). Well-calibrated 95% confidence intervals with 93.5-95.2% coverage.

Conclusion: The framework establishes new benchmarks for safety-critical predictions and enables risk-aware maintenance scheduling previously unattainable in CMAPSS literature through learned uncertainty quantification.

Abstract: Accurate Remaining Useful Life (RUL) prediction coupled with uncertainty quantification remains a critical challenge in aerospace prognostics. This research introduces a novel uncertainty-aware deep learning framework that learns aleatoric uncertainty directly through probabilistic modeling, an approach unexplored in existing CMAPSS-based literature. Our hierarchical architecture integrates multi-scale Inception blocks for temporal pattern extraction, bidirectional Long Short-Term Memory networks for sequential modeling, and a dual-level attention mechanism operating simultaneously on sensor and temporal dimensions. The innovation lies in the Bayesian output layer that predicts both mean RUL and variance, enabling the model to learn data-inherent uncertainty. Comprehensive preprocessing employs condition-aware clustering, wavelet denoising, and intelligent feature selection. Experimental validation on NASA CMAPSS benchmarks (FD001-FD004) demonstrates competitive overall performance with RMSE values of 16.22, 19.29, 16.84, and 19.98 respectively. Remarkably, our framework achieves breakthrough critical zone performance (RUL <= 30 cycles) with RMSE of 5.14, 6.89, 5.27, and 7.16, representing 25-40 percent improvements over conventional approaches and establishing new benchmarks for safety-critical predictions. The learned uncertainty provides well-calibrated 95 percent confidence intervals with coverage ranging from 93.5 percent to 95.2 percent, enabling risk-aware maintenance scheduling previously unattainable in CMAPSS literature.

[495] HardFlow: Hard-Constrained Sampling for Flow-Matching Models via Trajectory Optimization

Zeyang Li, Kaveh Alim, Navid Azizan

Main category: cs.LG

TL;DR: HardFlow: A novel framework that reformulates hard-constrained sampling as trajectory optimization using numerical optimal control to precisely satisfy constraints at terminal time while maintaining sample quality.

Details

Motivation: Existing projection-based methods for hard constraints in generative models are overly restrictive and degrade sample quality by constraining the entire sampling path.

Method: Leverages numerical optimal control to steer sampling trajectories, exploiting flow-matching model structure and model predictive control techniques to transform complex constrained optimization into tractable surrogate problems.

Result: HardFlow substantially outperforms existing methods in both constraint satisfaction and sample quality across robotics planning, PDE boundary control, and text-guided image editing domains.

Conclusion: The trajectory optimization perspective provides a unified framework for hard constraint enforcement that goes beyond simple guidance, enabling constraint satisfaction, distribution shift minimization, and enhanced sample quality.

Abstract: Diffusion and flow-matching have emerged as powerful methodologies for generative modeling, with remarkable success in capturing complex data distributions and enabling flexible guidance at inference time. Many downstream applications, however, demand enforcing hard constraints on generated samples (for example, robot trajectories must avoid obstacles), a requirement that goes beyond simple guidance. Prevailing projection-based approaches constrain the entire sampling path to the constraint manifold, which is overly restrictive and degrades sample quality. In this paper, we introduce a novel framework that reformulates hard-constrained sampling as a trajectory optimization problem. Our key insight is to leverage numerical optimal control to steer the sampling trajectory so that constraints are satisfied precisely at the terminal time. By exploiting the underlying structure of flow-matching models and adopting techniques from model predictive control, we transform this otherwise complex constrained optimization problem into a tractable surrogate that can be solved efficiently and effectively. Furthermore, this trajectory optimization perspective offers significant flexibility beyond mere constraint satisfaction, allowing for the inclusion of integral costs to minimize distribution shift and terminal objectives to further enhance sample quality, all within a unified framework. We provide a control-theoretic analysis of our method, establishing bounds on the approximation error between our tractable surrogate and the ideal formulation. Extensive experiments across diverse domains, including robotics (planning), partial differential equations (boundary control), and vision (text-guided image editing), demonstrate that our algorithm, which we name $\textit{HardFlow}$, substantially outperforms existing methods in both constraint satisfaction and sample quality.

[496] Practical Global and Local Bounds in Gaussian Process Regression via Chaining

Junyi Liu, Stanley Kok

Main category: cs.LG

TL;DR: A chaining-based framework for estimating bounds on expected extreme values in Gaussian process regression, providing both global and local uncertainty quantification without requiring specific input features or posterior variance scaling.

Details

Motivation: Existing uncertainty bounds in GPR require specific input features, rely on posterior mean/variance estimates, or need hyperparameter tuning, limiting robustness and failing to capture global model behavior in expectation.

Method: Proposed chaining-based framework with kernel-specific refinements for RBF and Matérn kernels, avoiding analytical relaxations to improve numerical tightness. Also developed local uncertainty quantification using chaining geometry through partition diameters.

Result: Theoretical bounds are tighter than generic constructions for common kernels. Experimental results show the method outperforms existing approaches on synthetic and real-world datasets.

Conclusion: The proposed framework provides robust uncertainty quantification for GPR without input feature requirements, offering both global and local bounds that adapt to kernel structures and local geometries.

Abstract: Gaussian process regression (GPR) is a popular nonparametric Bayesian method that provides predictive uncertainty estimates and is widely used in safety-critical applications. While prior research has introduced various uncertainty bounds, most existing approaches require access to specific input features, and rely on posterior mean and variance estimates or the tuning of hyperparameters. These limitations hinder robustness and fail to capture the model’s global behavior in expectation. To address these limitations, we propose a chaining-based framework for estimating upper and lower bounds on the expected extreme values over unseen data, without requiring access to specific input features. We provide kernel-specific refinements for commonly used kernels such as RBF and Matérn, in which our bounds are tighter than generic constructions. We further improve numerical tightness by avoiding analytical relaxations. In addition to global estimation, we also develop a novel method for local uncertainty quantification at specified inputs. This approach leverages chaining geometry through partition diameters, adapting to local structures without relying on posterior variance scaling. Our experimental results validate the theoretical findings and demonstrate that our method outperforms existing approaches on both synthetic and real-world datasets.

[497] UniGame: Turning a Unified Multimodal Model Into Its Own Adversary

Zhaolong Su, Wang Lu, Hao Chen, Sharon Li, Jindong Wang

Main category: cs.LG

TL;DR: UniGame is a self-adversarial post-training framework that addresses the inconsistency between understanding and generation in Unified Multimodal Models by using a lightweight perturber to make the generation branch challenge fragile understanding.

Details

Motivation: UMMs exhibit fundamental inconsistency where understanding favors compact embeddings while generation favors reconstruction-rich representations, leading to misaligned decision boundaries, degraded cross-modal coherence, and vulnerability to distributional and adversarial shifts.

Method: UniGame applies a lightweight perturber at the shared token interface to enable the generation branch to actively seek and challenge fragile understanding, turning the model into its own adversary through self-adversarial post-training.

Result: UniGame significantly improves consistency (+4.6%), understanding (+3.6%), generation (+0.02), and robustness on out-of-distribution (+4.8% on NaturalBench) and adversarial scenarios (+6.2% on AdVQA), with less than 1% additional parameters.

Conclusion: Adversarial self-play is a general and effective principle for enhancing coherence, stability, and unified competence of multimodal foundation models, and UniGame is complementary to existing post-training methods.

Abstract: Unified Multimodal Models (UMMs) have shown impressive performance in both understanding and generation with a single architecture. However, UMMs still exhibit a fundamental inconsistency: understanding favors compact embeddings, whereas generation favors reconstruction-rich representations. This structural trade-off produces misaligned decision boundaries, degraded cross-modal coherence, and heightened vulnerability under distributional and adversarial shifts. In this paper, we present UniGame, a self-adversarial post-training framework that directly targets the inconsistencies. By applying a lightweight perturber at the shared token interface, UniGame enables the generation branch to actively seek and challenge fragile understanding, turning the model itself into its own adversary. Experiments demonstrate that UniGame significantly improves the consistency (+4.6%). Moreover, it also achieves substantial improvements in understanding (+3.6%), generation (+0.02), out-of-distribution and adversarial robustness (+4.8% and +6.2% on NaturalBench and AdVQA). The framework is architecture-agnostic, introduces less than 1% additional parameters, and is complementary to existing post-training methods. These results position adversarial self-play as a general and effective principle for enhancing the coherence, stability, and unified competence of future multimodal foundation models. The official code is available at: https://github.com/AIFrontierLab/UniGame

[498] SculptDrug : A Spatial Condition-Aware Bayesian Flow Model for Structure-based Drug Design

Qingsong Zhong, Haomin Yu, Yan Lin, Wangmeng Shen, Long Zeng, Jilin Hu

Main category: cs.LG

TL;DR: SculptDrug is a spatial condition-aware generative model for Structure-Based Drug Design that uses Bayesian flow networks to address challenges in boundary constraints, hierarchical structural conditions, and spatial modeling fidelity.

Details

Motivation: Existing generative models for SBDD face challenges in incorporating boundary condition constraints, integrating hierarchical structural conditions, and ensuring spatial modeling fidelity, which limits their effectiveness in drug discovery.

Method: SculptDrug uses a BFN-based framework with progressive denoising, a Boundary Awareness Block for protein surface constraints, and a Hierarchical Encoder to capture global structural context while preserving fine-grained molecular interactions.

Result: Experimental results on the CrossDocked dataset show that SculptDrug outperforms state-of-the-art baselines, demonstrating the effectiveness of spatial condition-aware modeling.

Conclusion: SculptDrug successfully addresses key limitations in SBDD by incorporating spatial awareness and hierarchical structural conditions, providing a more effective approach for generating geometrically compatible drug ligands.

Abstract: Structure-Based drug design (SBDD) has emerged as a popular approach in drug discovery, leveraging three-dimensional protein structures to generate drug ligands. However, existing generative models encounter several key challenges: (1) incorporating boundary condition constraints, (2) integrating hierarchical structural conditions, and (3) ensuring spatial modeling fidelity. To address these limitations, we propose SculptDrug, a spatial condition-aware generative model based on Bayesian flow networks (BFNs). First, SculptDrug follows a BFN-based framework and employs a progressive denoising strategy to ensure spatial modeling fidelity, iteratively refining atom positions while enhancing local interactions for precise spatial alignment. Second, we introduce a Boundary Awareness Block that incorporates protein surface constraints into the generative process to ensure that generated ligands are geometrically compatible with the target protein. Third, we design a Hierarchical Encoder that captures global structural context while preserving fine-grained molecular interactions, ensuring overall consistency and accurate ligand-protein conformations. We evaluate SculptDrug on the CrossDocked dataset, and experimental results demonstrate that SculptDrug outperforms state-of-the-art baselines, highlighting the effectiveness of spatial condition-aware modeling.

[499] PrefixGPT: Prefix Adder Optimization by a Generative Pre-trained Transformer

Ruogu Ding, Xin Ning, Ulf Schlichtmann, Weikang Qian

Main category: cs.LG

TL;DR: PrefixGPT uses a GPT model to generate optimized prefix adders from scratch, achieving 7.7% improved area-delay product and 79.1% lower average ADP compared to existing methods.

Details

Motivation: Designing optimized prefix adders is challenging due to strict design rules and exponentially large design space, requiring automated solutions.

Method: Represent adder topology as 2D coordinate sequence, use legality mask for valid designs, employ decoder-only Transformer architecture pre-trained on random valid adders then fine-tuned for optimization.

Result: Found new optimal design with 7.7% improved area-delay product and up to 79.1% lower average ADP compared to existing works.

Conclusion: GPT-style models can master complex hardware design principles and apply them for efficient design optimization, demonstrating strong potential for automated hardware design.

Abstract: Prefix adders are widely used in compute-intensive applications for their high speed. However, designing optimized prefix adders is challenging due to strict design rules and an exponentially large design space. We introduce PrefixGPT, a generative pre-trained Transformer (GPT) that directly generates optimized prefix adders from scratch. Our approach represents an adder’s topology as a two-dimensional coordinate sequence and applies a legality mask during generation, ensuring every design is valid by construction. PrefixGPT features a customized decoder-only Transformer architecture. The model is first pre-trained on a corpus of randomly synthesized valid prefix adders to learn design rules and then fine-tuned to navigate the design space for optimized design quality. Compared with existing works, PrefixGPT not only finds a new optimal design with a 7.7% improved area-delay product (ADP) but exhibits superior exploration quality, lowering the average ADP by up to 79.1%. This demonstrates the potential of GPT-style models to first master complex hardware design principles and then apply them for more efficient design optimization.

[500] Self-Organization and Spectral Mechanism of Attractor Landscapes in High-Capacity Kernel Hopfield Networks

Akira Tamamori

Main category: cs.LG

TL;DR: The paper reveals that optimal performance in kernel-based Hopfield networks is achieved through a critical state called “Spectral Concentration,” where the leading eigenvalue is amplified for stability while trailing eigenvalues are preserved for capacity, creating a “Goldilocks zone” between rank collapse and diffusion.

Details

Motivation: To understand the dynamical mechanism behind the enhanced storage capacity of kernel-based Hopfield networks, which remains poorly understood despite empirical success.

Method: Unifying geometric analysis of attractor landscapes with spectral theory of kernel machines, using a novel “Pinnacle Sharpness” metric to analyze attractor stability and identify optimal network configurations.

Result: Discovered a “Ridge of Optimization” phase where networks achieve maximal robustness under high-load conditions, characterized by “Force Antagonism” - a balance between strong driving force and collective feedback force. This arises from “Spectral Concentration” where the leading eigenvalue is amplified for stability while trailing eigenvalues are preserved for capacity.

Conclusion: Optimal performance in high-capacity associative memories is achieved by tuning the system to a spectral “Goldilocks zone” between rank collapse and diffusion, providing a complete physical picture of how these networks form robust memories.

Abstract: Kernel-based learning methods can dramatically increase the storage capacity of Hopfield networks, yet the dynamical mechanism behind this enhancement remains poorly understood. We address this gap by unifying the geometric analysis of the attractor landscape with the spectral theory of kernel machines. Using a novel metric, “Pinnacle Sharpness,” we first uncover a rich phase diagram of attractor stability, identifying a “Ridge of Optimization” where the network achieves maximal robustness under high-load conditions. Phenomenologically, this ridge is characterized by a “Force Antagonism,” where a strong driving force is balanced by a collective feedback force. Theoretically, we reveal that this phenomenon arises from a specific reorganization of the weight spectrum, which we term \textit{Spectral Concentration}. Unlike a simple rank-1 collapse, our analysis shows that the network on the ridge self-organizes into a critical state: the leading eigenvalue is amplified to maximize global stability (Direct Force), while the trailing eigenvalues are preserved to maintain high memory capacity (Indirect Force). These findings provide a complete physical picture of how high-capacity associative memories are formed, demonstrating that optimal performance is achieved by tuning the system to a spectral “Goldilocks zone” between rank collapse and diffusion.

[501] Optimized scheduling of electricity-heat cooperative system considering wind energy consumption and peak shaving and valley filling

Jin Ye, Lingmei Wang, Shujian Zhang, Haihang Wu

Main category: cs.LG

TL;DR: Proposes PVTD3 algorithm for combined power-heat system scheduling, reducing costs by 6.93-13.59% and grid power fluctuations by 12.8% compared to TD3.

Details

Motivation: Address scheduling optimization challenges in combined power-heat systems under renewable energy integration and multiple uncertainties during global energy transition.

Method: Intelligent scheduling method based on improved Dual-Delay Deep Deterministic Policy Gradient (PVTD3) algorithm with penalty term for grid power purchase variations.

Result: PVTD3 reduces comprehensive costs by 6.93%, 12.68%, and 13.59% at 10%, 20%, and 30% renewable penetration; reduces grid power fluctuation by 12.8%; improves energy storage management with reduced low-temperature tank states and safe high-temperature tank operation.

Conclusion: PVTD3 algorithm demonstrates superior economic efficiency, grid stability, and sustainable scheduling capabilities in energy storage management for combined power-heat systems.

Abstract: With the global energy transition and rapid development of renewable energy, the scheduling optimization challenge for combined power-heat systems under new energy integration and multiple uncertainties has become increasingly prominent. Addressing this challenge, this study proposes an intelligent scheduling method based on the improved Dual-Delay Deep Deterministic Policy Gradient (PVTD3) algorithm. System optimization is achieved by introducing a penalty term for grid power purchase variations. Simulation results demonstrate that under three typical scenarios (10%, 20%, and 30% renewable penetration), the PVTD3 algorithm reduces the system’s comprehensive cost by 6.93%, 12.68%, and 13.59% respectively compared to the traditional TD3 algorithm. Concurrently, it reduces the average fluctuation amplitude of grid power purchases by 12.8%. Regarding energy storage management, the PVTD3 algorithm reduces the end-time state values of low-temperature thermal storage tanks by 7.67-17.67 units while maintaining high-temperature tanks within the 3.59-4.25 safety operating range. Multi-scenario comparative validation demonstrates that the proposed algorithm not only excels in economic efficiency and grid stability but also exhibits superior sustainable scheduling capabilities in energy storage device management.

[502] TREASURE: A Transformer-Based Foundation Model for High-Volume Transaction Understanding

Chin-Chia Michael Yeh, Uday Singh Saini, Xin Dai, Xiran Fan, Shubham Jain, Yujie Fan, Jiarui Sun, Junpeng Wang, Menghai Pan, Yingtong Dou, Yuzhong Chen, Vineeth Rakesh, Liang Wang, Yan Zheng, Mahashweta Das

Main category: cs.LG

TL;DR: TREASURE is a transformer-based foundation model for transaction data that captures consumer behavior and payment network signals, improving abnormal behavior detection by 111% and recommendation systems by 104%.

Details

Motivation: Payment networks generate high volumes of transaction data that can enable applications like abnormal behavior detection and hyper-personalized consumer insights to improve people's lives.

Method: Uses a transformer-based architecture with dedicated input modules for static and dynamic attributes, and an efficient training paradigm for predicting high-cardinality categorical attributes.

Result: Increases abnormal behavior detection performance by 111% over production systems and enhances recommendation models by 104%. Verified with industry-grade datasets.

Conclusion: TREASURE serves as both a standalone model and embedding provider, demonstrating comprehensive transaction data modeling capabilities with significant performance improvements over existing systems.

Abstract: Payment networks form the backbone of modern commerce, generating high volumes of transaction records from daily activities. Properly modeling this data can enable applications such as abnormal behavior detection and consumer-level insights for hyper-personalized experiences, ultimately improving people’s lives. In this paper, we present TREASURE, TRansformer Engine As Scalable Universal transaction Representation Encoder, a multipurpose transformer-based foundation model specifically designed for transaction data. The model simultaneously captures both consumer behavior and payment network signals (such as response codes and system flags), providing comprehensive information necessary for applications like accurate recommendation systems and abnormal behavior detection. Verified with industry-grade datasets, TREASURE features three key capabilities: 1) an input module with dedicated sub-modules for static and dynamic attributes, enabling more efficient training and inference; 2) an efficient and effective training paradigm for predicting high-cardinality categorical attributes; and 3) demonstrated effectiveness as both a standalone model that increases abnormal behavior detection performance by 111% over production systems and an embedding provider that enhances recommendation models by 104%. We present key insights from extensive ablation studies, benchmarks against production models, and case studies, highlighting valuable knowledge gained from developing TREASURE.

[503] Enhancing Nuclear Reactor Core Simulation through Data-Based Surrogate Models

Perceval Beja-Battais, Alain Grossetête, Nicolas Vayatis

Main category: cs.LG

TL;DR: The paper introduces two surrogate models for nuclear reactor core simulation to improve Model Predictive Control methods, achieving up to 1000x computational time reduction.

Details

Motivation: To address the need for Nuclear Power Plants to improve flexibility in matching renewable energy growth through enhanced simulation methods.

Method: Developed two surrogate models (data-driven and physics-informed) from nonlinear stiff ODEs as alternative simulation schemes for nuclear reactor core simulation.

Result: Both models can rapidly integrate complex dynamics with very low computational time (up to 1000x time reduction).

Conclusion: Data-driven and physics-informed surrogate models effectively enhance nuclear reactor core simulation for improved MPC methods in nuclear power plant flexibility.

Abstract: In recent years, there has been an increasing need for Nuclear Power Plants (NPPs) to improve flexibility in order to match the rapid growth of renewable energies. The Operator Assistance Predictive System (OAPS) developed by Framatome addresses this problem through Model Predictive Control (MPC). In this work, we aim to improve MPC methods through data-driven simulation schemes. Thus, from a set of nonlinear stiff ordinary differential equations (ODEs), this paper introduces two surrogate models acting as alternative simulation schemes to enhance nuclear reactor core simulation. We show that both data-driven and physics-informed models can rapidly integrate complex dynamics, with a very low computational time (up to 1000x time reduction).

[504] TiCT: A Synthetically Pre-Trained Foundation Model for Time Series Classification

Chin-Chia Michael Yeh, Uday Singh Saini, Junpeng Wang, Xin Dai, Xiran Fan, Jiarui Sun, Yujie Fan, Yan Zheng

Main category: cs.LG

TL;DR: TiCT is a transformer-based foundation model for time series classification that performs in-context learning using only synthetic pre-training data, achieving competitive results without fine-tuning.

Details

Motivation: Address the gap in general-purpose time series foundation models for classification, reducing dependency on expensive labeled data through in-context learning capabilities.

Method: Transformer architecture with bit-based label encoding and output attention mechanism, pre-trained on synthetic data using Mixup-inspired process and data augmentation.

Result: Achieves competitive performance against state-of-the-art supervised methods on UCR Archive benchmarks using only in-context examples at inference.

Conclusion: TiCT demonstrates that effective time series classification can be achieved through synthetic pre-training and in-context learning, eliminating the need for labeled data and model fine-tuning.

Abstract: The ubiquity of time series data creates a strong demand for general-purpose foundation models, yet developing them for classification remains a significant challenge, largely due to the high cost of labeled data. Foundation models capable of in-context learning (ICL) offer a powerful solution, adapting to new tasks with minimal examples and reducing the need for extensive retraining. However, prior work on large-scale time series models has predominantly focused on forecasting, leaving a critical gap for versatile, fine-tuning-free classification. To address this, we introduce TiCT (Time-series in-Context Transformer), a transformer-based model pre-trained exclusively on synthetic data to perform in-context classification. We make two primary technical contributions: 1) a novel architecture featuring a scalable bit-based label encoding and a special output attention mechanism to handle an arbitrary number of classes; and 2) a synthetic pre-training framework that combines a Mixup-inspired process with data augmentation to foster generalization and noise invariance. Extensive evaluations on the UCR Archive show that TiCT achieves competitive performance against state-of-the-art supervised methods. Crucially, this is accomplished using only in-context examples at inference time, without updating a single model weight.

[505] Hard Samples, Bad Labels: Robust Loss Functions That Know When to Back Off

Nicholas Pellegrino, David Szczecina, Paul Fieguth

Main category: cs.LG

TL;DR: The paper proposes two novel robust loss functions (Blurry Loss and Piecewise-zero Loss) that improve label error detection by de-weighting difficult-to-classify samples likely to be mislabeled, outperforming state-of-the-art methods.

Details

Motivation: Mislabeled training data is common in datasets and negatively impacts model performance. Existing error detection methods require well-trained models but training on corrupt data reduces model generalizability, creating a chicken-and-egg problem.

Method: Proposed Blurry Loss and Piecewise-zero Loss that enhance robustness to label errors by de-weighting or disregarding difficult-to-classify samples, leveraging the insight that mislabeled examples are typically harder to classify.

Result: Comprehensive experiments on artificially corrupted datasets show the proposed loss functions outperform state-of-the-art robust loss functions in nearly all cases, achieving superior F1 scores for error detection across both uniform and non-uniform corruption scenarios.

Conclusion: These robust loss functions enable practitioners to more effectively identify, prune, or correct errors in training data, with broad applicability to different label error detection frameworks and corruption types.

Abstract: Incorrectly labelled training data are frustratingly ubiquitous in both benchmark and specially curated datasets. Such mislabelling clearly adversely affects the performance and generalizability of models trained through supervised learning on the associated datasets. Frameworks for detecting label errors typically require well-trained / well-generalized models; however, at the same time most frameworks rely on training these models on corrupt data, which clearly has the effect of reducing model generalizability and subsequent effectiveness in error detection – unless a training scheme robust to label errors is employed. We evaluate two novel loss functions, Blurry Loss and Piecewise-zero Loss, that enhance robustness to label errors by de-weighting or disregarding difficult-to-classify samples, which are likely to be erroneous. These loss functions leverage the idea that mislabelled examples are typically more difficult to classify and should contribute less to the learning signal. Comprehensive experiments on a variety of artificially corrupted datasets demonstrate that the proposed loss functions outperform state-of-the-art robust loss functions in nearly all cases, achieving superior F1 scores for error detection. Further analyses through ablation studies offer insights to confirm these loss functions’ broad applicability to cases of both uniform and non-uniform corruption, and with different label error detection frameworks. By using these robust loss functions, machine learning practitioners can more effectively identify, prune, or correct errors in their training data.

[506] An Adaptive Resonance Theory-based Topological Clustering Algorithm with a Self-Adjusting Vigilance Parameter

Naoki Masuyama, Yuichiro Toda, Yusuke Nojima, Hisao Ishibuchi

Main category: cs.LG

TL;DR: An ART-based topological clustering algorithm with diversity-driven adaptation that autonomously adjusts parameters for hyperparameter-free learning in both stationary and nonstationary environments.

Details

Motivation: To address clustering in dynamic environments where data distributions evolve over time, requiring models that adapt to distributional shifts while preserving learned cluster structures.

Method: Proposes an Adaptive Resonance Theory (ART)-based topological clustering algorithm with diversity-driven adaptation mechanism that autonomously adjusts recalculation interval and vigilance threshold.

Result: Outperforms state-of-the-art methods on 24 real-world datasets in both clustering performance and continual learning capability, effectively mitigating catastrophic forgetting.

Conclusion: The proposed parameter adaptation mechanism enables effective hyperparameter-free learning that maintains cluster stability and continuity in evolving data streams.

Abstract: Clustering in stationary and nonstationary settings, where data distributions remain static or evolve over time, requires models that can adapt to distributional shifts while preserving previously learned cluster structures. This paper proposes an Adaptive Resonance Theory (ART)-based topological clustering algorithm that autonomously adjusts its recalculation interval and vigilance threshold through a diversity-driven adaptation mechanism. This mechanism enables hyperparameter-free learning that maintains cluster stability and continuity in dynamic environments. Experiments on 24 real-world datasets demonstrate that the proposed algorithm outperforms state-of-the-art methods in both clustering performance and continual learning capability. These results highlight the effectiveness of the proposed parameter adaptation in mitigating catastrophic forgetting and maintaining consistent clustering in evolving data streams. Source code is available at https://github.com/Masuyama-lab/IDAT

[507] scipy.spatial.transform: Differentiable Framework-Agnostic 3D Transformations in Python

Martin Schuck, Alexander von Rohr, Angela P. Schoellig

Main category: cs.LG

TL;DR: SciPy’s spatial.transform module has been overhauled to support any Python array library (JAX, PyTorch, CuPy), enabling GPU/TPU execution, JIT compilation, batching, and autodiff while maintaining the established interface.

Details

Motivation: Existing implementations of 3D rigid-body transforms on SO(3) are error-prone due to axis conventions, normalizations, and edge cases. SciPy's implementation was limited to NumPy, restricting use in GPU-accelerated and autodiff workflows.

Method: Complete redesign of SciPy’s spatial.transform functionality to be compatible with Python array API standards, supporting multiple backends while preserving the established interface.

Result: Successfully enabled GPU/TPU execution, JIT compilation, vectorized batching, and native autodiff across different array libraries. Demonstrated through case studies on 3D transform scalability and JAX drone simulation.

Conclusion: The contributions provide a framework-agnostic, production-grade foundation for 3D spatial math in differentiable systems and machine learning, now merged into SciPy main for next release.

Abstract: Three-dimensional rigid-body transforms, i.e. rotations and translations, are central to modern differentiable machine learning pipelines in robotics, vision, and simulation. However, numerically robust and mathematically correct implementations, particularly on SO(3), are error-prone due to issues such as axis conventions, normalizations, composition consistency and subtle errors that only appear in edge cases. SciPy’s spatial$.$transform module is a rigorously tested Python implementation. However, it historically only supported NumPy, limiting adoption in GPU-accelerated and autodiff-based workflows. We present a complete overhaul of SciPy’s spatial$.$transform functionality that makes it compatible with any array library implementing the Python array API, including JAX, PyTorch, and CuPy. The revised implementation preserves the established SciPy interface while enabling GPU/TPU execution, JIT compilation, vectorized batching, and differentiation via native autodiff of the chosen backend. We demonstrate how this foundation supports differentiable scientific computing through two case studies: (i) scalability of 3D transforms and rotations and (ii) a JAX drone simulation that leverages SciPy’s Rotation for accurate integration of rotational dynamics. Our contributions have been merged into SciPy main and will ship in the next release, providing a framework-agnostic, production-grade basis for 3D spatial math in differentiable systems and ML.

[508] Lower Complexity Bounds for Nonconvex-Strongly-Convex Bilevel Optimization with First-Order Oracles

Kaiyi Ji

Main category: cs.LG

TL;DR: This paper develops new hard instances for bilevel optimization in the smooth nonconvex-strongly-convex setting, establishing improved lower bounds for both deterministic and stochastic first-order oracle models.

Details

Motivation: Progress on lower bounds for bilevel optimization has been limited due to the complexity of the bilevel structure, despite widespread study of upper bound guarantees.

Method: The authors develop new hard instances that yield nontrivial lower bounds under deterministic and stochastic first-order oracle models for the smooth nonconvex-strongly-convex setting.

Result: For deterministic case: Ω(κ^{3/2}ε^{-2}) oracle calls required; For stochastic case: Ω(κ^{5/2}ε^{-4}) stochastic oracle calls required. These strengthen known bounds in related settings.

Conclusion: The results expose substantial gaps between current upper and lower bounds for bilevel optimization, suggesting that even simplified regimes warrant further investigation to understand optimal complexity under standard first-order oracles.

Abstract: Although upper bound guarantees for bilevel optimization have been widely studied, progress on lower bounds has been limited due to the complexity of the bilevel structure. In this work, we focus on the smooth nonconvex-strongly-convex setting and develop new hard instances that yield nontrivial lower bounds under deterministic and stochastic first-order oracle models. In the deterministic case, we prove that any first-order zero-respecting algorithm requires at least $Ω(κ^{3/2}ε^{-2})$ oracle calls to find an $ε$-accurate stationary point, improving the optimal lower bounds known for single-level nonconvex optimization and for nonconvex-strongly-convex min-max problems. In the stochastic case, we show that at least $Ω(κ^{5/2}ε^{-4})$ stochastic oracle calls are necessary, again strengthening the best known bounds in related settings. Our results expose substantial gaps between current upper and lower bounds for bilevel optimization and suggest that even simplified regimes, such as those with quadratic lower-level objectives, warrant further investigation toward understanding the optimal complexity of bilevel optimization under standard first-order oracles.

[509] QiMeng-CRUX: Narrowing the Gap between Natural Language and Verilog via Core Refined Understanding eXpression

Lei Huang, Rui Zhang, Jiaming Guo, Yang Zhang, Di Huang, Shuyao Cheng, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Qi Guo, Yunji Chen

Main category: cs.LG

TL;DR: CRUX-V introduces a structured intermediate representation (CRUX) to bridge the gap between ambiguous natural language descriptions and precise Verilog code generation, achieving state-of-the-art performance.

Details

Motivation: Existing HDL generation approaches rely on ambiguous, redundant natural language descriptions that pose challenges for precise Verilog code generation.

Method: Two-stage training framework with Joint Expression Modeling and Dual-Space Optimization, using CRUX as structured intermediate representation between natural language and Verilog.

Result: CRUX-V achieves state-of-the-art performance among general models, particularly on challenging design tasks, and CRUX proves transferable to other code models.

Conclusion: CRUX effectively narrows the gap between free-form natural language and precise Verilog generation through structured intermediate representation.

Abstract: Large language models (LLMs) have shown promising capabilities in hardware description language (HDL) generation. However, existing approaches often rely on free-form natural language descriptions that are often ambiguous, redundant, and unstructured, which poses significant challenges for downstream Verilog code generation. We treat hardware code generation as a complex transformation from an open-ended natural language space to a domain-specific, highly constrained target space. To bridge this gap, we introduce Core Refined Understanding eXpression (CRUX), a structured intermediate space that captures the essential semantics of user intent while organizing the expression for precise Verilog code generation. We further design a two-stage training framework, comprising Joint Expression Modeling and Dual-Space Optimization, to enhance the quality of both CRUX and Verilog code. Experiments across multiple Verilog generation benchmarks demonstrate that our model, CRUX-V, achieves state-of-the-art performance among general models, particularly under challenging design tasks. Furthermore, the CRUX space proves transferable and beneficial when used as input prompts for other code models, highlighting its effectiveness in narrowing the gap between free-form natural language descriptions and precise Verilog generation.

[510] MoRE: Batch-Robust Multi-Omics Representations from Frozen Pre-trained Transformers

Audrey Pei-Hsuan Chen

Main category: cs.LG

TL;DR: MoRE is a framework that repurposes frozen pre-trained transformers for multi-omics integration using parameter-efficient fine-tuning, achieving competitive performance with fewer trainable parameters.

Details

Motivation: Address challenges in multi-omics representation learning including extreme dimensionality, modality heterogeneity, and batch effects, while leveraging pre-trained transformers' generalization capabilities.

Method: Attaches lightweight modality-specific adapters and fusion layer to frozen transformer backbone, optimizing masked modeling with contrastive and batch-invariant alignment losses.

Result: Achieves competitive batch robustness and biological conservation while significantly reducing trainable parameters compared to fully fine-tuned models.

Conclusion: MoRE represents a practical step toward general-purpose omics foundation models through efficient multi-omics integration.

Abstract: Representation learning on multi-omics data is challenging due to extreme dimensionality, modality heterogeneity, and cohort-specific batch effects. While pre-trained transformer backbones have shown broad generalization capabilities in biological sequence modeling, their application to multi-omics integration remains underexplored. We present MoRE (Multi-Omics Representation Embedding), a framework that repurposes frozen pre-trained transformers to align heterogeneous assays into a shared latent space. Unlike purely generative approaches, MoRE employs a parameter-efficient fine-tuning (PEFT) strategy, prioritizing cross-sample and cross-modality alignment over simple sequence reconstruction. Specifically, MoRE attaches lightweight, modality-specific adapters and a task-adaptive fusion layer to the frozen backbone. It optimizes a masked modeling objective jointly with supervised contrastive and batch-invariant alignment losses, yielding structure-preserving embeddings that generalize across unseen cell types and platforms. We benchmark MoRE against established baselines, including scGPT, scVI, and Harmony with Scrublet, evaluating integration fidelity, rare population detection, and modality transfer. Our results demonstrate that MoRE achieves competitive batch robustness and biological conservation while significantly reducing trainable parameters compared to fully fine-tuned models. This work positions MoRE as a practical step toward general-purpose omics foundation models.

[511] Adam Simplified: Bias Correction Debunked

Sam Laing, Antonio Orvieto

Main category: cs.LG

TL;DR: Bias correction in Adam optimizer provides no performance improvement in optimal configurations and can be detrimental without proper learning rate scheduling.

Details

Motivation: To investigate the empirical necessity of bias correction in Adam optimizer, which is often assumed to be essential but poorly understood.

Method: Systematic ablation studies on vision and language modeling tasks, analyzing bias correction’s effects under different hyperparameter configurations and learning rate schedules.

Result: Bias correction shows no improvement in final test performance with optimal hyperparameters, and can harm performance without appropriate learning rate scheduling. It functions as implicit learning rate scheduling dependent on β₁, β₂ parameters.

Conclusion: The universal inclusion of bias correction in Adam is not justified; its effectiveness depends heavily on hyperparameter choices and learning rate scheduling.

Abstract: The Adam optimizer is a cornerstone of modern deep learning, yet the empirical necessity of each of its individual components is often taken for granted. This paper presents a focused investigation into the role of bias-correction, a feature whose contribution remains poorly understood. Through a series of systematic ablations on vision and language modelling tasks, we demonstrate that the conventional wisdom surrounding bias correction is misleading. In particular, we demonstrate that in the optimal hyper-parameter configuration, the inclusion of bias correction leads to no improvement in final test performance. Moreover, unless appropriate learning rate scheduling is implemented, the inclusion of bias correction can sometimes be detrimental to performance. We further reinterpret bias correction as a form of implicit learning rate scheduling whose behaviour is strongly dependent on the choice of smoothing hyper-parameters $β_1, β_2 \in [0,1)$. Our findings challenge the universal inclusion of this component.

cs.MA

[512] MTTR-A: Measuring Cognitive Recovery Latency in Multi-Agent Systems

Barak Or

Main category: cs.MA

TL;DR: The paper adapts classical reliability metrics (MTTR, MTBF) to measure cognitive recovery in multi-agent systems, introducing MTTR-A as a runtime measure for quantifying how quickly agentic workflows recover from reasoning drift.

Details

Motivation: Existing observability tools monitor system outputs but cannot quantify how rapidly agentic workflows recover once reasoning coherence is lost in autonomous multi-agent systems.

Method: Adapted classical reliability metrics into cognitive domain, defined MTTR-A as runtime measure of cognitive recovery latency. Conducted benchmark simulation using AG News corpus and LangGraph framework, modeling recovery latencies across reflex modes.

Result: Automated reflexes restored stability within ~6s on average, human-approval interventions required ~12s. Across 200 runs: median MTTR-A=6.21±2.14s, MTBF=6.7±2.14s, NRR=0.08, demonstrating measurable runtime resilience across reflex strategies.

Conclusion: Formalizes recovery latency as quantifiable property of distributed reasoning, establishing foundation for runtime dependability in agentic cognition and transforming cognitive recovery from ad-hoc process to standardized performance metric.

Abstract: Ensuring cognitive stability in autonomous multi-agent systems (MAS) is a central challenge for large-scale, distributed AI. While existing observability tools monitor system outputs, they cannot quantify how rapidly agentic workflows recover once reasoning coherence has been lost. We adapt classical reliability metrics-Mean Time-to-Recovery (MTTR), Mean Time Between Failures (MTBF), and related ratios-into the cognitive domain, defining MTTR-A (Mean Time-to-Recovery for Agentic Systems) as a runtime measure of cognitive recovery latency. MTTR-A quantifies the time required for a MAS to detect reasoning drift and restore consistent operation, capturing the recovery of reasoning coherence rather than infrastructural repair. A benchmark simulation using the AG~News corpus and the LangGraph orchestration framework was conducted, modeling recovery latencies across multiple reflex modes. Automated reflexes restored stability within approximately 6s on average, while human-approval interventions required about 12s. Across 200 runs, the median simulated MTTR-A was 6.21+-2.14s, MTBF=6.7+-2.14s, and NRR=0.08, demonstrating measurable runtime resilience across reflex strategies. By formalizing recovery latency as a quantifiable property of distributed reasoning-and deriving reliability bounds linking recovery time and cognitive uptime-this work establishes a foundation for runtime dependability in agentic cognition, transforming cognitive recovery from an ad-hoc process into a standardized, interpretable performance

[513] Resilient Charging Infrastructure via Decentralized Coordination of Electric Vehicles at Scale

Chuhao Qin, Alexandru Sorici, Andrei Olaru, Evangelos Pournaras, Adina Magda Florea

Main category: cs.MA

TL;DR: A collective learning framework for EV charging coordination that balances individual comfort against system efficiency, outperforming baselines in reducing travel and queuing time under contingencies.

Details

Motivation: Existing decentralized EV charging approaches struggle under severe contingencies like station outages or charging request surges, creating competition for limited slots and long queues that reduce driver comfort.

Method: Proposed a collective learning-based coordination framework where EVs are recommended adaptive charging behaviors that shift priority between comfort and efficiency, achieving Pareto-optimal trade-offs under varying conditions.

Result: Experiments with real-world data show the approach outperforms baselines, significantly reducing travel and queuing time. EVs that behave selfishly or altruistically at appropriate moments achieve shorter waiting times than those with consistent moderate behavior.

Conclusion: The framework demonstrates improved resilience and trustworthiness of decentralized EV charging infrastructure under high fractions of station outages and adversarial EVs.

Abstract: The rapid adoption of electric vehicles (EVs) introduces major challenges for decentralized charging control. Existing decentralized approaches efficiently coordinate a large number of EVs to select charging stations while reducing energy costs, preventing power peak and preserving driver privacy. However, they often struggle under severe contingencies, such as station outages or unexpected surges in charging requests. These situations create competition for limited charging slots, resulting in long queues and reduced driver comfort. To address these limitations, we propose a novel collective learning-based coordination framework that allows EVs to balance individual comfort on their selections against system-wide efficiency, i.e., the overall queues across all stations. In the framework, EVs are recommended for adaptive charging behaviors that shift priority between comfort and efficiency, achieving Pareto-optimal trade-offs under varying station capacities and dynamic spatio-temporal EV distribution. Experiments using real-world data from EVs and charging stations show that the proposed approach outperforms baseline methods, significantly reducing travel and queuing time. The results reveal that, under uncertain charging conditions, EV drivers that behave selfishly or altruistically at the right moments achieve shorter waiting time than those maintaining moderate behavior throughout. Our findings under high fractions of station outages and adversarial EVs further demonstrate improved resilience and trustworthiness of decentralized EV charging infrastructure.

[514] Tool-RoCo: An Agent-as-Tool Self-organization Large Language Model Benchmark in Multi-robot Cooperation

Ke Zhang, Xiaoning Zhao, Ce Zheng, Jiahong Ning, Dandan Zhu, Wenqi Zhang, Chen Sun, Toshiharu Sugawara

Main category: cs.MA

TL;DR: Tool-RoCo is a benchmark for evaluating LLM autonomy and cooperation in multi-agent systems using tool usage as a proxy for coordination, with four paradigms testing different autonomy levels.

Details

Motivation: Current LLM-based multi-agent systems rely on predefined orchestration and ignore agent autonomy. There's a need to systematically evaluate how LLMs can cooperate and self-organize in multi-agent scenarios.

Method: Proposes Tool-RoCo benchmark based on RoCo multi-robot cooperative benchmark. Uses tool usage to evaluate cooperation: agents select tools from candidate sets, receive feedback, and adjust selections. Tests four LLM paradigms: centralized cooperation, centralized self-organization, decentralized cooperation, and self-organization across three multi-robot tasks (SORT, PACK, CABINET).

Result: Cooperative tools accounted for only 7.09% of all tools, showing LLM-based agents rarely invoke others as assistants. Activation tools accounted for 96.42%, indicating LLMs tend to maintain active agents and seldom deactivate them for adaptive coordination.

Conclusion: Tool-RoCo provides a systematic benchmark to evaluate LLM autonomy and cooperation in multi-agent tasks, revealing current limitations in cooperative behavior and adaptive coordination among LLM-based agents.

Abstract: This study proposes Tool-RoCo, a novel benchmark for evaluating large language models (LLMs) in long-term multi-agent cooperation based on RoCo, a multi-robot cooperative benchmark. Recent research on LLM-based multi-agent systems has relied on predefined orchestration, while ignoring agent autonomy. Tool-RoCo treats other agents as tools and introduces cooperative tools, leveraging tool usage to evaluate multi-agent cooperation and self-organization. Tool usage means that each agent (LLM) selects a tool from a candidate set based on the current state, receives feedback, and adjusts its selection in subsequent rounds. To evaluate different autonomy levels, we propose four LLM paradigms: (1) centralized cooperation, where a single LLM allocates tools to all agents; (2) centralized self-organization, where a central LLM autonomously activates agents while keeping others inactive; (3) decentralized cooperation, where each agent has its own LLM and calls tools based on local information; and (4) self-organization, where a randomly chosen initial agent can request collaboration, activating additional agents via tool calls. Tool-RoCo includes three multi-robot tasks, SORT, PACK, and CABINET, to measure format and parameter accuracy and agent coordination through tool usage. The results using several LLMs showed that cooperative tools accounted for only 7.09% of all tools, indicating that LLM-based agents rarely invoked others as assistants. Moreover, activation tools accounted for 96.42%, suggesting that current LLMs tend to maintain active agents while seldom deactivating them for adaptive coordination. Tool-RoCo provides a systematic benchmark to evaluate LLM autonomy and cooperation in multi-agent tasks. Code and Demo: https://github.com/ColaZhang22/Tool-Roco

[515] BAMAS: Structuring Budget-Aware Multi-Agent Systems

Liming Yang, Junyu Luo, Xuanzhe Liu, Yiling Lou, Zhenpeng Chen

Main category: cs.MA

TL;DR: BAMAS is a budget-aware multi-agent system that optimizes LLM selection and collaboration topology to reduce costs by up to 86% while maintaining performance.

Details

Motivation: Existing multi-agent systems rarely address budget constraints, making cost an important consideration for practical deployment as systems scale in complexity.

Method: BAMAS uses Integer Linear Programming to select optimal LLMs balancing performance and cost, then employs reinforcement learning to determine interaction topology, and finally instantiates the system based on these selections.

Result: BAMAS achieves comparable performance to state-of-the-art methods while reducing costs by up to 86% across three representative tasks.

Conclusion: BAMAS provides an effective approach for building cost-efficient multi-agent systems that maintain performance while significantly reducing deployment costs.

Abstract: Large language model (LLM)-based multi-agent systems have emerged as a powerful paradigm for enabling autonomous agents to solve complex tasks. As these systems scale in complexity, cost becomes an important consideration for practical deployment. However, existing work rarely addresses how to structure multi-agent systems under explicit budget constraints. In this paper, we propose BAMAS, a novel approach for building multi-agent systems with budget awareness. BAMAS first selects an optimal set of LLMs by formulating and solving an Integer Linear Programming problem that balances performance and cost. It then determines how these LLMs should collaborate by leveraging a reinforcement learning-based method to select the interaction topology. Finally, the system is instantiated and executed based on the selected agents and their collaboration topology. We evaluate BAMAS on three representative tasks and compare it with state-of-the-art agent construction methods. Results show that BAMAS achieves comparable performance while reducing cost by up to 86%.

[516] Policy Optimization and Multi-agent Reinforcement Learning for Mean-variance Team Stochastic Games

Junkai Hu, Li Xia

Main category: cs.MA

TL;DR: The paper studies mean-variance team stochastic games where agents share a common mean-variance objective but act independently. It addresses challenges of non-additive variance metrics and non-stationarity from simultaneous policy updates.

Details

Motivation: MV-TSG faces two main challenges: variance metric is neither additive nor Markovian in dynamic settings, and simultaneous policy updates create non-stationary environments, making dynamic programming inapplicable.

Method: Proposes sensitivity-based optimization approach with Mean-Variance Multi-Agent Policy Iteration (MV-MAPI) using sequential updates, and extends to reinforcement learning with MV-MATRPO algorithm using trust region methods.

Result: Proves existence of deterministic Nash policy, convergence to first-order stationary points, and derives conditions for stationary points to be Nash equilibria and strict local optima. Numerical experiments on microgrid energy management demonstrate effectiveness.

Conclusion: The proposed sensitivity-based approach and algorithms successfully address MV-TSG challenges, providing theoretical guarantees and practical solutions for multi-agent systems with mean-variance objectives.

Abstract: We study a long-run mean-variance team stochastic game (MV-TSG), where each agent shares a common mean-variance objective for the system and takes actions independently to maximize it. MV-TSG has two main challenges. First, the variance metric is neither additive nor Markovian in a dynamic setting. Second, simultaneous policy updates of all agents lead to a non-stationary environment for each individual agent. Both challenges make dynamic programming inapplicable. In this paper, we study MV-TSGs from the perspective of sensitivity-based optimization. The performance difference and performance derivative formulas for joint policies are derived, which provide optimization information for MV-TSGs. We prove the existence of a deterministic Nash policy for this problem. Subsequently, we propose a Mean-Variance Multi-Agent Policy Iteration (MV-MAPI) algorithm with a sequential update scheme, where individual agent policies are updated one by one in a given order. We prove that the MV-MAPI algorithm converges to a first-order stationary point of the objective function. By analyzing the local geometry of stationary points, we derive specific conditions for stationary points to be (local) Nash equilibria, and further, strict local optima. To solve large-scale MV-TSGs in scenarios with unknown environmental parameters, we extend the idea of trust region methods to MV-MAPI and develop a multi-agent reinforcement learning algorithm named Mean-Variance Multi-Agent Trust Region Policy Optimization (MV-MATRPO). We derive a performance lower bound for each update of joint policies. Finally, numerical experiments on energy management in multiple microgrid systems are conducted.

[517] MAPF-HD: Multi-Agent Path Finding in High-Density Environments

Hiroya Makino, Seigo Ito

Main category: cs.MA

TL;DR: Proposes MAPF-HD framework and PHANS method for efficient multi-agent path finding in high-density environments, solving problems in seconds for large grids with 700+ cells.

Details

Motivation: Traditional MAPF methods become impractical in high-density environments due to excessive computation times (tens to hundreds of seconds), limiting scalability for real-world applications like warehouses and traffic management.

Method: PHANS (phased null-agent swapping) method uses heuristic approach to incrementally swap positions between agents and empty vertices, avoiding the computational complexity of ILP-based methods.

Result: Achieves solution times within a few seconds even for large environments with over 700 cells, significantly faster than ILP-based approaches.

Conclusion: The proposed method enables practical deployment of MAPF in high-density scenarios for applications like warehouse logistics, traffic management, and crowd control.

Abstract: Multi-agent path finding (MAPF) involves planning efficient paths for multiple agents to move simultaneously while avoiding collisions. In typical warehouse environments, agents are often sparsely distributed along aisles; however, increasing the agent density can improve space efficiency. When the agent density is high, it becomes necessary to optimize the paths not only for goal-assigned agents but also for those obstructing them. This study proposes a novel MAPF framework for high-density environments (MAPF-HD). Several studies have explored MAPF in similar settings using integer linear programming (ILP). However, ILP-based methods require substantial computation time to optimize all agent paths simultaneously. Even in small grid-based environments with fewer than $100$ cells, these computations can take tens to hundreds of seconds. Such high computational costs render these methods impractical for large-scale applications such as automated warehouses and valet parking. To address these limitations, we introduce the phased null-agent swapping (PHANS) method. PHANS employs a heuristic approach to incrementally swap positions between agents and empty vertices. This method solves the MAPF-HD problem within a few seconds, even in large environments containing more than $700$ cells. The proposed method has the potential to improve efficiency in various real-world applications such as warehouse logistics, traffic management, and crowd control. The implementation is available at https://github.com/ToyotaCRDL/MAPF-in-High-Density-Envs.

cs.MM

[518] Prompt-Aware Adaptive Elastic Weight Consolidation for Continual Learning in Medical Vision-Language Models

Ziyuan Gao, Philippe Morel

Main category: cs.MM

TL;DR: PA-EWC is a novel continual learning method that prevents catastrophic forgetting in medical AI by using prompt-guided parameter specialization, achieving up to 17.58% reduction in forgetting across diverse medical imaging datasets.

Details

Motivation: Medical AI systems face catastrophic forgetting when deployed in clinical settings, where models must learn new imaging protocols while retaining prior diagnostic capabilities, especially challenging for vision-language models that preserve complex cross-modal alignments.

Method: Systematically categorizes model parameters based on functional roles (visual-descriptive, spatial-guided, medical-semantic), uses adaptive Fisher Information computation with gradient stability analysis, and develops weighted complexity metrics based on medical terminology density.

Result: Reduces catastrophic forgetting by up to 17.58% compared to baselines, with 4.30% improvement on chest X-ray pathology localization and 6.06% on polyp segmentation across five medical imaging datasets (Kvasir-SEG, ISIC 2018, CheXlocalize, BUSI, CAMUS).

Conclusion: PA-EWC effectively addresses catastrophic forgetting in medical AI through prompt-aware parameter specialization, enabling models to adapt to new clinical requirements while preserving critical diagnostic knowledge across diverse imaging modalities.

Abstract: Medical AI systems face catastrophic forgetting when deployed in clinical settings, where models must learn new imaging protocols while retaining prior diagnostic capabilities. This challenge is particularly acute for medical vision-language models that must preserve complex cross-modal alignments between medical images and clinical terminology across diverse imaging modalities. We introduce Prompt- Aware Adaptive Elastic Weight Consolidation (PA-EWC), a novel continual learning approach that addresses catastrophic forgetting through prompt-guided parameter specialization. Our method systematically categorizes model parameters based on their functional roles in processing visual-descriptive, spatial-guided, and medical-semantic information, enabling targeted protection of critical knowledge while allowing adaptation to new clinical requirements. PA-EWC incorporates adaptive Fisher Information computation with gradient stability analysis and develops weighted complexity metrics based on medical terminology density. We evaluate our approach across five medical imaging datasets (Kvasir-SEG, ISIC 2018, CheXlocalize, BUSI, CAMUS) representing diverse modalities including endoscopy, dermoscopy, radiography, and ultrasound. Experimental results demonstrate that PA-EWC reduces catastrophic forgetting by up to 17.58% compared to baseline methods, with performance improvements of 4.30% on chest X-ray pathology localization and 6.06% on polyp segmentation.

[519] AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control

Xinyue Guo, Xiaoran Yang, Lipan Zhang, Jianxuan Yang, Zhao Wang, Jian Luan

Main category: cs.MM

TL;DR: AV-Edit is a generative sound effect editing framework that enables fine-grained audio editing in videos by leveraging visual, audio, and text semantics through multimodal pre-training and diffusion transformers.

Details

Motivation: Existing sound effect editing approaches rely solely on low-level signal processing or coarse text prompts, resulting in limited flexibility and suboptimal audio quality.

Method: Uses contrastive audio-visual masking autoencoder (CAV-MAE-Edit) for multimodal pre-training, then trains an editorial Multimodal Diffusion Transformer (MM-DiT) with correlation-based feature gating to remove irrelevant sounds and generate missing audio elements.

Result: Generates high-quality audio with precise modifications based on visual content, achieving state-of-the-art performance in sound effect editing and strong competitiveness in audio generation.

Conclusion: AV-Edit effectively addresses the limitations of existing approaches by jointly leveraging multimodal semantics for fine-grained sound effect editing in videos.

Abstract: Sound effect editing-modifying audio by adding, removing, or replacing elements-remains constrained by existing approaches that rely solely on low-level signal processing or coarse text prompts, often resulting in limited flexibility and suboptimal audio quality. To address this, we propose AV-Edit, a generative sound effect editing framework that enables fine-grained editing of existing audio tracks in videos by jointly leveraging visual, audio, and text semantics. Specifically, the proposed method employs a specially designed contrastive audio-visual masking autoencoder (CAV-MAE-Edit) for multimodal pre-training, learning aligned cross-modal representations. These representations are then used to train an editorial Multimodal Diffusion Transformer (MM-DiT) capable of removing visually irrelevant sounds and generating missing audio elements consistent with video content through a correlation-based feature gating training strategy. Furthermore, we construct a dedicated video-based sound editing dataset as an evaluation benchmark. Experiments demonstrate that the proposed AV-Edit generates high-quality audio with precise modifications based on visual content, achieving state-of-the-art performance in the field of sound effect editing and exhibiting strong competitiveness in the domain of audio generation.

[520] PixelatedScatter: Arbitrary-level Visual Abstraction for Large-scale Multiclass Scatterplots

Ziheng Guo, Tianxiang Wei, Zeyu Li, Lianghao Zhang, Sisi Li, Jiawan Zhang

Main category: cs.MM

TL;DR: Proposes a visual abstraction method for large-scale scatterplots that better preserves features in medium-to-low density regions through iso-density partitioning, pixel allocation, and distribution reconstruction.

Details

Motivation: Current scatterplot abstraction methods lose features in medium-to-low density regions, and overdraw is inevitable in large-scale scatterplots.

Method: Three-step approach: 1) Partition scatterplot into iso-density regions and equalize visual density, 2) Allocate pixels for different classes within each region, 3) Reconstruct data distribution based on pixels.

Result: User studies, quantitative and qualitative evaluations show the method better preserves features compared to previous methods, with special advantage for ultra-high dynamic range data distributions.

Conclusion: The proposed visual abstraction method provides better feature preservation across arbitrary abstraction levels for large-scale scatterplots, particularly in medium-to-low density regions.

Abstract: Overdraw is inevitable in large-scale scatterplots. Current scatterplot abstraction methods lose features in medium-to-low density regions. We propose a visual abstraction method designed to provide better feature preservation across arbitrary abstraction levels for large-scale scatterplots, particularly in medium-to-low density regions. The method consists of three closely interconnected steps: first, we partition the scatterplot into iso-density regions and equalize visual density; then, we allocate pixels for different classes within each region; finally, we reconstruct the data distribution based on pixels. User studies, quantitative and qualitative evaluations demonstrate that, compared to previous methods, our approach better preserves features and exhibits a special advantage when handling ultra-high dynamic range data distributions.

eess.AS

[521] Towards Audio Token Compression in Large Audio Language Models

Saurabhchand Bhati, Samuel Thomas, Hilde Kuehne, Rogerio Feris, James Glass

Main category: eess.AS

TL;DR: The paper proposes methods to compress audio tokens in Large Audio Language Models (LALMs) to address scalability issues caused by quadratic attention complexity and high audio token rates, enabling deployment on resource-constrained platforms.

Details

Motivation: LALMs face scalability limitations due to quadratic attention complexity and high audio token rates, making it difficult to handle long-form audio and deploy on edge devices.

Method: The authors use unsupervised segmentation and uniform average pooling to reduce audio tokens before the LLM decoder, and employ low-rank adapters to fine-tune the model to mitigate performance degradation from compression.

Result: Experimental results show compressed LALMs achieve performance close to frame-level LALMs while reducing input audio token count by up to three times before the LLM backbone.

Conclusion: The proposed compression techniques effectively address LALM scalability issues while maintaining performance, making them suitable for long-form audio processing and edge device deployment.

Abstract: Large Audio Language Models (LALMs) demonstrate impressive performance across diverse tasks, ranging from speech recognition to general audio understanding. However, their scalability is limited by the quadratic complexity of attention and the high token rates of audio signals. These challenges make it difficult to extend LALMs to long-form audio and to deploy them on resource-constrained platforms such as edge devices. In this paper, we explore techniques such as unsupervised segmentation, uniform average pooling, etc., to reduce the number of audio tokens generated by the LALM’s audio encoder but before they are consumed by the LLM decoder. To mitigate potential performance degradation introduced by the compressed representations, we employ low-rank adapters to finetune the model. We evaluate our proposed models on two tasks, automatic speech recognition and speech-to-speech translation tasks, that are dependent on effectively uncovering the underlying lexical content of the input signal and study the effect of downsampling on these tasks. Experimental results show that compressed LALMs can achieve performance closer to frame-level LALMs while reducing the input audio token count upto three times before the LLM backbone.

[522] RosettaSpeech: Zero-Shot Speech-to-Speech Translation from Monolingual Data

Zhisheng Zheng, Xiaohang Sun, Tuan Dinh, Abhishek Yanamandra, Abhinav Jain, Zhu Liu, Sunil Hadap, Vimal Bhat, Manoj Aggarwal, Gerard Medioni, David Harwath

Main category: eess.AS

TL;DR: RosettaSpeech is a zero-shot speech-to-speech translation framework that eliminates the need for parallel speech corpora by using monolingual speech-text data with machine translation supervision, achieving state-of-the-art performance.

Details

Motivation: The scarcity of parallel speech corpora critically hampers speech-to-speech translation, forcing reliance on complex multi-stage pipelines. This work aims to simplify S2ST by reducing dependency on hard-to-acquire parallel speech data.

Method: Uses text as an intermediate bridge during training with monolingual speech-text data augmented by machine translation supervision, but functions as direct end-to-end speech-to-speech model at inference without needing parallel speech pairs.

Result: Achieves state-of-the-art results: ASR-BLEU score of 25.17 for German-to-English and 29.86 for Spanish-to-English (27% and 14% relative gains). Single model delivers strong many-to-one translation (FR/ES/DE -> EN).

Conclusion: By prioritizing abundant parallel text over scarce parallel speech, RosettaSpeech offers a scalable path to creating high-quality, speaker-preserving S2ST for broader language coverage.

Abstract: The scarcity of parallel speech corpora critically hampers speech-to-speech translation (S2ST), often forcing reliance on complex, multi-stage pipelines. This paper introduces RosettaSpeech, a novel and simplified framework for zero-shot S2ST that is trained on monolingual speech-text data augmented by machine translation supervision. While our method leverages the linguistic knowledge inherent in text-based NMT models, it strictly eliminates the need for parallel speech-to-speech pairs. Our model uniquely uses text as an intermediate bridge during training but functions as a direct, end-to-end speech-to-speech model at inference. This streamlined approach achieves state-of-the-art results on standard benchmarks. For instance, on the CVSS-C test set, RosettaSpeech outperforms leading systems, achieving an ASR-BLEU score of 25.17 for German-to-English and 29.86 for Spanish-to-English-relative gains of over 27% and 14%, respectively. Furthermore, we demonstrate that a single model can deliver strong many-to-one translation performance (FR/ES/DE -> EN). We also provide a foundational analysis of how training data scaling impacts model performance. By prioritizing reliance on abundant parallel text rather than difficult-to-acquire parallel speech, RosettaSpeech offers a scalable path to creating high-quality, speaker-preserving S2ST for a much broader array of languages.

[523] Evaluation of an ITD-to-ILD Transformation as a Method to Restore the Spatial Benefit in Speech Intelligibility in Hearing Impaired Listeners

Timm-Jonas Bäumer, Johannes W. de Vries, Stephan Töpken, Richard C. Hendriks, Peyman Goli, Steven van de Par

Main category: eess.AS

TL;DR: Transforming low-frequency ITDs into ILDs can restore binaural benefits for hearing impaired listeners, improving speech intelligibility in complex listening environments.

Details

Motivation: Hearing impaired listeners often have limited sensitivity to ITDs, which reduces their speech intelligibility in everyday situations. The study aims to investigate if transforming ITDs into ILDs could reintroduce binaural benefits.

Method: Two experiments: 1) Measured ITD sensitivity thresholds using binaurally phase-shifted sinusoids at different frequencies; 2) Measured Speech Reception Thresholds (SRTs) using manipulated Head-Related Transfer Functions (HRTFs) in different binaural configurations.

Result: Removing ITDs decreased SRTs by ~1 dB. Substituting low-frequency ITDs with ILDs improved performance for lateral target speakers. Adding low-frequency ILDs while preserving ITDs significantly improved performance for speakers in all directions.

Conclusion: The transformation of low-frequency ITDs into ILDs is effective in restoring binaural benefits for HI listeners and should be implemented in hearing aids and cochlear implants.

Abstract: To improve speech intelligibility in complex everyday situations, the human auditory system partially relies on Interaural Time Differences (ITDs) and Interaural Level Differences (ILDs). However, hearing impaired (HI) listeners often exhibit limited sensitivity to ITDs, resulting in decreased speech intelligibility performance. This study aimed to investigate whether transforming low-frequency ITDs into ILDs could reintroduce a binaural benefit for HI listeners. We conducted two experiments with HI listeners. The first experiment used binaurally phase-shifted sinusoids at different frequencies to evaluate the HI listeners ITD sensitivity threshold. All subjects had an increased ITD threshold at higher frequencies, with different ITD sensitivities between the subjects in the lower frequencies. In the second experiment, Speech Reception Thresholds (SRTs) were measured in different binaural configurations by manipulating Head-Related Transfer Functions (HRTFs). The results showed that, despite the decreased ITD sensitivity, removing ITDs decreased SRTs by approximately 1 dB compared to the unprocessed baseline, where ITDs and ILDs are available. Furthermore, substituting low-frequency ITDs with ILDs yielded an improvement for a lateral target speaker. Adding the low-frequency ILDs while preserving the ITDs caused a significant improvement for speakers in all directions. These findings suggest that the proposed transformation method could be effective in restoring binaural benefits in HI listeners. The results of this study suggest the use of such transformation techniques to be implemented in hearing aids and cochlear implants, directly benefiting HI listeners.

[524] The Spheres Dataset: Multitrack Orchestral Recordings for Music Source Separation and Information Retrieval

Jaime Garcia-Martinez, David Diaz-Guerra, John Anderson, Ricardo Falcon-Perez, Pablo Cabañas-Molero, Tuomas Virtanen, Julio J. Carabias-Orti, Pedro Vera-Candeas

Main category: eess.AS

TL;DR: The Spheres dataset provides multitrack orchestral recordings with 23 microphone channels for classical music source separation research, featuring Tchaikovsky and Mozart works plus chromatic scales and solo excerpts.

Details

Motivation: To advance machine learning research in music source separation and related MIR tasks within the classical music domain, which lacks comprehensive datasets with controlled recording conditions and isolated stems.

Method: Created a dataset using 23 microphones (close spot, main, ambient) during professional recordings of orchestral pieces by Colibrì Ensemble, capturing isolated stems with controlled bleeding and estimating room impulse responses for acoustic characterization.

Result: Baseline evaluations using X-UMX models show both potential and challenges for orchestral family separation and microphone debleeding, demonstrating the dataset’s utility for benchmarking separation tasks.

Conclusion: The Spheres dataset provides valuable resources for advancing source separation, localization, dereverberation, and immersive rendering research in classical music through its comprehensive multitrack recordings and acoustic analysis.

Abstract: This paper introduces The Spheres dataset, multitrack orchestral recordings designed to advance machine learning research in music source separation and related MIR tasks within the classical music domain. The dataset is composed of over one hour recordings of musical pieces performed by the Colibrì Ensemble at The Spheres recording studio, capturing two canonical works - Tchaikovsky’s Romeo and Juliet and Mozart’s Symphony No. 40 - along with chromatic scales and solo excerpts for each instrument. The recording setup employed 23 microphones, including close spot, main, and ambient microphones, enabling the creation of realistic stereo mixes with controlled bleeding and providing isolated stems for supervised training of source separation models. In addition, room impulse responses were estimated for each instrument position, offering valuable acoustic characterization of the recording space. We present the dataset structure, acoustic analysis, and baseline evaluations using X-UMX based models for orchestral family separation and microphone debleeding. Results highlight both the potential and the challenges of source separation in complex orchestral scenarios, underscoring the dataset’s value for benchmarking and for exploring new approaches to separation, localization, dereverberation, and immersive rendering of classical music.

eess.IV

[525] A Fractional Variational Approach to Spectral Filtering Using the Fourier Transform

Nelson H. T. Lemes, José Claudinei Ferreira, Higor V. M. Ferreira

Main category: eess.IV

TL;DR: A variational method using fractional derivatives in the frequency domain for Raman spectrum denoising, optimized via Shannon entropy to balance noise removal and feature preservation.

Details

Motivation: Fluorescence interference and noise obscure critical spectral features in Raman analysis, requiring effective denoising while preserving essential chemical information.

Method: Minimizes a functional with fractional derivatives, reformulated in frequency domain via Fourier transform, with optimization of regularization parameter and derivative order using Shannon entropy.

Result: The method produces an efficient, robust filter that effectively removes noise while preserving peak position, intensity, and area in both simulated Raman data and image processing.

Conclusion: The combination of variational approach, fractional derivatives, and Shannon entropy optimization creates an effective denoising filter that balances noise suppression with feature preservation.

Abstract: The interference of fluorescence signals and noise remains a significant challenge in Raman spectrum analysis, often obscuring subtle spectral features that are critical for accurate analysis. Inspired by variational methods similar to those used in image denoising, our approach minimizes a functional involving fractional derivatives to balance noise suppression with the preservation of essential chemical features of the signal, such as peak position, intensity, and area. The original problem is reformulated in the frequency domain through the Fourier transform, making the implementation simple and fast. In this work, we discuss the theoretical framework, practical implementation, and the advantages and limitations of this method in the context of {simulated} Raman data, as well as in image processing. The main contribution of this article is the combination of a variational approach in the frequency domain, the use of fractional derivatives, and the optimization of the {regularization parameter and} derivative order through the concept of Shannon entropy. This work explores how the fractional order, combined with the regularization parameter, affects noise removal and preserves the essential features of the spectrum {and image}. Finally, the study shows that the combination of the proposed strategies produces an efficient, robust, and easily implementable filter.

[526] Adversarial Multi-Task Learning for Liver Tumor Segmentation, Dynamic Enhancement Regression, and Classification

Xiaojiao Xiao, Qinmin Vivian Hu, Tae Hyun Kim, Guanghui Wang

Main category: eess.IV

TL;DR: MTI-Net is an end-to-end framework that simultaneously performs liver tumor segmentation, dynamic enhancement regression, and classification using multi-domain information fusion and task interaction modules.

Details

Motivation: No prior work has achieved liver tumor segmentation, dynamic enhancement regression, and classification simultaneously in an end-to-end framework due to lack of effective inter-task relevance capture and dynamic MRI information extraction mechanisms.

Method: Proposes Multi-Task Interaction adversarial learning Network (MTI-Net) with Multi-domain Information Entropy Fusion, task interaction module for higher-order consistency, task-driven discriminator, and shallow Transformer for dynamic MRI sequence relationships.

Result: Demonstrates high performance across multiple tasks on a dataset of 238 subjects, showing strong potential for clinical assessment of liver tumors.

Conclusion: MTI-Net provides an effective integrated framework for simultaneous liver tumor analysis tasks with improved dynamic MRI information extraction and inter-task synergy.

Abstract: Liver tumor segmentation, dynamic enhancement regression, and classification are critical for clinical assessment and diagnosis. However, no prior work has attempted to achieve these tasks simultaneously in an end-to-end framework, primarily due to the lack of an effective framework that captures inter-task relevance for mutual improvement and the absence of a mechanism to extract dynamic MRI information effectively. To address these challenges, we propose the Multi-Task Interaction adversarial learning Network (MTI-Net), a novel integrated framework designed to tackle these tasks simultaneously. MTI-Net incorporates Multi-domain Information Entropy Fusion (MdIEF), which utilizes entropy-aware, high-frequency spectral information to effectively integrate features from both frequency and spectral domains, enhancing the extraction and utilization of dynamic MRI data. The network also introduces a task interaction module that establishes higher-order consistency between segmentation and regression, thus fostering inter-task synergy and improving overall performance. Additionally, we designed a novel task-driven discriminator (TDD) to capture internal high-order relationships between tasks. For dynamic MRI information extraction, we employ a shallow Transformer network to perform positional encoding, which captures the relationships within dynamic MRI sequences. In experiments on a dataset of 238 subjects, MTI-Net demonstrates high performance across multiple tasks, indicating its strong potential for assisting in the clinical assessment of liver tumors. The code is available at: https://github.com/xiaojiao929/MTI-Net.

[527] Deep Parameter Interpolation for Scalar Conditioning

Chicago Y. Park, Michael T. McCann, Cristina Garcia-Cardona, Brendt Wohlberg, Ulugbek S. Kamilov

Main category: eess.IV

TL;DR: Deep Parameter Interpolation (DPI) enables neural networks to accept scalar inputs by dynamically interpolating between two parameter sets based on the scalar value, improving performance in diffusion and flow matching models.

Details

Motivation: Existing methods for incorporating scalar inputs in deep generative models either encode scalars as additional image inputs or restrict architecture choices by combining scalar and vector information in specific components, limiting flexibility.

Method: DPI maintains two learnable parameter sets within a single network and dynamically interpolates between them based on the scalar value during training and sampling, making it architecture-agnostic.

Result: DPI improves denoising performance and sample quality for both diffusion and flow matching models while maintaining computational efficiency comparable to standard scalar conditioning techniques.

Conclusion: DPI provides a simple, general-purpose method for adding scalar dependence to neural networks without restricting architecture choices, enhancing performance in generative modeling tasks.

Abstract: We propose deep parameter interpolation (DPI), a general-purpose method for transforming an existing deep neural network architecture into one that accepts an additional scalar input. Recent deep generative models, including diffusion models and flow matching, employ a single neural network to learn a time- or noise level-dependent vector field. Designing a network architecture to accurately represent this vector field is challenging because the network must integrate information from two different sources: a high-dimensional vector (usually an image) and a scalar. Common approaches either encode the scalar as an additional image input or combine scalar and vector information in specific network components, which restricts architecture choices. Instead, we propose to maintain two learnable parameter sets within a single network and to introduce the scalar dependency by dynamically interpolating between the parameter sets based on the scalar value during training and sampling. DPI is a simple, architecture-agnostic method for adding scalar dependence to a neural network. We demonstrate that our method improves denoising performance and enhances sample quality for both diffusion and flow matching models, while achieving computational efficiency comparable to standard scalar conditioning techniques. Code is available at https://github.com/wustl-cig/parameter_interpolation.

[528] Knowledge Distillation for Continual Learning of Biomedical Neural Fields

Wouter Visser, Jelmer M. Wolterink

Main category: eess.IV

TL;DR: Neural fields suffer from catastrophic forgetting when updated with new data. This paper analyzes this issue and proposes knowledge distillation to enable continual learning in neural fields, tested on cardiac MRI data.

Details

Motivation: Neural fields are used as continuous signal representations in biomedical imaging but cannot be easily extended like discrete representations. They suffer from catastrophic forgetting when presented with new data, limiting their practical application in incremental learning scenarios.

Method: The study examines catastrophic forgetting in neural fields and proposes knowledge distillation as a mitigation strategy. Experiments are conducted on cardiac cine MRI data in scenarios where data becomes available incrementally, testing when the spatiotemporal domain is enlarged or signal dimensionality increases.

Result: Experiments show that the amount of catastrophic forgetting depends significantly on the neural field model used. Knowledge distillation effectively mitigates catastrophic forgetting and enables continual learning in neural fields.

Conclusion: Knowledge distillation can successfully address catastrophic forgetting in neural fields, making continual learning possible. The effectiveness varies by model type, but distillation provides a viable strategy for extending neural fields incrementally.

Abstract: Neural fields are increasingly used as a light-weight, continuous, and differentiable signal representation in (bio)medical imaging. However, unlike discrete signal representations such as voxel grids, neural fields cannot be easily extended. As neural fields are, in essence, neural networks, prior signals represented in a neural field will degrade when the model is presented with new data due to catastrophic forgetting. This work examines the extent to which different neural field approaches suffer from catastrophic forgetting and proposes a strategy to mitigate this issue. We consider the scenario in which data becomes available incrementally, with only the most recent data available for neural field fitting. In a series of experiments on cardiac cine MRI data, we demonstrate how knowledge distillation mitigates catastrophic forgetting when the spatiotemporal domain is enlarged or the dimensionality of the represented signal is increased. We find that the amount of catastrophic forgetting depends, to a large extent, on the neural fields model used, and that distillation could enable continual learning in neural fields.

Wenwei Li, Lingyi Cai, Hui Gong, Qingming Luo, Anan Li

Main category: eess.IV

TL;DR: A deep learning framework for registering in-vivo two-photon and ex-vivo fluorescence microscopy images of neurons, addressing cross-modality appearance gaps, data scarcity, and tissue deformations through semantic-enhanced hybrid features and learnable geometric consistency.

Details

Motivation: Accurate registration of in-vivo and ex-vivo neuronal images is critical for structure-function analysis but challenged by cross-modality appearance gaps, limited annotated data, and severe tissue deformations.

Method: Uses semantic-enhanced hybrid feature descriptor combining local geometric features with DINOV3 vision foundation model, replaces RANSAC with learnable Geometric Consistency Confidence Module, and employs two-stage training with synthetic data pre-training and real data fine-tuning.

Result: Provides robust and accurate solution for high-precision registration in challenging biomedical imaging scenarios.

Conclusion: The framework enables large-scale correlative studies by overcoming key challenges in cross-modality neuronal image registration.

Abstract: Accurately registering in-vivo two-photon and ex-vivo fluorescence micro-optical sectioning tomography images of individual neurons is critical for structure-function analysis in neuroscience. This task is profoundly challenging due to a significant cross-modality appearance gap, the scarcity of annotated data and severe tissue deformations. We propose a novel deep learning framework to address these issues. Our method introduces a semantic-enhanced hybrid feature descriptor, which fuses the geometric precision of local features with the contextual robustness of a vision foundation model DINOV3 to bridge the modality gap. To handle complex deformations, we replace traditional RANSAC with a learnable Geometric Consistency Confidence Module, a novel classifier trained to identify and reject physically implausible correspondences. A data-efficient two-stage training strategy, involving pre-training on synthetically deformed data and fine-tuning on limited real data, overcomes the data scarcity problem. Our framework provides a robust and accurate solution for high-precision registration in challenging biomedical imaging scenarios, enabling large-scale correlative studies.

[530] Entropy Coding for Non-Rectangular Transform Blocks using Partitioned DCT Dictionaries for AV1

Priyanka Das, Tim Classen, Mathias Wien

Main category: eess.IV

TL;DR: This paper introduces an entropy coding method for efficiently coding transform coefficients in non-rectangular video codecs, addressing the limitations of current entropy coding schemes designed for DCT coefficients.

Details

Motivation: Current video codecs like VVC and AV1 use non-rectangular partitioning but lack proper transformation support. While a transformation technique using partitioned DCT bases shows promise, existing entropy coding schemes are not optimized for these coefficients since they're designed for regular DCT coefficients.

Method: The authors develop an entropy coding method that effectively models the properties of transform coefficients from non-rectangular partitioning, enabling efficient coding of these coefficients.

Result: The proposed entropy coding design offers significant theoretical rate savings, particularly for scenarios that differ more from traditional DCT, as estimated using conditional entropy in experimental setups.

Conclusion: The introduced entropy coding method efficiently handles transform coefficients from non-rectangular video coding, providing substantial rate savings while maintaining minimal decoder changes.

Abstract: Recent video codecs such as VVC and AV1 apply a Non-rectangular (NR) partitioning to combine prediction signals using a smooth blending around the boundary, followed by a rectangular transform on the whole block. The NR signal transformation is not yet supported. A transformation technique that applies the same partitioning to the 2D Discrete Cosine Transform (DCT) bases and finds a sparse representation of the NR signal in such a dictionary showed promising gains in an experimental setup outside the reference software. This method uses the regular inverse transformation at the decoder to reconstruct a rectangular signal and discards the signal outside the region of interest. This design is appealing due to the minimal changes required at the decoder. However, current entropy coding schemes are not well-suited for optimally encoding these coefficients because they are primarily designed for DCT coefficients. This work introduces an entropy coding method that efficiently codes these transform coefficients by effectively modeling their properties. The design offers significant theoretical rate savings, estimated using conditional entropy, particularly for scenarios that are more dissimilar to DCT in an experimental setup.

[531] LMLCC-Net: A Semi-Supervised Deep Learning Model for Lung Nodule Malignancy Prediction from CT Scans using a Novel Hounsfield Unit-Based Intensity Filtering

Tasnia Binte Mamun, Adhora Madhuri, Nusaiba Sobir, Taufiq Hasan

Main category: eess.IV

TL;DR: LMLCC-Net is a novel 3D CNN framework for classifying lung nodules in CT images using Hounsfield Unit-based intensity filtering, achieving 91.96% accuracy on LUNA16 dataset.

Details

Motivation: Early diagnosis of malignant pulmonary nodules in CT images can significantly reduce lung cancer mortality. Benign and malignant nodules have significant differences in HU intensity profiles that were not previously exploited in literature.

Method: Proposes LMLCC-Net with multiple branches using separate learnable HU-based intensity filtering stages to extract features from both intensity patterns and texture. Includes semi-supervised learning for ambiguous cases and a lightweight model variant.

Result: Achieves 91.96% classification accuracy, 92.94% sensitivity, and 94.07% AUC on LUNA16 dataset, showing improved performance compared to existing methods.

Conclusion: The proposed method can significantly help radiologists in pulmonary nodule classification and improve patient care by leveraging previously unexploited HU intensity differences between benign and malignant nodules.

Abstract: Lung cancer is the leading cause of patient mortality in the world. Early diagnosis of malignant pulmonary nodules in CT images can have a significant impact on reducing disease mortality and morbidity. In this work, we propose LMLCC-Net, a novel deep learning framework for classifying nodules from CT scan images using a 3D CNN, considering Hounsfield Unit (HU)-based intensity filtering. Benign and malignant nodules have significant differences in their intensity profile of HU, which was not exploited in the literature. Our method considers the intensity pattern as well as the texture for the prediction of malignancies. LMLCC-Net extracts features from multiple branches that each use a separate learnable HU-based intensity filtering stage. Various combinations of branches and learnable ranges of filters were explored to finally produce the best-performing model. In addition, we propose a semi-supervised learning scheme for labeling ambiguous cases and also developed a lightweight model to classify the nodules. The experimental evaluations are carried out on the LUNA16 dataset. The proposed LMLCC-Net was evaluated using the LUNA16 dataset. Our proposed method achieves a classification accuracy of 91.96%, a sensitivity of 92.94%, and an area under the curve of 94.07%, showing improved performance compared to existing methods The proposed method can have a significant impact in helping radiologists in the classification of pulmonary nodules and improving patient care.

[532] Generalizable cardiac substructures segmentation from contrast and non-contrast CTs using pretrained transformers

Aneesh Rangnekar, Nikhil Mankuzhy, Jonas Willmann, Chloe Choi, Abraham Wu, Maria Thor, Andreas Rimner, Harini Veeraraghavan

Main category: eess.IV

TL;DR: A hybrid transformer convolutional network was developed for robust cardiac substructure segmentation across varying CT imaging protocols and patient positions, achieving comparable accuracy to oracle models with 64% fewer training cases.

Details

Motivation: Automated AI segmentations deteriorate when applied to cases with different characteristics than training data, particularly in radiation treatment planning where imaging contrasts and scan positions vary.

Method: Developed hybrid transformer convolutional network trained on balanced distribution of contrast-enhanced and non-contrast CT scans from lung cancer patients, evaluated on held-out lung cancer patients and breast cancer patients in different positions.

Result: Balanced model achieved similar accuracy to oracle model (DSC: 0.82±0.10 vs 0.84±0.10 in Cohort I; 0.80±0.13 vs 0.81±0.12 in Cohort II) with 64% fewer training cases, outperforming TotalSegmentator and contrast-only models, and showed robustness to contrast and positioning variations.

Conclusion: Combining pretraining with balanced NCCT/CECT distribution enables reliable segmentation with substantially fewer labeled cases, demonstrating robust geometric and dosimetric accuracy essential for clinical deployment across varying imaging protocols.

Abstract: Automated AI segmentations for radiation treatment planning deteriorate when applied to cases with different characteristics than the training dataset. We developed a hybrid transformer convolutional network to segment cardiac substructures in lung and breast cancer patients with varying imaging contrasts and scan positions. Cohort I (56 contrast-enhanced CT [CECT], 124 non-contrast CT [NCCT] scans from lung cancer patients, supine position) was used to train an oracle model (180 cases), contrast-only model (56 CECTs), and balanced model (32 CECT, 32 NCCT). All models were evaluated on 60 held-out cohort I patients and 66 cohort II breast cancer patients (45 supine, 21 prone). Accuracy was measured using Dice similarity coefficient (DSC), 95th percentile Hausdorff distance (HD95), and dosimetric metrics, with TotalSegmentator as benchmark. Oracle and balanced models achieved similar accuracy (DSC: Oracle vs Balanced: Cohort I: 0.84 $\pm$ 0.10 vs 0.82 $\pm$ 0.10; Cohort II: 0.81 $\pm$ 0.12 vs 0.80 $\pm$ 0.13), both outperforming TotalSegmentator and the contrast-only models. The balanced model, using 64% fewer training cases, produced dosimetrically equivalent contours to manual delineations. It was robust to contrast variations (6 out of 8 substructures) and positioning variations (5 out of 8 substructures), with low correlation to patient age or body mass index. Our balanced model demonstrated robust geometric and dosimetric accuracy across varying imaging protocols and patient characteristics, which is essential for clinical deployment. Combining pretraining with balanced NCCT/CECT distribution enabled reliable segmentation with substantially fewer labeled cases than conventional approaches.

[533] DEMIST: Decoupled Multi-stream latent diffusion for Quantitative Myelin Map Synthesis

Jiacheng Wang, Hao Li, Xing Yao, Ahmad Toubasi, Taegan Vinarsky, Caroline Gheen, Joy Derwenskus, Chaoyang Jin, Richard Dortch, Junzhong Xu, Francesca Bagnato, Ipek Oguz

Main category: eess.IV

TL;DR: DEMIST synthesizes quantitative magnetization transfer (qMT) pool size ratio (PSR) maps from standard T1w and FLAIR images using a 3D latent diffusion model with three conditioning mechanisms, eliminating the need for specialized 20-30 minute qMT scans.

Details

Motivation: qMT imaging provides valuable myelin-sensitive biomarkers for multiple sclerosis assessment but requires specialized long scans (20-30 minutes), limiting clinical adoption. The goal is to generate PSR maps from standard clinical sequences.

Method: Two-stage approach: 1) Train separate autoencoders for PSR and anatomical images to learn aligned latent representations; 2) Train conditional diffusion model in latent space using frozen diffusion foundation backbone with three conditioning mechanisms: semantic tokens via cross-attention, spatial per-scale residual hints via 3D ControlNet, and adaptive LoRA-modulated attention. Includes edge-aware and alignment losses.

Result: Evaluated on 163 scans from 99 subjects using 5-fold cross-validation. Outperforms VAE, GAN and diffusion baselines on multiple metrics, producing sharper boundaries and better quantitative agreement with ground truth.

Conclusion: DEMIST successfully synthesizes high-quality PSR maps from standard clinical images, providing a practical alternative to specialized qMT scans while maintaining quantitative accuracy and preserving lesion boundaries.

Abstract: Quantitative magnetization transfer (qMT) imaging provides myelin-sensitive biomarkers, such as the pool size ratio (PSR), which is valuable for multiple sclerosis (MS) assessment. However, qMT requires specialized 20-30 minute scans. We propose DEMIST to synthesize PSR maps from standard T1w and FLAIR images using a 3D latent diffusion model with three complementary conditioning mechanisms. Our approach has two stages: first, we train separate autoencoders for PSR and anatomical images to learn aligned latent representations. Second, we train a conditional diffusion model in this latent space on top of a frozen diffusion foundation backbone. Conditioning is decoupled into: (i) \textbf{semantic} tokens via cross-attention, (ii) \textbf{spatial} per-scale residual hints via a 3D ControlNet branch, and (iii) \textbf{adaptive} LoRA-modulated attention. We include edge-aware loss terms to preserve lesion boundaries and alignment losses to maintain quantitative consistency, while keeping the number of trainable parameters low and retaining the inductive bias of the pretrained model. We evaluate on 163 scans from 99 subjects using 5-fold cross-validation. Our method outperforms VAE, GAN and diffusion baselines on multiple metrics, producing sharper boundaries and better quantitative agreement with ground truth. Our code is publicly available at https://github.com/MedICL-VU/MS-Synthesis-3DcLDM.

[534] Diffusion Algorithm for Metalens Optical Aberration Correction

Harshana Weligampola, Yuanrui Chen, Weiheng Tang, Qi Guo, Stanley H. Chan

Main category: eess.IV

TL;DR: A dual-branch diffusion model using Stable Diffusion XL to reconstruct sharp full-color images from metalens-captured grayscale structure images and distorted color cues, overcoming severe chromatic aberrations.

Details

Motivation: Metalenses suffer from severe chromatic aberrations that make image reconstruction challenging, requiring algorithmic solutions to recover sharp, full-color images from distorted inputs.

Method: Uses a dual-branch diffusion model built on pre-trained Stable Diffusion XL to fuse information from sharp grayscale structure images and distorted color cue images captured by metalens systems.

Result: Significantly outperforms existing deblurring and pansharpening methods, effectively restoring high-frequency details while accurately colorizing images.

Conclusion: The proposed algorithmic solution successfully addresses metalens chromatic aberration problems, enabling high-quality full-color image reconstruction from distorted metalens captures.

Abstract: Metalenses offer a path toward creating ultra-thin optical systems, but they inherently suffer from severe, spatially varying optical aberrations, especially chromatic aberration, which makes image reconstruction a significant challenge. This paper presents a novel algorithmic solution to this problem, designed to reconstruct a sharp, full-color image from two inputs: a sharp, bandpass-filtered grayscale structure image'' and a heavily distorted color cue’’ image, both captured by the metalens system. Our method utilizes a dual-branch diffusion model, built upon a pre-trained Stable Diffusion XL framework, to fuse information from the two inputs. We demonstrate through quantitative and qualitative comparisons that our approach significantly outperforms existing deblurring and pansharpening methods, effectively restoring high-frequency details while accurately colorizing the image.

Today’s Research Highlights

Table of Contents

cs.CL

[1] Democratizing LLM Efficiency: From Hyperscale Optimizations to Universal Deployability

[2] Harmonic Token Projection (HTP): A Vocabulary-Free, Training-Free, Deterministic, and Reversible Embedding Methodology

[3] A centroid based framework for text classification in itsm environments

[4] PIRA: Preference-Oriented Instruction-Tuned Reward Models with Dual Aggregation

[5] Structured Definitions and Segmentations for Legal Reasoning in LLMs: A Study on Indian Legal Data

[6] MindSET: Advancing Mental Health Benchmarking through Large-Scale Social Media Data

[7] Improved Visually Prompted Keyword Localisation in Real Low-Resource Settings

[8] Semantics Meet Signals: Dual Codebook Representationl Learning for Generative Recommendation

[9] Prompt Engineering Techniques for Context-dependent Text-to-SQL in Arabic

[10] Cognitive bias in LLM reasoning compromises interpretation of clinical oncology notes

[11] ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration

[12] Dynamic Template Selection for Output Token Generation Optimization: MLP-Based and Transformer Approaches

[13] ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features

[14] LLMs-Powered Accurate Extraction, Querying and Intelligent Management of Literature derived 2D Materials Data

[15] Memories Retrieved from Many Paths: A Multi-Prefix Framework for Robust Detection of Training Data Leakage in Large Language Models

[16] SAGE: An Agentic Explainer Framework for Interpreting SAE Features in Language Models

[17] Structured Prompting Enables More Robust, Holistic Evaluation of Language Models

[18] LightMem: Lightweight and Efficient Memory-Augmented Generation

[19] Length-MAX Tokenizer for Language Models

[20] Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

[21] Winning with Less for Low Resource Languages: Advantage of Cross-Lingual English_Persian Argument Mining Model over LLM Augmentation

[22] Emergence and Localisation of Semantic Role Circuits in LLMs

[23] Chatty-KG: A Multi-Agent AI System for On-Demand Conversational Question Answering over Knowledge Graphs

[24] TrackList: Tracing Back Query Linguistic Diversity for Head and Tail Knowledge in Open Large Language Models

[25] Semantic Anchors in In-Context Learning: Why Small LLMs Cannot Flip Their Labels

[26] Context-Aware Pragmatic Metacognitive Prompting for Sarcasm Detection

[27] Enhancing Burmese News Classification with Kolmogorov-Arnold Network Head Fine-tuning

[28] Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models

[29] MortgageLLM: Domain-Adaptive Pretraining with Residual Instruction Transfer, Alignment Tuning, and Task-Specific Routing

[30] Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines

[31] Can Finetuing LLMs on Small Human Samples Increase Heterogeneity, Alignment, and Belief-Action Coherence?

[32] Developing an Open Conversational Speech Corpus for the Isan Language

[33] PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark

[34] Emergent Lexical Semantics in Neural Language Models: Testing Martin’s Law on LLM-Generated Text

[35] Training Introspective Behavior: Fine-Tuning Induces Reliable Internal State Detection in a 7B Model

[36] Can LLMs extract human-like fine-grained evidence for evidence-based fact-checking?

[37] Text-to-SQL as Dual-State Reasoning: Integrating Adaptive Context and Progressive Generation

[38] Odin: Oriented Dual-module Integration for Text-rich Network Representation Learning

[39] A Systematic Study of Model Merging Techniques in Large Language Models

[40] Hierarchical Ranking Neural Network for Long Document Readability Assessment

[41] Voice, Bias, and Coreference: An Interpretability Study of Gender in Speech Translation

[42] Bangla Sign Language Translation: Dataset Creation Challenges, Benchmarking and Prospects

[43] RoParQ: Paraphrase-Aware Alignment of Large Language Models Towards Robustness to Paraphrased Questions

[44] Auxiliary Metrics Help Decoding Skill Neurons in the Wild

[45] Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining

[46] The author is dead, but what if they never lived? A reception experiment on Czech AI- and human-authored poetry

[47] Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework

[48] Revisiting Generalization Across Difficulty Levels: It’s Not So Easy

[49] Evaluating Large Language Models for Radiology Natural Language Processing

[50] Fine-grained and Explainable Factuality Evaluation for Multimodal Summarization

[51] Scaling Efficient LLMs

[52] Gram2Vec: An Interpretable Document Vectorizer

[53] A Psychology-based Unified Dynamic Framework for Curriculum Learning

[54] Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models

[55] BoundingDocs: a Unified Dataset for Document Question Answering with Spatial Annotations

[56] Position-Aware Depth Decay Decoding ($D^3$): Boosting Large Language Model Inference Efficiency

[57] Reasoning Transfer for an Extremely Low-Resource and Endangered Language: Bridging Languages Through Sample-Efficient Language Understanding

[58] The Distribution of Dependency Distance and Hierarchical Distance in Contemporary Written Japanese and Its Influencing Factors

[59] A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency

[60] Enhancing Large Language Models for Detecting Mental Manipulation via Annotation-Free Data Augmentation and Anti-Curriculum Distillation

[61] Web-Shepherd: Advancing PRMs for Reinforcing Web Agents

[62] UITron-Speech: Towards Automated GUI Agents Based on Speech Instructions

[63] The Structure-Content Trade-off in Knowledge Graph Retrieval

[64] Co-NAML-LSTUR: A Combined Model with Attentive Multi-View Learning and Long- and Short-term User Representations for News Recommendation

[65] On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey

[66] CAMERA: Multi-Matrix Joint Compression for MoE Models via Micro-Expert Redundancy Analysis

[67] Uncovering Implicit Bias in Large Language Models with Concept Learning Dataset

[68] Exploring Cross-Lingual Knowledge Transfer via Transliteration-Based MLM Fine-Tuning for Critically Low-resource Chakma Language

[69] Prompt-R1: Collaborative Automatic Prompting Framework via End-to-end Reinforcement Learning

[70] AdvancedIF: Rubric-Based Benchmarking and Reinforcement Learning for Advancing LLM Instruction Following

[71] Mem-PAL: Towards Memory-based Personalized Dialogue Assistants for Long-term User-Agent Interaction

[72] AICC: Parse HTML Finer, Make Models Better – A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser

[73] CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning

[74] DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

[75] A Systematic Analysis of Large Language Models with RAG-enabled Dynamic Prompting for Medical Error Detection and Correction

[76] MTA: A Merge-then-Adapt Framework for Personalized Large Language Model

[77] BengaliFig: A Low-Resource Challenge for Figurative and Culturally Grounded Reasoning in Bengali